This past December I contributed an article called Frontend SPOF in Beijing to PerfPlanet’s Performance Calendar. I hope that everyone who reads my blog also read the Performance Calendar – it’s an amazing collection of web performance articles and gurus. But in case you don’t I’m cross-posting it here. I saw a great presentation from Pat Meenan about frontend SPOF and want to raise awareness around this issue. This post contains some good insights.
Make sure to read PerfPlanet – it’s a great aggregator of WPO blog posts.
Now – flash back to December 2011…
I’m at Velocity China in Beijing as I write this article for the Performance Calendar. Since this is my second time to Beijing I was better prepared for the challenges of being behind the Great Firewall. I knew I couldn’t access popular US websites like Google, Facebook, and Twitter, but as I did my typical surfing I was surprised at how many other websites seemed to be blocked.
Business Insider
It didn’t take me long to realize the problem was frontend SPOF – when a frontend resource (script, stylesheet, or font file) causes a page to be unusable. Some pages were completely blank, such as Business Insider:Firebug’s Net Panel shows that
anywhere.js
is taking a long time to download because it’s coming from platform.twitter.com
– which is blocked by the firewall. Knowing that scripts block rendering of all subsequent DOM elements, we form the hypothesis that anywhere.js
is being loaded in blocking mode in the HEAD. Looking at the HTML source we see that’s exactly what is happening:<head> ... <!-- Twitter Anywhere --> <script src="https://platform.twitter.com/anywhere.js?id=ZV0...&v=1" type="text/javascript"></script> <!-- / Twitter Anywhere --> ... </head> ... <body>If
anywhere.js
had been loaded asynchronously this wouldn’t happen. Instead, since anywhere.js
is loaded the old way with <SCRIPT SRC=...
, it blocks all the DOM elements that follow which in this case is the entire BODY of the page. If we wait long enough the request for anywhere.js
times out and the page begins to render. How long does it take for the request to timeout? Looking at the “after” screenshot of Business Insider we see it takes 1 minute and 15 seconds for the request to timeout. That’s 1 minute and 15 seconds that the user is left staring at a blank white screen waiting for the Twitter script!CNET
CNET has a slightly different experience; the navigation header is displayed but the rest of the page is blocked from rendering:Looking in Firebug we see that
wrapper.js
from cdn.eyewonder.com
is “pending” – this must be another domain that’s blocked by the firewall. Based on where the rendering stops our guess is that the wrapper.js
SCRIPT tag is immediately after the navigation header and is loaded in blocking mode thus preventing the rest of the page from rendering. The HTML confirms that this is indeed what’s happening:<header> ... </header> <script src="http://cdn.eyewonder.com/100125/771933/1592365/wrapper.js"></script> <div id="rb_wrap"> <div id="rb_content"> <div id="contentMain">
O’Reilly Radar
Everyday I visit O’Reilly Radar to read Nat Torkington’s Four Short Links. Normally Nat’s is one of many stories on the Radar front page, but going there from Beijing shows a page with only one story:At the bottom of this first story there’s supposed to be a Tweet button. This button is added by the
widgets.js
script fetched from platform.twitter.com
which is blocked by the Great Firewall. This wouldn’t be an issue if widgets.js
was fetched asynchronously, but sadly a peek at the HTML shows that’s not the case:<a href="...">Comment</a> | <span class="social-counters"> <span class="retweet"> <a href="http://twitter.com/share" class="twitter-share-button" data-count="horizontal" data-url="http://radar.oreilly.com/2011/12/four-short-links-6-december-20-1.html" data-text="Four short links: 6 December 2011" data-via="radar" data-related="oreillymedia:oreilly.com">Tweet</a> <script src="http://platform.twitter.com/widgets.js" type="text/javascript"></script> </span>
The cause of frontend SPOF
One possible takeaway from these examples might be that frontend SPOF is specific to Twitter and eyewonder and a few other 3rd party widgets. Sadly, frontend SPOF can be caused by any 3rd party widget, and even from the main website’s own scripts, stylesheets, or font files.Another possible takeaway from these examples might be to avoid 3rd party widgets that are blocked by the Great Firewall. But the Great Firewall isn’t the only cause of frontend SPOF – it just makes it easier to reproduce. Any script, stylesheet, or font file that takes a long time to return has the potential to cause frontend SPOF. This typically happens when there’s an outage or some other type of failure, such as an overloaded server where the HTTP request languishes in the server’s queue for so long the browser times out.
The true cause of frontend SPOF is loading a script, stylesheet, or font file in a blocking manner. The table in my frontend SPOF blog post shows when this happens. It’s really the website owner who controls whether or not their site is vulnerable to frontend SPOF. So what’s a website owner to do?
Avoiding frontend SPOF
The best way to avoid frontend SPOF is to load scripts asynchronously. Many popular 3rd party widgets do this by default, such as Google Analytics, Facebook, and Meebo. Twitter also has an async snippet for the Tweet button that O’Reilly Radar should use. If the widgets you use don’t offer an async version you can try Stoyan’s Social button BFFs async pattern.Another solution is to wrap your widgets in an iframe. This isn’t always possible, but in two of the examples above the widget is eventually served in an iframe. Putting them in an iframe from the start would have avoided the frontend SPOF problems.
For the sake of brevity I’ve focused on solutions for scripts. Solutions for font files can be found in my @font-face and performance blog post. I’m not aware of much research on loading stylesheets asynchronously. Causing too many reflows and FOUC are concerns that need to be addressed.
Call to action
Business Insider, CNET, and O’Reilly Radar all have visitors from China, and yet the way their pages are constructed delivers a bad user experience where most if not all of the page is blocked for more than a minute. This isn’t a P2 frontend JavaScript issue. This is an outage. If the backend servers for these websites took 1 minute to send back a response, you can bet the DevOps teams at Business Insider, CNET, and O’Reilly wouldn’t sleep until the problem was fixed. So why is there so little concern about frontend SPOF?Frontend SPOF doesn’t get much attention – it definitely doesn’t get the attention it deserves given how easily it can bring down a website. One reason is it’s hard to diagnose. There are a lot of monitors that will start going off if a server response time exceeds 60 seconds. And since all that activity is on the backend it’s easier to isolate the cause. Is it that pagers don’t go off when clientside page load times exceed 60 seconds? That’s hard to believe, but perhaps that’s the case.
Perhaps it’s the way page load times are tracked. If you’re looking at worldwide medians, or even averages, and China isn’t a major audience your page load time stats might not exceed alert levels when frontend SPOF happens. Or maybe page load times are mostly tracked using synthetic testing, and those user agents aren’t subjected to real world issues like the Great Firewall.
One thing website owners can do is ignore frontend SPOF until it’s triggered by some future outage. A quick calculation shows this is a scary choice. If a 3rd party widget has a 99.99% uptime and a website has five widgets that aren’t async, the probability of frontend SPOF is 0.05%. If we drop uptime to 99.9% the probability of frontend SPOF climbs to 0.5%. Five widgets might be high, but remember that “third party widget” includes ads and metrics. Also, the website’s own resources can cause frontend SPOF which brings the number even higher. The average website today contains 14 scripts any of which could cause frontend SPOF if they’re not loaded async.
Frontend SPOF is a real problem that needs more attention. Website owners should use async snippets and patterns, monitor real user page load times, and look beyond averages to 95th percentiles and standard deviations. Doing these things will mitigate the risk of subjecting users to the dreaded blank white page. A chain is only as strong as its weakest link. What’s your website’s weakest link? There’s a lot of focus on backend resiliency. I’ll wager your weakest link is on the frontend.
[Originally posted as part of PerfPlanet's Performance Calendar 2011.]