Planning to Scale

At Global Radio we recently relaunched Heart FM, which is now a single site bringing together 33 local Heart station websites, where previously there were 33 individual sites. To achieve this the team refactored our in-house CMS to handle these localisations in as sane a way as possible. The aim was that our editors could easily manage and share content across these sites, and easily localise any aspect of a page where one or more stations want to differ from the norm.

This resulted in a fairly large scale refactoring of the CMS to clean up and simplify the logic of the system. The end result of that work is that an editor can now localise, schedule or symlink (share across pages / locations) any aspect of the page. This has given us an extremely flexible and powerful system, but it has also thrown up a number of challenges - one of which is how to make the website scale effectively, and this is the focus of the post.

Targets

The first rule of scaling is to have something to aim for:

"Premature optimization is the root of all evil" - without a clear target, your optimisations are likely to focus on the wrong areas.

I remember people early in my career discussing large databases, and the differences in perception of what "large" is can be huge. So before you begin to optimise your site, you should at least know what you want to aim for, otherwise you are likely to waste time, effort and money and not focus on what actually matters. For Heart we know we have to deal with traffic surges similar to the "digg effect", so our aim is to be able to deliver content during these periods of extreme traffic.

Remove bottlenecks

For our sites the main bottleneck is the database. This is no surprise, as databases are often a bottleneck in dynamic websites, and a number of approaches can be taken to minimise their impact. At Global we have written a flexible CMS that is complex in that it allows n levels of nesting, scheduling, localising and sharing of modules on a page. But on the whole our sites don't produce complex, dynamic output. There are elements of that, but it's mostly a magazine / news style site, with hub pages highlighting articles, competitions and galleries within the site.

During the refactoring, as features neared completion in the CMS, we used Rob Hudson's Django Debug Toolbar to keep an eye on query count. We kept the number of queries down using a number of approaches: where possible we cached and memoized, ensured that we used the ORM efficiently and made sure that we didn't repeat queries over and over. Daniel Roseman has written some great posts on Django Patterns, sharing insights learnt whilst refactoring the CMS and improving its performance, so I won't repeat them here.
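As a rough illustration of the memoization mentioned above (this is not the CMS's actual code), the sketch below caches the result of a lookup for the lifetime of one request, so repeated calls with the same arguments only hit the database once. The names `RequestCache` and `fetch_module`, and the fake query counter, are assumptions made for the example.

```python
class RequestCache:
    """Memoize lookups for the lifetime of a single request (hypothetical helper)."""

    def __init__(self, loader):
        self._loader = loader   # function that actually hits the database
        self._results = {}      # argument tuple -> cached result

    def get(self, *key):
        if key not in self._results:
            self._results[key] = self._loader(*key)
        return self._results[key]


# Stand-in for an ORM query; counts how often the "database" is hit.
query_count = 0


def fetch_module(page_id, slot):
    global query_count
    query_count += 1
    return {"page": page_id, "slot": slot}


cache = RequestCache(fetch_module)
for _ in range(3):
    cache.get(1, "sidebar")   # only the first call runs the query
cache.get(1, "footer")        # different key, so a second query runs
print(query_count)
```

Creating one such cache per request keeps the memoized data from going stale between requests, at the price of not sharing results across them.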

Another bottleneck, again surfaced by the debug toolbar, was the number of templates it could take to render a page. As this problem had been identified and fixed in Django 1.2 with a relatively small patch, we backported it to our version of Django.

Caching

Caching is incredibly important, as most of the data on our pages is relatively static. Initially we looked at baking all the static parts of the pages to disk on creation or change of data in the CMS. This looked ideal: with that in place we could use these pages as templates themselves and render only the dynamic parts, saving many queries and bypassing lots of logic. However, due to the flexibility of the CMS, once you factor in the localisation, symlinking and scheduling of modules on a page, it wasn't as simple and robust an approach as hoped. We binned it, as it wasn't worth the cost of building and maintaining. I would still recommend this approach if it's feasible, as it makes sense to keep as much static stuff static. There is no need to force every request to rebuild part, or sometimes all, of the page if the data never or rarely changes.

We liked the idea of caching the static parts and rendering the dynamic parts of the page, and time-based caching provided a simple and easy solution. However, when dealing with the "digg effect" our aim is to always serve the page. To help us do that, we decided that if the dynamic parts were failing under load, serving the cached page without the dynamic elements was preferable to serving a "fail whale".

Caching with Nginx

Nginx, Server Side Includes (SSI), proxies and proxy caching provided a solution to managing high load out of the box. The proxy_cache is faster than using memcached and can be made semi-persistent. SSIs provide a mechanism to replace dynamic blocks in the cached page, and even allow you to handle the case where rendering an SSI fails for any reason. For us and our site it's a great solution for keeping the site up.
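To make the idea concrete, here is a small Python simulation of what the SSI step does conceptually: each include marker in the cached page is swapped for a freshly rendered fragment, and if rendering fails a stub is used instead. This is only a sketch of the behaviour, not how Nginx implements it; `render_page`, the two backend functions and the sample markup are all made up for the example.

```python
import re

# Matches include directives such as <!--# include virtual="/some/path" -->
INCLUDE_RE = re.compile(r'<!--#\s*include\s+virtual="([^"]+)"\s*-->')


def render_page(cached_html, fetch_fragment, stub=""):
    """Replace each include marker with its fragment, or with the stub on failure."""

    def replace(match):
        try:
            return fetch_fragment(match.group(1))
        except Exception:
            return stub   # dynamic part failed: degrade instead of erroring

    return INCLUDE_RE.sub(replace, cached_html)


def healthy_backend(path):
    return "<span>Welcome back</span>"


def failing_backend(path):
    raise ConnectionError("backend overloaded")


page = '<header><!--# include virtual="/dynamic/user_status_bar/" --></header><p>Article</p>'
ok_html = render_page(page, healthy_backend)
degraded_html = render_page(page, failing_backend)
print(ok_html)
print(degraded_html)
```

The point of the design is that the cached article markup is served in both cases; only the small dynamic block changes or disappears.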

Here's the request / response flow:

[Figure: request / response flow diagram]

Shamelessly adapted from James Gardener (SSI, Memcached and Nginx (plus Varnish, ESI and static generation))

With Nginx we set the cache timeout to be low but, more importantly, we keep the cache on disk for a longer period of time, so if anything happens to the Django backend we can still serve:

[Image: Nginx cache configuration]
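The original configuration is not reproduced here, but the policy it describes (a short freshness window plus a much longer retention period used as a fallback) can be sketched in Python. In Nginx terms this roughly corresponds to `proxy_cache_valid` for freshness, the `inactive` parameter of `proxy_cache_path` for retention, and `proxy_cache_use_stale` for serving old copies on backend errors. The class name, the timings and the two backend functions below are assumptions for illustration only.

```python
class StaleOnErrorCache:
    """Serve fresh copies briefly, but keep old copies as a fallback (sketch)."""

    def __init__(self, fresh_for, keep_for):
        self.fresh_for = fresh_for   # seconds a copy is served without asking the backend
        self.keep_for = keep_for     # seconds a copy is kept around as a fallback
        self._store = {}             # url -> (stored_at, body)

    def get(self, url, backend, now):
        entry = self._store.get(url)
        if entry and now - entry[0] < self.fresh_for:
            return entry[1]                      # fresh hit, backend not touched
        try:
            body = backend(url)
        except Exception:
            if entry and now - entry[0] < self.keep_for:
                return entry[1]                  # backend down: serve the stale copy
            raise                                # nothing usable to fall back on
        self._store[url] = (now, body)
        return body


def backend_up(url):
    return "page v1"


def backend_down(url):
    raise ConnectionError("backend is down")


cache = StaleOnErrorCache(fresh_for=60, keep_for=86400)
first = cache.get("/", backend_up, now=0)         # miss: fetched from the backend
stale = cache.get("/", backend_down, now=3600)    # expired and backend down: stale copy
print(first, stale)
```

Passing `now` explicitly keeps the example deterministic; a real cache would read the clock itself.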

Conclusions

This gives us a robust solution that lets us handle large surges in traffic and keeps us resilient if the backend goes away for any reason. There is a cost: we lose some dynamic parts of the site. Whether or not we have found the correct solution for us only time will tell, but for now we've solved a problem.

Nginx how-to: Server Side Include (SSI) debugging

Server Side Includes (SSI) are a great feature of Nginx, allowing you to cache the core content of a page but dynamically replace any blocks that are dynamic, for example login links / welcome text. But they can hold some gotchas, so here's a post on debugging SSI, or debugging Nginx configs in general.

I spent a morning debugging Nginx when SSIs weren't acting as expected - they were missing, including the stubs, and I was losing markup on the page. When dealing with SSIs, here are two key rules:

  • SSIs must be well formed - otherwise funny things happen!
  • Debugging isn't always easy, but patience is key!

SSIs must be well formed

Funny things happen when your SSI syntax is malformed! But finding the cause isn't always easy.

What's wrong with this snippet?

<!--# block name="default_message" -->  <p>Sorry, this feature is currently unavailable</p><!-- endblock --><!--# include virtual="/dynamic/footer-links" stub="default_message" -->

The result of the above snippet is that all content / markup after it is truncated!

Debugging SSIs

Firstly, turn off Silent Errors in your config like so:

ssi_silent_errors off;

If there has been an error with the SSI, you should now see it output in the browser as "[an error occurred while processing the directive]".

If you are using a proxy, check the proxy access logs to ensure that the requests are being made by Nginx! If they are, ensure that they are rendering the expected output.

The next step is to check what Nginx is doing. Turn on debugging and logging of SSIs in your Nginx config:

error_log       /var/log/nginx/error.log debug;
log_subrequest  on;

The error log will now start to report all logic steps and actions for requests made to it - they are quite extensive! You are looking for all SSI includes:

... [debug] 1545#0: *2 ssi include: "/dynamic/user_status_bar/"
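Because the debug log is so verbose, it can help to filter it down to just the include lines. A minimal sketch, assuming the log lines look like the one above; the function name `ssi_includes` is made up, and the second sample line is a placeholder rather than real Nginx output.

```python
import re

# Captures the path from debug lines of the form: ssi include: "/some/path"
SSI_INCLUDE_RE = re.compile(r'ssi include: "([^"]+)"')


def ssi_includes(log_lines):
    """Return the paths of every SSI include reported in the given log lines."""
    paths = []
    for line in log_lines:
        match = SSI_INCLUDE_RE.search(line)
        if match:
            paths.append(match.group(1))
    return paths


sample = [
    '... [debug] 1545#0: *2 ssi include: "/dynamic/user_status_bar/"',
    '... [debug] 1545#0: *2 (placeholder for an unrelated debug line)',
]
found = ssi_includes(sample)
print(found)
```

Against a real log you could pass an open file object, for example `ssi_includes(open("/var/log/nginx/error.log"))`, and compare the result with the includes you expect the page to contain.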

If your SSI include isn't listed, then check your markup! It is probably malformed like mine above. I was missing a # from the endblock tag; it should have read:

<!--# block name="default_message" -->  <p>Sorry, this feature is currently unavailable</p><!--# endblock --><!--# include virtual="/dynamic/footer-links" stub="default_message" -->

That simple typo cost me a morning, so check your markup! What was actually happening was that Nginx parsed the SSI block directive but couldn't find the matching endblock, so it silently errored and discarded the rest of the content.
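A quick check like the following would have caught the typo before it reached Nginx. It is only a sketch: it counts `block` directives that never get a matching `endblock`, and knows nothing else about SSI syntax. The function name `unclosed_blocks` is an assumption; the two test strings are the broken and fixed snippets from this post.

```python
import re

# Captures the directive name from any comment that starts with <!--#
DIRECTIVE_RE = re.compile(r'<!--#\s*(\w+)')


def unclosed_blocks(markup):
    """Count SSI block directives that have no matching endblock directive."""
    depth = 0
    for name in DIRECTIVE_RE.findall(markup):
        if name == "block":
            depth += 1
        elif name == "endblock" and depth > 0:
            depth -= 1
    return depth


broken = ('<!--# block name="default_message" -->  <p>Sorry, this feature is currently '
          'unavailable</p><!-- endblock --><!--# include virtual="/dynamic/footer-links" '
          'stub="default_message" -->')
fixed = broken.replace("<!-- endblock -->", "<!--# endblock -->")

print(unclosed_blocks(broken))
print(unclosed_blocks(fixed))
```

The broken snippet leaves one block open because `<!-- endblock -->` is just an ordinary HTML comment, which is exactly why Nginx kept looking for the end of the block.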