A lightweight HTML cache for a Webfusion shared service

This article takes the optimisation of my blog engine to the next stage.  It develops some of the thoughts and concepts that I’ve previously discussed in the following articles:

Given the overhead of initiating a PHP process and compiling a PHP script in a suPHP-based shared environment, the most efficient script usually turns out to be one that isn’t invoked at all.  Page caching technologies such as Squid can provide significant throughput benefits in PHP applications where the content is largely static.  A good example of this use on a large scale is Wikipedia, whose article pages are largely static for general users: Wikipedia uses clusters of Squid caches to serve the vast majority of page requests from general users, and thus avoids the process overhead of rendering the pages from within a PHP script.  Unfortunately, solutions like Squid are only appropriate for large-scale applications such as Wikipedia, so what I want to discuss here is an approach appropriate for shared-service applications such as my blog, where the page content is also largely static.

In the case of a blog application, the user population can be divided into three groups: (i) general users who are reading blog content; (ii) bots such as googlebot which are scanning it for search purposes; and (iii) the authors (in this case me) who will also want to carry out more advanced functions such as content editing and management.  So the significant majority of page requests return piecewise static HTML content, and since I use a common templating engine to generate all blog page output, I can exploit this fact with some minor changes to implement such cache functionality:
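By way of illustration, a cache-priming hook in the templating engine might look something like the following sketch.  The function names (renderPage(), isCacheablePage(), renderWithCache()) and the page-name test are my inventions standing in for the real engine’s API, not the actual code:

```php
<?php
// Sketch of cache priming via output capture.  All names here are
// hypothetical stand-ins for the blog engine's real API.

function renderPage(string $page): void
{
    // Stand-in for the real templating engine's output.
    echo "<html><body>Content of $page</body></html>";
}

function isCacheablePage(string $page): bool
{
    // Mirror the rewrite conditions: only guest-visible static pages.
    return preg_match('/^(article-\d+|index|sitemap|search-\w+|rss-[0-9a-z]*)$/', $page) === 1;
}

function renderWithCache(string $page, string $cacheDir): string
{
    ob_start();                 // capture the templated output
    renderPage($page);
    $html = ob_get_clean();

    if (isCacheablePage($page)) {
        // Write atomically (temp file + rename) so a concurrent request
        // never sees a half-written cache file.
        $tmp = tempnam($cacheDir, 'tmp');
        file_put_contents($tmp, $html);
        rename($tmp, "$cacheDir/$page.html");
    }
    return $html;               // still echoed to the priming request
}
```

The key point is that the script’s response is unchanged; the cache file is simply a by-product of the first guest request for that page.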

So what should and shouldn’t I cache?

Clearly I can’t cache any pages which:

Hence I have to limit myself to GET requests from guest (not logged-in) users for basic page access (without URL parameters).  However, I already use rewrite rules to mask requests in the form <function>-<subfunction> for common requests (for example http://blog.ellisons.org.uk/article-45 requests the page for this article), so I can still cache the great majority of page requests to my blog.  The home and article pages can be cached, and these involve a lot of database access and text processing (in the preparation of articles for display).  Likewise, I can also cache the quasi-static keyword search, root archive list and sitemap.xml pages.
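For readers who haven’t seen the earlier article, such a masking rule might look roughly like this (the script name and query parameter names here are my guesses for illustration, not the blog’s actual interface):

```apache
# Hypothetical sketch of function-subfunction masking: map requests such
# as /blog/article-45 onto the blog script with query parameters.
RewriteRule ^blog/(\w+)-(\w+)$   /blog/index.php?function=$1&subfunction=$2   [L,QSA]
```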

I also checked my access logs to look at page request statistics and realised that I also need to cache the RSS request pages.

The changes to the .htaccess file

Here is the piece of rewrite magic which implements the cache:

# Guests only (no blog_user login cookie), and GET requests only
RewriteCond %{HTTP_COOKIE}           !blog_user
RewriteCond %{REQUEST_METHOD}        =GET                      [NC]
# The URI must end in one of the cacheable page forms
RewriteCond %{REQUEST_URI}           /(article-\d+|index.html|index|sitemap.xml|search-\w+|rss-[0-9a-z]*)$
# Re-match to capture the extensionless page name into %1
RewriteCond %{REQUEST_URI}           /(article-\d+|index|sitemap|search-\w+|rss-[0-9a-z]*)
# The corresponding cache file must exist
RewriteCond %{DOCUMENT_ROOT}/b/html_cache/%1.html -f
# If all conditions hold, map the request internally to the cached file
RewriteRule ^blog/([^.]*).*$        /b/html_cache/$1.html     [L]

If you have read my earlier .htaccess article, then you will recall that Webfusion does not decode sub-domains directly, so you need to use a rewrite rule to do this.  In my case my blog sub-domain gets mapped onto a blog pseudo subdirectory.  This rule applies to any request to this pseudo subdirectory (the RewriteRule directive), and subject to the conditions, it does an internal map to the corresponding cached HTML file:

Hence if you are a guest reading this article, then you will probably be viewing the cached version pulled from an internal redirect to /websites/LinuxPackage02/el/li/so/ellisons.org.uk/public_html/b/html_cache/article-45.html.

Yes, this is a somewhat complex rewrite rule, and yes, it involves some Apache overhead, but this is tiny compared to that of loading the PHP compiler/interpreter under suPHP and running the script.

Cache coherency

As I discussed above, the HTML cache can contain articles, the top-level index, the sitemap XML and keyword searches.  Updates to any of these can make cache content stale, and therefore the update functions may need to trigger the deletion of cache files.  I adopt a lazy approach to refilling the cache: that is, updates do not trigger the recreation of previously cached pages; this is left to the next request, which will invoke the PHP script and recreate the file.

To keep the coherency logic simple, I have not adopted an absolutely minimal approach to cache entry deletion.  For example, the index page includes a “taster” extract of the most recent N articles, so updating an article could potentially make the cached index page stale.  I could put in a check to ensure that the updated article is one of the most recent N, or even a check to see whether the extract that appears on the index page has actually changed.  Nope, this just isn’t worth the complication.  If I change any article, then I delete the cached index.  At worst this generates one extra script invocation per article change or creation to recreate it.  This isn’t a material cost given that I typically make at most one such change per day.  Hence I have implemented the following cache purge rules:

Note that article creation does not immediately impact the cache, since creation is an admin function and new articles are initially hidden from general users.  Unhiding an article is an attribute change, and this purges the entire cache.
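The purge logic itself amounts to little more than a handful of unlink() calls.  As an illustrative sketch only: purgeCache() is my name, not the engine’s, and the exact set of files purged on an article edit (index, searches, sitemap, RSS feeds) is my assumption about what embeds article extracts:

```php
<?php
// Hypothetical purge helper.  The cache layout matches the
// <function>-<subfunction>.html files that the rewrite rule serves.

function purgeCache(string $cacheDir, ?int $articleId = null): void
{
    if ($articleId !== null) {
        // An article edit invalidates that article plus (by assumption)
        // every page that may embed an extract of it.
        $stale = array_merge(
            ["$cacheDir/article-$articleId.html",
             "$cacheDir/index.html",
             "$cacheDir/sitemap.html"],
            glob("$cacheDir/search-*.html") ?: [],
            glob("$cacheDir/rss-*.html") ?: []
        );
    } else {
        // Attribute changes (e.g. unhiding an article) purge everything.
        $stale = glob("$cacheDir/*.html") ?: [];
    }
    foreach ($stale as $file) {
        if (is_file($file)) {
            unlink($file);      // lazy refill: the next request recreates it
        }
    }
}
```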

Etags and all that

Remember that client-side caching within the user’s browser is also an important factor in server load and response times.  The cached HTML has its Etags automatically generated by Apache to facilitate this.  I drop the INode component from the default setting and instead use FileETag MTime Size.  My reason is that the shared service is hosted by a cluster, and the inode representation might vary depending on the serving node, resulting in spurious Etag misses.
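The corresponding directive in the .htaccess file is a one-liner:

```apache
# Apache's default is "INode MTime Size"; dropping INode keeps the Etag
# stable whichever cluster node happens to serve the file.
FileETag MTime Size
```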

Nor do I bother recreating this Etag for the scripted output, which means that for the user whose request primes the cache, his or her browser will end up doing a full download on the next view, as the Etags will differ.  Tough.

Also, because the cached HTML can now be flushed (e.g. when a user adds a comment), I’ve dropped the ExpiresByType option for HTML files; any browser cache validation will now be via Etag revalidation.