A lightweight HTML cache for a Webfusion shared service

This article takes the optimisation of my blog engine to the next stage.  It develops some of the thoughts and concepts that I’ve discussed in my earlier articles on this subject.

Given the overhead of initiating a PHP process and compiling a PHP script in a suPHP-based shared environment, the most efficient script usually turns out to be the one that isn’t invoked at all.  Page caching technologies such as Squid can provide significant throughput benefits for PHP applications where the content is largely static.  A good example of this use on a large scale is Wikipedia, whose article pages are largely static for general users: Wikipedia uses clusters of Squid caches to serve the vast majority of page requests from such users, and thus avoids the process overhead of rendering the pages from within a PHP script.  Unfortunately, solutions like Squid are only appropriate for large-scale applications, so what I want to discuss here is an approach suitable for shared-hosting applications such as my blog, where the page content is also largely static.

In the case of a blog application, the user population can be divided into three groups: (i) general users who are reading blog content; (ii) bots such as googlebot which are scanning it for search purposes; and (iii) the authors (in this case me) who also want to carry out more advanced functions such as content editing and management.  The significant majority of page requests therefore return piecewise static HTML content, and since I use a common templating engine to generate all blog page output, only minor changes are needed to implement the cache functionality:

  • The output function of the templating engine generates an HTML file copy of any page that it outputs if: (i) the HTML file doesn’t already exist; (ii) the requesting user is a guest; (iii) the request does not include any parameters.  (See the sketch after this list.)
  • The .htaccess file’s redirect logic checks for the existence of the HTML page if the request has no parameters.  If the page exists then an internal redirect is made to the page, otherwise to the PHP script.
  • Any update function deletes any cached HTML pages that are rendered stale by the update, and these will then be recreated by the appropriate PHP script on the next access.
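
Here is a minimal sketch of what this output hook might look like in PHP.  The function and variable names are illustrative only, not the engine’s actual API:

function outputPage( $html, $page ) {
    $cacheFile = $_SERVER['DOCUMENT_ROOT'] . "/b/html_cache/$page.html";
    // Cache only if: (i) no cached copy exists; (ii) the user is a guest
    // (no blog_user cookie); (iii) the request carries no parameters.
    if( !file_exists( $cacheFile ) &&
        !isset( $_COOKIE['blog_user'] ) &&
        count( $_GET ) == 0 && $_SERVER['REQUEST_METHOD'] == 'GET' ) {
        // Write to a temporary file then rename into place, so that a
        // half-written file is never picked up by the .htaccess rules.
        $tmpFile = tempnam( dirname( $cacheFile ), 'cache' );
        file_put_contents( $tmpFile, $html );
        chmod( $tmpFile, 0644 );   // tempnam creates the file mode 0600
        rename( $tmpFile, $cacheFile );
    }
    echo $html;
}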

So what should and shouldn’t I cache?

Clearly I can’t cache any pages which:

  • Need POST processing, since such requests must always be executed by a script in order to carry out the processing.
  • Include request parameters which might impact the displayed content, since I can’t easily bind any parameters into the cache filename within .htaccess rewrite rules.
  • Include output that depends on session context.  This is the case for content authors and administrators, but they must log in to access such functions, and their session context is maintained in a couple of cookies (one of which is the cookie “blog_user”).

Hence I have to limit myself to GET requests from guest (not logged-in) users for basic page access (without URL parameters).  However, I already use rewrite rules to mask requests in the form <function>-<subfunction> for common requests (for example /article/blog-sw/a-lightweight-html-cache-for-a-webfusion-shared-service/ requests the page for this article), so I can still cache the great majority of page requests to my blog.  The home and article pages can be cached, and these involve a lot of database access and text processing (in the preparation of articles for display).  Likewise, I can also cache the quasi-static keyword search, root archive list and sitemap.xml pages.

I also checked my access logs for page request statistics and realised that I need to cache the RSS feed pages as well.

The changes to the .htaccess file

Here is the piece of rewrite magic which implements the cache:

RewriteCond %{HTTP_COOKIE}           !blog_user
RewriteCond %{REQUEST_METHOD}        =GET                      [NC]
RewriteCond %{REQUEST_URI}           /(article-\d+|index\.html|index|sitemap\.xml|search-\w+|rss-[0-9a-z]*)$
RewriteCond %{REQUEST_URI}           /(article-\d+|index|sitemap|search-\w+|rss-[0-9a-z]*)
RewriteCond %{DOCUMENT_ROOT}/b/html_cache/%1.html -f
RewriteRule ^blog/([^.]*).*$        /b/html_cache/$1.html     [L]

If you have read my earlier .htaccess article, then you will recall that Webfusion does not decode sub-domains directly, so you need a rewrite rule to do this; in my case my blog sub-domain gets mapped onto a blog pseudo subdirectory.  The RewriteRule directive therefore applies to any request to this pseudo subdirectory and, subject to the conditions, makes an internal map to the corresponding cached HTML file:

  • Condition 1 excludes any requests from logged-on users.
  • Condition 2 ensures that only GET requests are processed.
  • Conditions 3 and 4 limit the requests to articles, the home page, the site map, any keyword searches and RSS feeds.  The reason for the two conditions is that the second picks up the filename root in %1, as I want to exclude the extension in the case of index.html and sitemap.xml.
  • Condition 5 checks for the existence of the cache file.  I do this last to avoid the overhead of the file-existence check when the request has already been rejected for one of the other reasons.

Hence if you are a guest reading this article, then you will probably be viewing the cached version pulled from an internal redirect to /websites/LinuxPackage02/el/li/so/ellisons.org.uk/public_html/b/html_cache/article-45.html.

Yes, this is a somewhat complex rewrite rule, and yes, it involves an Apache overhead, but this is tiny compared to that of loading the PHP compiler/interpreter under suPHP and running the script.

Cache coherency

As I discussed above, the HTML cache can contain articles, the top-level index, the sitemap XML and keyword searches.  Updates to any of these can make cache content stale, and therefore update functions may need to trigger the deletion of cache files.  I adopt a lazy approach to refilling the cache: updates do not trigger the recreation of previously cached pages; this is left to the next request, which falls through to the PHP script and recreates the file.

To keep the coherency logic simple, I have not adopted an absolutely minimal approach to cache entry deletion.  For example, the index page includes a “taster” extract of the most recent N articles, so updating an article could potentially make the cached index page stale.  I could put in a check to ensure that the updated article is one of the most recent N, or even a check to see whether the extract that appears on the index page has actually changed.  Nope, this just isn’t worth the complication.  If I change any article, then I delete the index.  At worst this generates one extra script invocation to recreate the index following an article change or creation, which isn’t a material cost given that I typically make at most one such change per day.  Hence I have implemented the following cache purge rules:

  • Article comment addition.  Purge the cached article, and the cached sitemap.  (Comments aren’t listed on the index so there is no need to purge the index page.)
  • Article content update.  Purge the cached article, the cached index, the cached sitemap.
  • Any article attribute change.  Purge the entire cache.  The article attributes include the title, creation date/timestamp, hidden flag, etc., and these impact the article ordering in every sidebar.

Note that article creation does not immediately impact the cache: article creation is an admin function, and new articles are initially hidden from general users.  Unhiding an article is an attribute change, which purges the entire cache.
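
A minimal sketch of these purge rules in PHP.  Again, the helper names here are hypothetical rather than the engine’s actual API:

define( 'CACHE_DIR', $_SERVER['DOCUMENT_ROOT'] . '/b/html_cache' );

// Delete a single cached page if it exists.
function purgePage( $page ) {
    $file = CACHE_DIR . "/$page.html";
    if( file_exists( $file ) ) unlink( $file );
}

// Delete every cached page.  Used for any attribute change, since the
// attributes affect the sidebar ordering on every cached page.
function purgeAll() {
    foreach( glob( CACHE_DIR . '/*.html' ) as $file ) unlink( $file );
}

// Example: the purge rule for a content update to article 45 would be
purgePage( 'article-45' );
purgePage( 'index' );
purgePage( 'sitemap' );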

ETags and all that

Remember that client-side caching within the user’s browser is also an important factor in server load and response time.  The cached HTML has its ETags automatically generated by Apache to facilitate this.  I drop the INode component from the default setting and instead use FileETag MTime Size.  My reason is that the shared service is hosted by a cluster, and the inode representation might vary depending on the serving node, resulting in spurious ETag misses.

I also don’t bother recreating this ETag for the scripted output, which means that for the user whose request initiates the script which primes the cache, his or her browser will end up doing a full download on the next view, as the ETags will be different.  Tough.

Also, because the cached HTML can now be flushed at any time (e.g. when a user adds a comment), I’ve dropped the ExpiresByType option for HTML files.  Any browser cache validation will be via ETag validation.
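
For reference, the relevant .htaccess directives look something like this (a sketch; any ExpiresByType directives for images, CSS and so on can stay in place):

# Build ETags from modification time and size only, as the inode can
# vary across the nodes of the shared-hosting cluster.
FileETag MTime Size

# No ExpiresByType for text/html: cached pages can be purged at any
# time, so browsers must revalidate HTML against the ETag instead.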
