This blog has pretty much turned into a self-referential exercise, as a major theme in my articles is the development and performance of the blog engine itself. This started last October, when I decided to take an active interest in my blog again but disliked the creative experience offered by the existing blog engine, so I decided to rewrite the engine from scratch as a project. This rewrite is largely complete, so I have decided to write up my architecture and make a copy of the code publicly available. This article describes that architecture.
Perhaps the five major design criteria for the engine were the following:
- I had an existing blog and I was reasonably happy with the look-and-feel from a user perspective. It’s just that I disliked the implementation, the performance characteristics and the editor’s experience. Nonetheless, the existing blog engine (Eggblog V3) provided a baseline set of worked use-cases to start from.
- I wanted a system where the authoring of new content and the correction of existing content was pretty seamless and based on a reasonably full-featured WYSIWYG experience. I did look at various AtomPub and MetaWeblog based editors, but none of the OSS ones seemed to match the feature set and customisability of the OpenOffice.org Writer support for HTML, and a simple interface through an admin directory mapped onto my local filesystem involved less development effort. I am writing this article in OOo, which is integrated into the blog, so I’ve fulfilled this goal. When I don’t have OOo to hand, I also have the option to drop into the TinyMCE editor to edit content.
- I wanted page presentation to be separated out and managed by a lightweight templating system (which I have already described here).
- I wanted a blog application well tuned in performance terms to the sort of shared-service hosting offering that I run my blog on (a Webfusion shared Linux package). As I describe below, this engine is about as light and responsive as is practically possible, and significantly more streamlined than, say, WordPress for this type of platform.
- Since I develop my blog on my laptop (Ubuntu with a LAMP stack), I have gotten into the habit of doing all bulk edits on my local blog instance. So I also have a one-click function to synchronise any changes with my public blog.ellisons.org.uk copy. This enables me to work on any articles locally and only synchronise when I feel that they are ready for release.
I have now reached the point where I have achieved all five and the engine is quite stable, though I still maintain a “To-Do list” of minor fixes and enhancements which I am working through. But let us move on to the overall architecture.
Use of HTACCESS dispatch
As with all applications targeted at a shared service environment, I have no access to the Apache Web Server layer as the developer and therefore must rely on the appropriate .htaccess file for such configuration. As I have described in previous articles (“Using .htaccess files on a Webfusion shared service” and “A lightweight HTML cache for a Webfusion shared service”), I have configured the system:
- To implement my subdomains.
- To negotiate a sensible caching strategy with the client browsers to minimise the unnecessary transfer of static page content.
- To implement an HTML cache for piece-wise static page content, so that guest requests for such pages are internally redirected to the previously rendered HTML page, thus obviating the need for any PHP script execution in these circumstances. (The creation of these cached pages is handled within the templating engine, largely transparently to the application tier.)
- To catch and map pseudo-directory format URIs, such as /article-47, (when they aren’t in the HTML cache) to a single entry-point script, index.php, with the page requested as a GET parameter.
- To prevent web access to private directories and files. Any request URI which contains “/_” or “/.” is treated as forbidden.
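To make the last two rules concrete, here is a minimal PHP sketch of the same decisions. The blog itself makes them in .htaccess rather than in PHP, so this is an illustration only; the function names, the GET parameter name `page` and the `index` default for the home page are assumptions.

```php
<?php
// Illustration only: the real blog applies these rules in .htaccess.
// Function names, the "page" parameter and the "index" default are
// hypothetical.

// Any request URI containing "/_" or "/." is treated as forbidden.
function isForbiddenUri(string $uri): bool {
    return strpos($uri, '/_') !== false || strpos($uri, '/.') !== false;
}

// Map a pseudo-directory URI such as /article-47 onto the single
// entry-point script, passing the requested page as a GET parameter.
// Returns null for a forbidden URI (the caller would answer 403).
function mapToEntryPoint(string $uri): ?string {
    if (isForbiddenUri($uri)) {
        return null;
    }
    $page = trim($uri, '/');
    if ($page === '') {
        $page = 'index';    // assumed name for the home page
    }
    return 'index.php?page=' . rawurlencode($page);
}
```

With these definitions, `/article-47` maps to `index.php?page=article-47`, while `/_include/config.php` and `/.htaccess` are rejected.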
Script architecture
All application requests come through a common entry script, which includes some common modules for configuration data, a database access layer, common utilities, and the templating engine. It then dispatches to a dynamically loaded module which handles all of the page-specific processing for that page: the module search.php is used to process all search requests, and so on. These modules are loaded from a private subdirectory, _include. I have separate modules for:
- about, to process the about page, and this redirects to the about article
- admin, to process all administrator / editor functions
- archive, to process the archive page
- article, to process the article page view. The handling of article comments is passed to a secondary comments module, and the extra functions that are accessible to an admin are handled by routines in the admin module.
- index, which handles the home page. This is in the includes directory and distinct from the main index.php catch-all routine. The name clash is regrettable, but exists for historical reasons.
- photo, to handle the management of images
- rss, to implement all RSS-related requests such as the articles and (admin-only) comments feeds
- search, to process search requests
- sitemap, to process crawler sitemap requests
- sync, to carry out article synchronisation between my local development blog instance and the production instance on the Webfusion shared service.
- invalid, a catch-all error module to handle any request not on the above allowed list
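The allow-list dispatch described above can be sketched roughly as follows. This is not the blog's actual code: the `resolveModule()` helper and the assumption that a request such as `article-47` splits on the first hyphen into a module name and an argument are mine.

```php
<?php
// Sketch only: resolveModule() and the "article-47" request format are
// assumptions, not the blog's actual code.

// Allow-list taken from the module list above.
const MODULES = ['about', 'admin', 'archive', 'article', 'index', 'photo',
                 'rss', 'search', 'sitemap', 'sync'];

// Return the name of the module that should handle a requested page.
function resolveModule(string $page): string {
    // Split e.g. "article-47" into the module name and an optional argument.
    $name = explode('-', $page, 2)[0];
    // Anything not on the allow-list falls through to the error module.
    return in_array($name, MODULES, true) ? $name : 'invalid';
}

// The entry script would then load the module from the private directory:
//   require "_include/" . resolveModule($page) . ".php";
```

Checking the name against a fixed list before building the file path means a crafted request can never cause an arbitrary file to be loaded.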
Each module will typically use the templating engine to prepare the output page by making $page->assign() calls to bind any output fields and then a $page->output() call to render the page itself, and to create an HTML cache copy if necessary. The use of the templating engine and the database abstraction layer helps to keep this module code short (typically 100-200 lines), and my source for the entire application is less than 3,000 lines. (The standard TinyMCE code is on top of this.)
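To show the shape of that calling pattern, here is a hypothetical stand-in for the templating engine. The `Page` class below, its `{name}` placeholder syntax and its constructor are assumptions made for this sketch; the real engine also compiles templates and writes the HTML cache copy, which is omitted here.

```php
<?php
// Hypothetical stand-in for the templating engine, to show the calling
// pattern only. The {name} placeholder syntax is an assumption.
class Page {
    private $template;
    private $fields = array();

    public function __construct(string $template) {
        $this->template = $template;
    }

    // Bind an output field to an (HTML-escaped) value.
    public function assign(string $name, string $value): void {
        $this->fields['{' . $name . '}'] = htmlspecialchars($value);
    }

    // Render the page by substituting the bound fields.
    public function output(): string {
        return strtr($this->template, $this->fields);
    }
}

// A module body then reduces to a few bindings and one render call.
$page = new Page('<h1>{title}</h1><p>{body}</p>');
$page->assign('title', 'Archive');
$page->assign('body', 'Articles & comments');
echo $page->output(), "\n";
```

Keeping all markup in the template and all data preparation in the module is what lets each module stay short.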
As I discussed in previous performance articles, the main factor in keeping response latency short on this type of service platform is to minimise the number of script files that have to be loaded and parsed to execute any given request. However, I also need to balance this against sensible modularisation practices for maintainability. To square this circle, I have written a simple PHP script marshalling utility, stripAssemble.php. (The code is an appendix to this article.) I maintain the master unmarshalled version of the dispatching index.php as _debug_index.php. (The leading underscore prevents it being directly accessed by a web request.) When I am debugging I just overwrite index.php with this, and when I have finished I run stripAssemble to rebuild a new index.php. This scans the debug version, replacing any require() or include() statements starting in column 1 with the content of the named file inline, and then removes all comments to create a single consolidated production index.php. The preamble in this debug file is currently as follows:
<?php
ini_set( "arg_separator.output", "&" );
ini_set( 'error_reporting', E_ALL | E_STRICT );
ini_set( 'display_errors', False );
ini_set( 'include_path', './_include' . PATH_SEPARATOR . './_cache' );
require("config.php");
require("DB.php");
require("common.php");
require("template1.php");
require("index.php");
require("article.php");
include("article_EN.php");
include("index_EN.php");
# Get the standard context, then preload the PHP module if necessary and despatch to the requested page ...
Hence this standard aggregated index.php module includes code to handle the home page and normal article views within a single file load. OK, this script is some 700 lines long, but this still only takes a few milliseconds to compile on a modern server, and this is a couple of orders of magnitude less than the I/O delays in reading in the separate half dozen or so files if not already in the system file cache. Processing the other less frequent functions will typically require an extra PHP module load and a template load (except of course in the 90% plus cases where the request is already cached in the HTML cache and no script execution is required at all.)
I wrap this stripAssemble call in a script which first clears the template cache, replaces the root index.php with the debug version, and then uses wget to prime the template cache for the EN versions of the index and article templates before calling stripAssemble to assemble the production version.
StripAssemble
#! /usr/bin/php
<?php
$outFile = $argv[1];
chdir( dirname( $outFile ) );
$inFile = file_get_contents( '_debug_' . basename( $outFile ) );

if( preg_match( '/^ \s* ini_set \s*\(\s* ["\']include_path["\'].*/xm', $inFile, $m ) ) eval ($m[0]);

# scan input file replacing any include or require statements starting in column 1
$source = preg_replace_callback(
    '/^ (require|include) \s*\(\s* (\'|") (.*?) \\2 \s*\)\s* ;/xm',
    function( $inc ) {
        $contents = @file_get_contents("$inc[3]", FILE_USE_INCLUDE_PATH);
        if( $contents === false ) $contents = "";
        echo ($contents ? "Including " : "Missing "), $inc[3], "\n";
        return preg_replace( array( "/^<\?php/", "/\?>$/"), array("",""), $contents, 1 );
    },
    $inFile);

# Now pack the source code. This is based on the Tokenizer example in the PHP Documentation
$tokens = token_get_all($source);
$output = "";
foreach( $tokens as $token ) {
    if(is_string( $token ) ) {
        // simple 1-character token
        $output .= $token;
    } else {
        // token array
        list( $id, $text ) = $token;
        switch( $id ) {
            case T_COMMENT:
            case T_DOC_COMMENT:
                break;
            case T_WHITESPACE:
                $output .= ( ( strpos( $text, "\n" ) === false ) ? ' ' : "\n" );
                break;
            default:
                $output .= $text; // anything else -> output "as is"
                break;
        }
    }
}
file_put_contents( $outFile, $output );
echo "$outFile processed. Total input = ",strlen($source),", output = ",strlen($output),"\n";
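The packing step can be tried in isolation on a small string. The sketch below applies the same token rules as the listing above; the `packSource()` wrapper name and the sample input are hypothetical and exist only for this demonstration.

```php
<?php
// Standalone check of the packing step, using the same token rules as
// stripAssemble. packSource() is a hypothetical helper name.
function packSource(string $source): string {
    $output = '';
    foreach (token_get_all($source) as $token) {
        if (is_string($token)) {            // simple 1-character token
            $output .= $token;
            continue;
        }
        list($id, $text) = $token;
        if ($id === T_COMMENT || $id === T_DOC_COMMENT) {
            continue;                       // drop comments entirely
        }
        if ($id === T_WHITESPACE) {         // collapse each whitespace run
            $text = (strpos($text, "\n") === false) ? ' ' : "\n";
        }
        $output .= $text;
    }
    return $output;
}

$sample = "<?php\n/* header */\n\$a   =   1; // set a\necho \$a;\n";
echo packSource($sample);
```

Both comments disappear and the run of spaces around the assignment collapses to single spaces. Exactly where line breaks survive after a `//` comment depends on the PHP version, because the tokenizer's handling of the newline that ends a single-line comment changed in PHP 8.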