More on using Rewrite rules in .htaccess files

This article is a further discussion of how to use rewrite rules on a shared hosting service (SHS) such as the one supplied by Webfusion, which I use.  It develops some discussions from my earlier blog articles.

The main documentation source is the Apache HTTP server documentation, specifically the .htaccess files tutorial and the documentation on mod_rewrite.  The former contains a lot of useful information on the use of .htaccess files, but nothing related to rewriting.  That is covered by the latter, which is now a complete section covering the more detailed aspects in eleven separate sub-sections, three of which are essential reading.

Of the remaining eight, two are useful in specific cases and the rest really only apply where you have access to the base Apache configuration and therefore aren’t so relevant to SHS users.

Most of what you need to understand is covered somewhere in these chapters, but it’s well worth scanning for “htaccess” and “per-directory” when you go through because this content is fragmented across them.  What I then want to do here is to cover the main points about the interaction of the rewrite functionality and .htaccess processing in a single article.

The overall Rewrite architecture

Rewrites fall into two categories:

  • A system administrator can put them into the Apache configuration (SHS providers will typically do this to map the individual user domains onto the correct root directory) and these will apply to the entire server.
  • Non-privileged users have to use the directory-specific processing (referred to as “Perdir” in the rewrite logs) that interprets the rewrite rules in the .htaccess files.

The rewrite engine essentially does a loop:

do
  execute server rewrites (in the Apache Config)
  execute vhost rewrites (in the Apache Virtual Host Config)
  find the "Perdir" .htaccess file
  if found(.htaccess)
    execute .htaccess rewrites (in the user's directory)
while rewrite occurred

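Note that this loop only terminates after a pass in which no rewriting has occurred.  If you aren’t careful, a rule can keep firing on every pass, and Apache then abandons the request with an error (typically a 500 Internal Server Error once its internal redirect limit is exceeded).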
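As a minimal sketch of how such a loop arises, assume a DOCUMENT_ROOT/.htaccess containing only the rules below and no .htaccess file in the target directory.  The blog/ prefix is just an illustrative target.  The unguarded rule (shown commented out) matches its own output on every pass, whereas the guarded variant relies on the REDIRECT_STATUS environment variable being empty only on the first pass.

```apache
RewriteEngine on
RewriteBase /

# Unguarded: foo becomes blog/foo, the next pass turns that into blog/blog/foo,
# and so on until Apache aborts the request with an error.
#RewriteRule ^(.*)$  blog/$1  [last]

# Guarded: the condition is only true on the first pass, so the second pass
# makes no change and the loop ends.
RewriteCond %{ENV:REDIRECT_STATUS}  ^$
RewriteRule ^(.*)$                  blog/$1  [last]
```

Other guards are possible, for example testing whether the target file already exists; which one fits depends on the rule.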

“Perdir” Rewrites

If you have access to the Apache config (for example when you are buying a VM service such as Webfusion VPS or Amazon EC2) then best practice is to use the server and vhost rewrites and disable .htaccess use altogether.  SHS providers can’t do this as users need access to rewrite functionality through “Perdir” .htaccess files, and it is this category that I want to discuss in detail in this article.
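For comparison, here is a rough sketch of what the server-side alternative might look like on a VM where you control the configuration.  The host name, paths and the index.php front controller are placeholders, not taken from any real setup.

```apache
<VirtualHost *:80>
    ServerName   blog.example.org
    DocumentRoot "/var/www/example/blog"

    <Directory "/var/www/example/blog">
        # Never look for .htaccess files under this tree
        AllowOverride None
    </Directory>

    RewriteEngine on
    # In server / vhost context the pattern is matched against the full URL path,
    # leading / included, and the filename has not been resolved yet, so the
    # file test has to build the path itself.
    RewriteCond %{DOCUMENT_ROOT}%{REQUEST_URI}  !-f
    RewriteRule ^/(.*)$  /index.php?page=$1  [qsappend,last]
</VirtualHost>
```

Access-control directives are left out here because their syntax differs between Apache versions.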

What access files get processed

Processing of .htaccess files can only be enabled within the Apache configuration, and this is done as standard within SHS offerings.  (Note that the filename of the access files can be changed from the default .htaccess in the Apache configuration, but this is rarely done.)  When processing a request for any resource, Apache will try to open all possible access files on the path.  So in the case of an example, /blog/includes/tinymce/license.txt, it tries to open the following access files:

DOCUMENT_ROOT/.htaccess
DOCUMENT_ROOT/blog/.htaccess
DOCUMENT_ROOT/blog/includes/.htaccess
DOCUMENT_ROOT/blog/includes/tinymce/.htaccess
DOCUMENT_ROOT/blog/includes/tinymce/license.txt/.htaccess

where DOCUMENT_ROOT is the root directory of the user’s directory tree (in my case this is /websites/LinuxPackage02/el/li/so/ellisons.org.uk/public_html).  If any of the above files exists then it is read and cached during the processing of the request.  Doing a putative open and then handling the error condition if the file doesn’t exist may seem an odd implementation, but it is a cheap operation (in terms of runtime and system overhead), say 0.1 ms, so this is an efficient way of traversing all possible .htaccess files.  Even the last case, which might seem an odd one to try, has some logic to it: if license.txt were a directory, then its .htaccess file would need to be processed; the error code in this case is “not a directory”, which therefore acts as a file / directory test for license.txt.

Note that the .htaccess file will only take part in rewrite processing if it includes a RewriteEngine on directive and Options FollowSymLinks has been set in the Apache server configuration (which is always the case with SHS offerings).
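To make the division of responsibility concrete, the provider-side part might look something like the following sketch.  The directory path is a placeholder, and a real provider’s configuration will contain a good deal more.

```apache
# Server or virtual host configuration (set by the hosting provider)
<Directory "/var/www/example/public_html">
    # FileInfo is the override class that covers the mod_rewrite directives
    AllowOverride FileInfo
    Options FollowSymLinks
</Directory>

# The user's part is then a single line at the top of the .htaccess file:
#   RewriteEngine on
```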

My advice is to keep the number of access files to a minimum, and delete any that you don’t need.  You can do all the rewrite processing, access control, etc., from a single DOCUMENT_ROOT/.htaccess and I prefer this approach.  However, if you have multiple applications under your root (for example I’ve got six including a blog, a test wiki and a forum) then a sensible compromise is to add a second tier of access files with one per application at the next directory level, so that application-specific rewrite processing can be split per application.  This also has the benefit that you can create a test directory (hierarchy) to develop changes to your rules, and do this without causing the live applications to fail, but at the cost of an extra .htaccess file open and Perdir processing cycle.

Some Misconceptions

  • .htaccess rewrites are inefficient.  The Apache documentation uses phrases like “incredible”, “these are reached a very long time after the URLs have been translated to filenames”, “mod_rewrite should be considered a last resort”, …  This is just silly: the Apache child process starts to process the .htaccess in DOCUMENT_ROOT less than 0.2 ms after reading the GET request, and the entire rewrite processing typically takes a few ms at most if rewrite logging is disabled (as is always the case on an SHS configuration) and the .htaccess files are precached in the server’s Virtual Filesystem Cache (VFC).  To put this time overhead into context, PHP image activation typically takes 80 ms, and reading a single file from a network-storage-mounted directory (the typical implementation for SHS server farm architectures), if not precached in the VFC or NAS cache, can take 200 ms.  So the rewrite overheads are typically one to two orders of magnitude less than the PHP image activation and file access overheads.  The only material overhead is in reading in the .htaccess file(s), and since these are the ‘hottest’ files in an SHS application they will typically have excellent cache-hit ratios.
  • .htaccess rewrites are a minefield that is best avoided.  Unfortunately SHS application developers have no viable alternative for many Apache functions, and so using this functionality can’t be avoided.  Yes, the .htaccess architecture and its implementation are complicated as a result of their evolution, but this is also largely compounded by weaknesses in the current documentation, which does a poor job of explaining how to use .htaccess files to implement rewrite functionality.

How PerDir rewrite processing works

  • Processing is multi-pass. Dropping through the last rewrite or setting a “last” flag on a successful rewrite forces the end of a pass. If any processed rewrite rule has matched and has changed the URI or the query string, then this could mean that another rule might apply, so this is treated as an “INTERNAL REDIRECT” at the end of the pass, and processing is restarted. This cycle is terminated when the last pass makes no change to the URI or the query string.
  • Each pass uses a single .htaccess file.  The deepest .htaccess file on the URI path with RewriteEngine on is used.  For historic reasons, this processing is convoluted.
  • By this second stage the URI has been converted into a putative filename based on the path relative to DOCUMENT_ROOT.  This is then split into a “Per Dir” part (which includes the trailing /) and a relative URI part (where the leading / is missing).  So using an example which is generated initially by a request to http://blog.ellisons.org.uk/includes/tinymce/license.txt:
    • HTTP_HOST is blog.ellisons.org.uk
    • Per Dir is /websites/LinuxPackage02/el/li/so/ellisons.org.uk/public_html/
    • The bare URI is includes/tinymce/license.txt (note the missing leading /).
  • Any “?” delimiter and request parameters are also stripped from the URI before it is then used as the match string in any RewriteRule statements.  Thus you can’t examine request parameters in a RewriteRule.
  • The rewrite logic consists of a sequence of RewriteRule statements.  By default these are executed sequentially until the last one is reached or a successful rule carries a [last] flag.  However the order of rules can be modified by use of flags such as [chain] and [skip].
  • Any rule can be preceded by one or more RewriteCond statements.  These are only evaluated if the RewriteRule pattern matches.  The condition statements essentially evaluate to a go/no-go against the first parameter, which is interpolated and can therefore include expanded variables, back-references, etc.  (Hence request parameters can only be accessed by using the %{QUERY_STRING} variable.)  This is then matched against the second parameter, which is a condition pattern.  RewriteCond statements are used for two main purposes: to set match variables which can then be used as back-references in the associated RewriteRule replacement string, or to provide a guard to stop an endless loop substitution.  Note that the execution order is therefore the reverse of the written order: the RewriteRule pattern is matched first, then its conditions are evaluated, and only then is the substitution applied.
  • So in this example the root .htaccess applies, and if it contained the following rule then the URI would be rewritten as blog/includes/tinymce/license.txt and the [last] flag would end the pass and trigger an internal redirection:
RewriteCond %{ENV:REDIRECT_STATUS} =""
RewriteCond %{SCRIPT_FILENAME}  !-f
RewriteCond %{HTTP_HOST}        (blog|wiki|forum)\.ellisons\.org\.uk   [nocase]
RewriteRule  ^(.*)              %1/$1                                  [last]

# This is evaluated as follows. Note that the REDIRECT_STATUS environment variable is set by any
# redirect processing cycle, so this condition is a safety check to ensure that this rule is only
# applied on the first rewrite pass and not on any subsequent loops.
#
# If URI_pattern.match('^(.*)')   and
#    %{ENV:REDIRECT_STATUS} = ""  and
#    %{SCRIPT_FILENAME}  !-f      and
#    %{HTTP_HOST}.match('(blog|wiki|forum)\.ellisons\.org\.uk') then
#    URI_pattern = "%1/$1"   # where %1 is blog, wiki or forum and $1 is the URI from the RewriteRule
#    Break processing
# Endif
  • A second rewrite pass using DOCUMENT_ROOT/blog/.htaccess now applies with:
    • Per Dir is /websites/LinuxPackage02/el/li/so/ellisons.org.uk/public_html/blog/
    • The REQUEST_URI match string is includes/tinymce/license.txt
  • So the rule patterns ($1 etc.) can be used in any condition string.  Any condition patterns (%1 etc.) can be used in the next condition string, or (in the case of the last condition pattern) the rewrite rule string.
  • Remember match patterns are on the left and strings (which can include variable expansion) on the right in the RewriteRule statements, and this is vice versa in the case of RewriteCond statements: strings (which can include variable expansion) are on the left and match patterns on the right.
  • Remember to specify a RewriteBase (usually / for your document root .htaccess).
  • Following a successful rewrite the URI match string is replaced by the interpolated RewriteRule substitution string for any following statements.
  • The rewrite engine will treat any RewriteRule substitution string with leading / as a (new) absolute directory discarding the current relative path.  How these are processed varies according to the Apache version, so the safest thing to do if you do want to specify an absolute path is to terminate the current rewrite cycle with a [last] flag and restart a new cycle (with possibly a new .htaccess file).
  • The rewrite engine will treat any RewriteRule substitution string containing a “?” as setting a new QUERY_STRING, which will replace the existing one unless the [qsappend] flag is set, in which case the old string is appended.
  • Hence my advice is to use a relative path (no leading /) for the RewriteRule substitution string wherever practical.
  • Write down all of the rules that you need, work out the dependency groups that will require chaining, and order the groups / singletons in approximate order of frequency.  If you are a programmer comfortable writing procedural logic then you might find it easier to write a version using if/then/else bracketing first, then manually convert it to the rewrite ladder logic form by adding the necessary [or], [chain], [skip], [next] and [last] flags.  Your aim should be to order the rules so that a single pass executes the rewriting (or two passes, one root and one application .htaccess, in the case of a two-level split).
  • Make sure that you’ve got the Regexp syntax correct, so that each gives the expected match variables for a set of test patterns.
  • Don’t forget that Perdir processing is cyclic and this is an intrinsic characteristic of how the engine works.  Nonetheless, I regard such looping as a bad practice to be avoided wherever possible.
  • Make sure that you add the necessary conditions for each rewrite rule (i) to generate any back references needed in the replacement string, and (ii) to prevent unnecessary refiring of the rule on following internal redirection passes.
  • If you don’t have access to your own test Apache instance where you can set Rewrite debugging, test the rules in a separate test subdirectory, adding them one at a time, because the only diagnostic that will be available to you is the status return code on failure.  Be aware that patterns might fail when you don’t expect them to.  For example, if the above rule had been “^(.+)   %1/$1” then this would fail on the URI http://blog.ellisons.org.uk/ as the match string would be “” and (.+) requires at least one character.
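Two of the points above, that request parameters are only visible through %{QUERY_STRING} and that a “?” in the substitution string sets a new query string, can be illustrated with a small sketch.  The URL shapes used here (show, id, article-N) are invented for the illustration.

```apache
RewriteEngine on
RewriteBase /

# Request: /show?id=59
# The rule pattern is matched against "show" only; the id is captured by the
# condition as %1 and used in the substitution.  The trailing "?" sets an empty
# query string, so the original id=59 is not carried over.
RewriteCond %{QUERY_STRING}  (?:^|&)id=(\d+)(?:&|$)
RewriteRule ^show$           article-%1?  [last]
```

The rewritten URI no longer matches ^show$, so the following pass makes no change and the cycle ends without needing a separate loop guard.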

So, wrapping up this example: if I adopted a two-level .htaccess for my blog, then this would give me this set of rules in the DOCUMENT_ROOT/blog/.htaccess file:

RewriteEngine on
RewriteBase /blog

# If the URI maps to a file that exists then stop. This will kill endless loops

RewriteCond %{REQUEST_FILENAME}     -f
RewriteRule .*                      -                   [last]

# If the request is HTML cacheable (a GET to a specific list and with no query string, the
# user is not logged on) and the HTML cache file exists then use it instead of executing PHP

RewriteCond %{HTTP_COOKIE}                                              !blog_user
RewriteCond %{REQUEST_METHOD}%{QUERY_STRING}                            =GET               [nocase]
RewriteCond %{DOCUMENT_ROOT}/blog/html_cache/$1.html                    -f
RewriteRule ^(article-\d+|index|sitemap.xml|search-\w+|rss-[0-9a-z]*)$  html_cache/$1.html [last]

# Anything else pass to index.php

RewriteRule (.*) index.php?page=$1    [qsappend,last]

One complication here is the use of the RewriteBase.  This doesn’t apply to the rewrite rule pattern, but it is added in to the substitution strings and the REQUEST_URI variable. You can see this in an extract of the rewrite log (with debugging enabled) for a request to http://blog.ellisons.org.uk/article-59 as follows (note that I’ve trimmed some of the header fields and replaced the document root by DOCROOT to save space.)

init rewrite engine with requested uri /article-59
pass through /article-59
[perdir DOCROOT/] strip per-dir prefix: DOCROOT/article-59 -> article-59
[perdir DOCROOT/] applying pattern '^(.*)' to uri 'article-59'
[perdir DOCROOT/] RewriteCond: input='' pattern=''  => matched
[perdir DOCROOT/] RewriteCond: input='DOCROOT/article-59' pattern='!-f' => matched
[perdir DOCROOT/] RewriteCond: input='blog.ellison6.home' pattern='(blog|wiki|forum)\.ellison6\.home' [NC] => matched
[perdir DOCROOT/] rewrite 'article-59' -> 'blog/article-59'
[perdir DOCROOT/] add per-dir prefix: blog/article-59 -> DOCROOT/blog/article-59
[perdir DOCROOT/] trying to replace prefix DOCROOT/ with /
strip matching prefix: DOCROOT/blog/article-59 -> blog/article-59
add subst prefix: blog/article-59 -> /blog/article-59
[perdir DOCROOT/] internal redirect with /blog/article-59 [INTERNAL REDIRECT]
[perdir DOCROOT/blog/] strip per-dir prefix: DOCROOT/blog/article-59 -> article-59 
[perdir DOCROOT/blog/] applying pattern '.*' to uri 'article-59'
[perdir DOCROOT/blog/] RewriteCond: input='DOCROOT/blog/article-59' pattern='-f' => not-matched
[perdir DOCROOT/blog/] strip per-dir prefix: DOCROOT/blog/article-59 -> article-59 
[perdir DOCROOT/blog/] applying pattern '^(article-\d+|index|sitemap.xml|search-\w+|rss-[0-9a-z]*)$' to uri 'article-59'
[perdir DOCROOT/blog/] RewriteCond: input='blog_email=XXXXXXXX; blog_user=XXXX; blog_token=XXXXXXXXXXXXXXXX' pattern='!blog_user' => not-matched
[perdir DOCROOT/blog/] strip per-dir prefix: DOCROOT/blog/article-59 -> article-59 
[perdir DOCROOT/blog/] applying pattern '(.*)' to uri 'article-59'
[perdir DOCROOT/blog/] rewrite 'article-59' -> 'index.php?page=article-59'
split uri=index.php?page=article-59 -> uri=index.php, args=page=article-59
[perdir DOCROOT/blog/] add per-dir prefix: index.php -> DOCROOT/blog/index.php
[perdir DOCROOT/blog/] trying to replace prefix DOCROOT/blog/ with /blog
strip matching prefix: DOCROOT/blog/index.php -> index.php
add subst prefix: index.php -> /blog/index.php
[perdir DOCROOT/blog/] internal redirect with /blog/index.php [INTERNAL REDIRECT]
[perdir DOCROOT/blog/] strip per-dir prefix: DOCROOT/blog/index.php -> index.php
[perdir DOCROOT/blog/] applying pattern '.*' to uri 'index.php'
[perdir DOCROOT/blog/] RewriteCond: input='DOCROOT/blog/index.php' pattern='-f' => matched
[perdir DOCROOT/blog/] pass through DOCROOT/blog/index.php
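A log like the one above needs rewrite logging to be switched on in the server or virtual host configuration, which is why it is only an option on a test server.  A minimal sketch, assuming Apache 2.2 and an illustrative log path and level; Apache 2.4 dropped these two directives in favour of a per-module LogLevel setting.

```apache
# Apache 2.2: server or virtual host context only, not .htaccess
RewriteLog       "/var/log/apache2/rewrite.log"
RewriteLogLevel  5

# Apache 2.4 equivalent: the entries go to the error log instead
#LogLevel alert rewrite:trace5
```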

Diagnosing .htaccess usage

There is a simple fact of life here: debugging .htaccess files on a shared service is very difficult, and this is for one main reason: even though the Apache server has facilities to carry out detailed logging, enabling logging incurs major performance penalties, so service providers do not enable this through the Apache configuration.  So you as an SHS user/developer are left with a process of trial and error where the only error diagnostic that you get is the HTTP status return code if the rewrite fails.  Hence I do all of my .htaccess rule debugging on a local test VM which mirrors my SHS environment.  Here I can set the RewriteLogLevel to dump diagnostics for the Apache instance to a rewrite.log file.  If you can’t do this then another trick is to use environment variables to pass parameters between rules and to the executing script, writing intermediate test strings to environment variables using the [E=VAR:val] construct; these are then accessible to the invoked script as the environment variable REDIRECT_VAR.
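The environment variable trick might look something like this in an application-level .htaccess.  This is a sketch only: DBG_MATCH is a made-up variable name, and the index.php front controller follows the earlier blog example.

```apache
RewriteEngine on
RewriteBase /blog

# Record what the rule pattern captured before handing over to the script.
# After the internal redirect the script sees the value as REDIRECT_DBG_MATCH.
# The condition stops the rule refiring on the following pass.
RewriteCond %{ENV:REDIRECT_STATUS}  ^$
RewriteRule ^(.*)$  index.php?page=$1  [env=DBG_MATCH:$1,qsappend,last]
```

The script can then dump REDIRECT_DBG_MATCH (in PHP it appears in the $_SERVER array) to show what the rule actually matched.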

Because I use a Linux laptop as my PC, as I’ve discussed in previous articles, I can also use the strace system utility to log all of the system calls executed by the Apache child processes as follows:

# Start an strace on all www-data children
sudo rm -f /tmp/strace.*
sudo strace  -u www-data -tt -ff -o /tmp/strace $(ps -o "-p %p" h -u www-data)

This strace diagnostic enables me to look at the relative timelines and the file / directory accesses, and the writes to the rewrite.log can then be tied up to the corresponding log records.  This information, plus code inspection of the mod_rewrite source, enables me to reverse-engineer the module’s processing.

I can examine the timelines, what processing takes place, for roughly how long and when, and what I/O takes place.  Unfortunately doing this is really in the skill domain of an experienced developer and probably a bit daunting for the average SHS user.  However, the reasons that I’ve included this footnote are (i) for those readers who would like to drill down themselves to understand what is going on in the rewrite processing; and (ii) to underline that the recommendations that I make in the preceding section are not simply a matter of opinion, but are underpinned by hard timing and file access data.
