More on using Rewrite rules in .htaccess files

This article is a further discussion of how to use rewrite rules on a shared hosting service (SHS) such as the one supplied by Webfusion, which I use.  It develops the discussion from some of my earlier blog articles.

The main documentation sources are in the Apache HTTP server documentation: the .htaccess files tutorial and the documentation on mod_rewrite.  The former contains a lot of useful information on the use of .htaccess files, but nothing related to rewriting.  That is covered by the latter, which is now a complete section covering the more detailed aspects in eleven separate sub-sections, three of which are essential reading.

Of the remaining eight, two are useful in specific cases and the rest really only apply where you have access to the base Apache configuration and therefore aren’t so relevant to SHS users.

Most of what you need to understand is covered somewhere in these chapters, but it's well worth scanning for “htaccess” and “per-directory” when you go through because this content is fragmented across them.  What I then want to do here is to cover the main points about the interaction of the rewrite functionality and .htaccess processing in a single article.

The overall Rewrite architecture

Rewrites fall into two categories:

The rewrite engine essentially does a loop:

do
  execute server rewrites (in the Apache Config)
  execute vhost rewrites (in the Apache Virtual Host Config)
  find the "Perdir" .htaccess file
  if found(.htaccess)
    execute .htaccess rewrites (in the user's directory)
while rewrite occurred

Note that this loop only terminates after a pass where no rewriting has occurred.  If you aren’t careful, your rules can therefore rewrite on every pass, and Apache then aborts the request with an error once its internal redirect limit is exceeded.
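To make the risk concrete, here is a minimal sketch of a rule that rewrites on every pass, together with one possible guard.  It assumes a root .htaccess and an application directory called blog that has no rewrite-enabled .htaccess of its own; the guard on REQUEST_URI is just one option chosen for this sketch.

```apache
# DOCUMENT_ROOT/.htaccess (illustrative only)
RewriteEngine on

# Loop-prone: the rewritten path blog/... matches ^(.*)$ again on the next
# pass, giving blog/blog/..., and so on until Apache gives up on the request.
# RewriteRule ^(.*)$ blog/$1 [last]

# Guarded: skip the rule once the request URI is already under /blog
RewriteCond %{REQUEST_URI} !^/blog(/|$)
RewriteRule ^(.*)$ blog/$1 [last]
```

With the guard in place the second pass finds nothing to rewrite, so the loop ends normally.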

"Perdir" Rewrites

If you have access to the Apache config (for example when you are buying a VM service such as Webfusion VPS or Amazon EC2) then best practice is to use the server and vhost rewrites and disable .htaccess use altogether.  SHS providers can't do this, as users need access to rewrite functionality through "Perdir" .htaccess files, and it is this category that I want to discuss in detail in this article.

What access files get processed

Processing of .htaccess files can only be enabled within the Apache configuration, and this is done as standard within SHS offerings.  (Note that the filename of the access files can be changed from the default .htaccess in the Apache configuration, but this is rarely done.)  When processing a request for any resource, Apache will try to open all possible access files on the path.  So in the case of an example, /blog/includes/tinymce/license.txt, it tries to open the following access files:

DOCUMENT_ROOT/.htaccess
DOCUMENT_ROOT/blog/.htaccess
DOCUMENT_ROOT/blog/includes/.htaccess
DOCUMENT_ROOT/blog/includes/tinymce/.htaccess
DOCUMENT_ROOT/blog/includes/tinymce/license.txt/.htaccess

where DOCUMENT_ROOT is the root directory of the user’s directory tree (in my case this is /websites/LinuxPackage02/el/li/so/ellisons.org.uk/public_html).  If any of the above files exists then it is read and cached during the processing of the request.  Doing a putative open and then handling the error condition if the file doesn’t exist may seem an odd implementation, but it is a cheap operation in terms of runtime and system overhead (say 0.1 ms), so this is an efficient way of traversing all possible .htaccess files.  Even the last case, which might seem an odd one to try, has some logic to it: if license.txt were a directory, then its .htaccess file would need to be processed; the error code in this case is “not a directory”, which therefore acts as a file / directory test for license.txt.

Note that an .htaccess file will only take part in rewrite processing if it includes a RewriteEngine on directive and Options FollowSymLinks has been set in the Apache server configuration (which is always the case for SHS offerings).
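If you run your own test server, the server-side half of this looks roughly like the following sketch.  The directory path is a placeholder, and your distribution's defaults may already set some of these.

```apache
# Server or virtual host configuration on a test machine, not .htaccess.
# The directory path below is a placeholder.
AccessFileName .htaccess

<Directory "/var/www/example/public_html">
    # Per-directory rewriting needs symlink following to be enabled
    Options FollowSymLinks
    # FileInfo permits the mod_rewrite directives inside .htaccess files
    AllowOverride FileInfo
</Directory>
```

The .htaccess file itself then only needs its own RewriteEngine on line before any rules.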

My advice is to keep the number of access files to a minimum, and delete any that you don’t need.  You can do all the rewrite processing, access control, etc., from a single DOCUMENT_ROOT/.htaccess and I prefer this approach.  However, if you have multiple applications under your root (for example I’ve got six including a blog, a test wiki and a forum) then a sensible compromise is to add a second tier of access files with one per application at the next directory level, so that application-specific rewrite processing can be split per application.  This also has the benefit that you can create a test directory (hierarchy) to develop changes to your rules, and do this without causing the live applications to fail, but at the cost of an extra .htaccess file open and Perdir processing cycle.

Some Misconceptions

How PerDir rewrite processing works

RewriteCond %{ENV:REDIRECT_STATUS} =""
RewriteCond %{SCRIPT_FILENAME}  !-f
RewriteCond %{HTTP_HOST}        (blog|wiki|forum)\.ellisons\.org\.uk   [nocase]
RewriteRule  ^(.*)              %1/$1                                  [last]   

# This is evaluated as follows. Note that the REDIRECT_STATUS environment variable is set by any 
# redirect processing cycle, so this condition is a safety check to ensure that this rule is only
# applied on the first rewrite pass and not on any subsequent loops. 
#
# If URI_pattern.match('^(.*)')   and
#    %{ENV:REDIRECT_STATUS} = ""  and     # empty only on the first pass (see above)
#    %{SCRIPT_FILENAME}  !-f      and
#    %{HTTP_HOST}.match('(blog|wiki|forum)\.ellisons\.org\.uk') then
#    URI_pattern = "%1/$1"   # where %1 is blog, wiki or forum and $1 is the URI from the RewriteRule
#    Break processing
# Endif

So, wrapping up this example: if I adopted a two-level .htaccess for my blog, this would give me the following set of rules in the DOCUMENT_ROOT/blog/.htaccess file:

RewriteEngine on
RewriteBase /blog

# If the URI maps to a file that exists then stop. This will kill endless loops

RewriteCond %{REQUEST_FILENAME}     -f
RewriteRule .*                      -                   [last]

# If the request is HTML cacheable (a GET to a specific list and with no query string, the
# user is not logged on) and the HTML cache file exists then use it instead of executing PHP

RewriteCond %{HTTP_COOKIE}                                              !blog_user
RewriteCond %{REQUEST_METHOD}%{QUERY_STRING}                            =GET               [nocase]
RewriteCond %{DOCUMENT_ROOT}/blog/html_cache/$1.html                    -f
RewriteRule ^(article-\d+|index|sitemap.xml|search-\w+|rss-[0-9a-z]*)$  html_cache/$1.html [last]

# Anything else pass to index.php

RewriteRule (.*) index.php?page=$1    [qsappend,last]

One complication here is the use of RewriteBase.  This doesn’t apply to the rewrite rule pattern, but it is added to the substitution strings and the REQUEST_URI variable.  You can see this in an extract of the rewrite log (with debugging enabled) for a request to http://blog.ellisons.org.uk/article-59 as follows.  (Note that I’ve trimmed some of the header fields and replaced the document root by DOCROOT to save space.)

init rewrite engine with requested uri /article-59
pass through /article-59
[perdir DOCROOT/] strip per-dir prefix: DOCROOT/article-59 -> article-59
[perdir DOCROOT/] applying pattern '^(.*)' to uri 'article-59'
[perdir DOCROOT/] RewriteCond: input='' pattern=''  => matched
[perdir DOCROOT/] RewriteCond: input='DOCROOT/article-59' pattern='!-f' => matched
[perdir DOCROOT/] RewriteCond: input='blog.ellison6.home' pattern='(blog|wiki|forum)\.ellison6\.home' [NC] => matched
[perdir DOCROOT/] rewrite 'article-59' -> 'blog/article-59'
[perdir DOCROOT/] add per-dir prefix: blog/article-59 -> DOCROOT/blog/article-59
[perdir DOCROOT/] trying to replace prefix DOCROOT/ with /
strip matching prefix: DOCROOT/blog/article-59 -> blog/article-59
add subst prefix: blog/article-59 -> /blog/article-59
[perdir DOCROOT/] internal redirect with /blog/article-59 [INTERNAL REDIRECT]
[perdir DOCROOT/blog/] strip per-dir prefix: DOCROOT/blog/article-59 -> article-59 
[perdir DOCROOT/blog/] applying pattern '.*' to uri 'article-59'
[perdir DOCROOT/blog/] RewriteCond: input='DOCROOT/blog/article-59' pattern='-f' => not-matched
[perdir DOCROOT/blog/] strip per-dir prefix: DOCROOT/blog/article-59 -> article-59 
[perdir DOCROOT/blog/] applying pattern '^(article-\d+|index|sitemap.xml|search-\w+|rss-[0-9a-z]*)$' to uri 'article-59'
[perdir DOCROOT/blog/] RewriteCond: input='blog_email=XXXXXXXX; blog_user=XXXX; blog_token=XXXXXXXXXXXXXXXX' pattern='!blog_user' => not-matched
[perdir DOCROOT/blog/] strip per-dir prefix: DOCROOT/blog/article-59 -> article-59 
[perdir DOCROOT/blog/] applying pattern '(.*)' to uri 'article-59'
[perdir DOCROOT/blog/] rewrite 'article-59' -> 'index.php?page=article-59'
split uri=index.php?page=article-59 -> uri=index.php, args=page=article-59
[perdir DOCROOT/blog/] add per-dir prefix: index.php -> DOCROOT/blog/index.php
[perdir DOCROOT/blog/] trying to replace prefix DOCROOT/blog/ with /blog
strip matching prefix: DOCROOT/blog/index.php -> index.php
add subst prefix: index.php -> /blog/index.php
[perdir DOCROOT/blog/] internal redirect with /blog/index.php [INTERNAL REDIRECT]
[perdir DOCROOT/blog/] strip per-dir prefix: DOCROOT/blog/index.php -> index.php
[perdir DOCROOT/blog/] applying pattern '.*' to uri 'index.php'
[perdir DOCROOT/blog/] RewriteCond: input='DOCROOT/blog/index.php' pattern='-f' => matched
[perdir DOCROOT/blog/] pass through DOCROOT/blog/index.php
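A trace like this one is only written when rewrite logging is switched on in the server configuration.  As a sketch, assuming an Apache 2.2 test server and a placeholder log path, the settings look something like this:

```apache
# Server or virtual host configuration only; these are not allowed in .htaccess.
# Apache 2.2 syntax; the log path is a placeholder.
RewriteLog "/var/log/apache2/rewrite.log"
RewriteLogLevel 5

# Apache 2.4 dropped both directives.  The rough equivalent there is:
# LogLevel alert rewrite:trace3
```

Higher levels give more detail but slow the server down considerably, so this is for test machines only.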

Diagnosing .htaccess usage

There is a simple fact of life here: debugging .htaccess files on a shared service is very difficult, for one main reason.  Even though the Apache server has facilities to carry out detailed logging, enabling them incurs major performance penalties, so service providers do not enable them in the Apache configuration.  So you as an SHS user/developer are left with a process of trial and error where the only error diagnostic that you get is the HTTP status code returned if the rewrite fails.  Hence I do all of my .htaccess rule debugging on a local test VM which mirrors my SHS environment.  Here I can set the RewriteLogLevel to dump diagnostics for the Apache instance to a rewrite.log file.  If you can’t do this then another trick is to use environment variables to pass parameters between rules and to the executing script: write intermediate test strings to environment variables using the [E=VAR:val] flag, and after the internal redirect the invoked script can read them as the environment variable REDIRECT_VAR.
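As a sketch of the environment variable trick, the rule below records what the pattern captured.  The variable name DEBUG_MATCH and the index.php target are assumptions chosen purely for illustration.

```apache
RewriteEngine on

# Run only on the first pass, then hand the request to index.php.
# The env flag stores the captured path in DEBUG_MATCH (a made-up name).
RewriteCond %{ENV:REDIRECT_STATUS} =""
RewriteRule ^(.*)$ index.php?page=$1 [env=DEBUG_MATCH:$1,qsappend,last]
```

After the internal redirect the script should see the value under the REDIRECT_ prefix, for example as $_SERVER['REDIRECT_DEBUG_MATCH'] in PHP.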

Because I use a Linux laptop as my PC, as I’ve discussed in previous articles, I can also use the strace system utility to log all of the system calls executed by the Apache child processes as follows:

# Start an strace on all www-data children
sudo rm -f /tmp/strace.*
sudo strace  -u www-data -tt -ff -o /tmp/strace $(ps -o "-p %p" h -u www-data)

This strace diagnostic enables me to look at the relative timelines and the file / directory accesses, and the writes to rewrite.log can then be tied up to the corresponding log records.  This information, plus code inspection of the mod_rewrite source, enables me to reverse-engineer the module’s processing.

I can examine the timelines, what processing takes place, for roughly how long and when, and what I/O takes place.  Unfortunately doing this is really in the skill domain of an experienced developer and probably a bit daunting for the average SHS user.  However, the reasons that I’ve included this footnote are (i) for those readers who would like to drill down themselves to understand what is going on in the rewrite processing; and (ii) to underline that the recommendations that I make in the preceding section are not simply a matter of opinion, but are underpinned by hard timing and file access data.