Last week I had some fun when I moved my blog rewrite from my home development system to the Webfusion service. This was a quick copy of a PHP script tarball and a mysqldump of the updated database. A couple of quick commands unpacked these into the public_html subdirectory and the database. Easy, I thought. Everything seemed to be fine until I noticed a familiar artefact of data migrations involving UTF-8 encoded data: the corruption of characters outside the base 128-character ASCII set. For example, the title of one article is “My blog’s templating engine”, and this was rendered as “My blogâ€™s templating engine”. The last time that I’d come across this was during a database migration of UTF-8 encoded tables in a Latin-1 database into a UTF-8 one.

After quite a few false starts and binary chops, the penny dropped when I saved a page from the browser to a local file and reloaded the local copy: the artefacts had vanished. So what was happening? I recalled a line that I noticed in the <VirtualHost> directive for Webfusion services:
# Not sure why Mike put this in but it's there so removing it will probably
# screw stuff up.
AddDefaultCharset ISO-8859-1
So I double-checked the AddDefaultCharset in the Apache documentation:
This directive specifies a default value for the media type charset parameter (the name of a character encoding) to be added to a response if and only if the response’s content-type is either text/plain or text/html. This should override any charset specified in the body of the response via a META element, though the exact behavior is often dependent on the user’s client configuration. A setting of AddDefaultCharset Off disables this functionality. AddDefaultCharset On enables a default charset of iso-8859-1. Any other value is assumed to be the charset to be used, which should be one of the IANA registered charset values for use in MIME media types. For example:
AddDefaultCharset utf-8
AddDefaultCharset should only be used when all of the text resources to which it applies are known to be in that character encoding and it is too inconvenient to label their charset individually. One such example is to add the charset parameter to resources containing generated content, such as legacy CGI scripts, that might be vulnerable to cross-site scripting attacks due to user-provided data being included in the output. Note, however, that a better solution is to just fix (or delete) those scripts, since setting a default charset does not protect users that have enabled the “auto-detect character encoding” feature on their browser.
(My bold for emphasis). So what “Mike” was doing by adding this, goodness knows. By default, this breaks any UTF-8 HTML output that uses a META tag to set the content type. There are two simple work-arounds: the first is to add a
AddDefaultCharset Off
to the relevant .htaccess file, and the second is to add an explicit
header('Content-Type: text/html; charset=UTF-8');
to your code, before the script sends any output. Either approach will override this VirtualHost directive. Job done.
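The mechanism behind the artefact can be reproduced in a few lines. The following is a minimal Python sketch, not part of the blog's PHP code. It assumes that the browser treats a declared ISO-8859-1 charset as Windows-1252 (as browsers generally do), and the variable names are purely illustrative.

```python
# Reproduce the artefact: UTF-8 bytes interpreted with a Latin-1 family charset.
title = "My blog\u2019s templating engine"  # U+2019 is the curly apostrophe

# What the script actually sends over the wire: the title encoded as UTF-8.
utf8_bytes = title.encode("utf-8")

# What the browser shows when the HTTP header claims ISO-8859-1.
# cp1252 (Windows-1252) stands in for the browser's handling of that label.
garbled = utf8_bytes.decode("cp1252")
print(garbled)  # My blogâ€™s templating engine

# Decoded with the correct charset, the same bytes give back the original title.
restored = utf8_bytes.decode("utf-8")
print(restored == title)  # True
```

The apostrophe is three bytes in UTF-8 (E2 80 99), and each byte is displayed as a separate Windows-1252 character, which is why one character turns into three.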