More on optimising PHP applications in a Webfusion shared service

This article is a follow-up to Performance in a Webfusion shared service and guidelines for optimising PHP applications.  I decided to have a look at general postings on PHP performance optimisation, and having used Google to wander around the Internet and blogosphere, I was quite frankly stunned by some of the absolute rubbish that I found out there.  I stand by what I said in that previous article, but given some of the advice floating around and the total lack of quantitative measures underpinning it, I felt that I just had to run some quick benchmarks to refute the more ludicrous advice.  Examples abound, but the post 50+ PHP optimisation tips revisited is one of the better examples of this misleading school.  The blog post “Micro” Optimizations That Matter is far closer to my view.

Let me quantify my views with some ballpark figures.  I need to emphasise that the figures below are indicative.  I have included the benchmarks as attachments to this article, just in case you want to validate them on your own service.

  • 20–40.  The number of files that you can open and read per second, if the file system cache is not primed.
  • 1,500–2,500.  The number of files that you can open and read per second, if the file system cache is primed with their contents.
  • 300,000–400,000.  The number of lines per second that the PHP interpreter can compile.
  • 20,000,000.  The number of PHP instructions per second that the PHP interpreter can interpret.
  • 500–1,000.  The number of MySQL statements per second that the PHP interpreter can call, if the database cache is primed with your table contents.

I generated these figures by running the scripts below on my Webfusion shared service when it was lightly loaded.  My 2-year-old laptop, running Ubuntu with its own LAMP stack, is ~50% faster, and my development server is ~100% faster.  Since my shared service runs on an 8-core Intel E5410 Xeon server, this slowness is due to the contention with other users that occurs even on a fairly idle server.  As the load and contention on the server increase, the latency rises and the effective throughput per request collapses.  (See my article Performance Collapse in Systems where I discuss this in more depth.)  The net result is that these ratios might vary as the above numbers collapse.  However, in rough terms, if the file system cache isn’t primed and I take the time the PHP interpreter needs to read in a single PHP source file as a baseline, then in that same time it can compile ~10,000 lines of source code, execute ~500,000 PHP instructions, or call ~20 MySQL statements.  If another request hits the same code path within a few minutes, then the file system cache will probably be primed and the baseline shrinks: the equivalent figures become roughly ~150 lines of source code, ~10,000 PHP instructions, and ~0.5 MySQL statements.
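The ratio arithmetic above can be checked directly.  The per-second throughputs below are the indicative figures from my list; the mid-range picks of 30 uncached reads/s and 750 MySQL calls/s are my own representative values from the quoted ranges.

```php
<?php
// Indicative throughputs from the measurements above (mid-range picks).
$uncachedReadsPerSec = 30;        // uncached file open+read: 20-40/s
$linesCompiledPerSec = 300000;    // source lines compiled per second
$opcodesPerSec       = 20000000;  // interpreted PHP instructions per second
$sqlCallsPerSec      = 750;       // cached MySQL statements: 500-1,000/s

// Work that fits in the time of ONE uncached physical file read:
echo (int) ( $linesCompiledPerSec / $uncachedReadsPerSec ), " lines compiled\n";   // 10000
echo (int) ( $opcodesPerSec / $uncachedReadsPerSec ), " instructions executed\n";  // 666666
echo (int) ( $sqlCallsPerSec / $uncachedReadsPerSec ), " MySQL calls\n";           // 25
```

With the primed-cache figure of ~2,000 reads/s the same sums give the ~150-line and ~0.5-statement equivalents quoted above.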

Of course PHP statements can compile to multiple PHP bytecode instructions; code density in a source file can vary; and you can die on a SQL statement if you haven’t got the correct indexes in place or are using a dumb execution plan.  But it is surprising how the weak law of large numbers blends much of this variation out.  The main point that I want to emphasise here is a comparison of orders of magnitude, not the odd 50% error due to sampling code density etc. or the subtle differences between one system and another.  From these data, I can suggest some simple guidelines:

  • If you only execute a statement once or not at all in a module, then micro-optimisation is absolutely irrelevant.  Clarity and understandability of the code are far more important.
  • If you only execute a statement fewer than a dozen or so times on average, then the same still applies.
  • Subject to keeping the code clean and easy to read, the briefer it is, the better.
  • If you have bulk code that isn’t “main path” (that is, code with only a small probability of being executed), then consider loading it dynamically when and if needed.
  • Never try to hand-code PHP implementations of functionality that can be crisply written using standard PHP functions: the code will be shorter, the bytecode count smaller, and your runtime will spend more time in compiled C rather than interpreting PHP bytecode.  Consider a line in my example, Running remote commands on a Webfusion shared service.  Should I replace this with a for loop?  Why?  This is the fastest and most economical way of doing what I want, so even though the extra runtime of the for loop doesn’t matter a damn, it still doesn’t make sense to dumb this code down.  On the other hand, if I were a novice who was unfamiliar with the function libraries and so had used a loop to implement this, would it matter in performance terms?  No it would not.
    $command = implode( ' ', array_slice( $argv, 1) );
  • If you are executing a block of statements more than a few dozen times, then still don’t consider optimisation unless the code path is used frequently (say on >10% of requests to your website).  If you do feel it necessary to optimise, then the first question that you should ask is whether the algorithm which underpins the code is the right one, not how you should micro-optimise the loop.  This is particularly the case where any SQL statements are involved in the loop.  Algorithmic optimisation will always out-perform micro-optimisation hands-down.  However, optimised algorithms may be more complicated and therefore less understandable and more bug-prone, so practise KISS (keep it simple, stupid) except where the user request is a frequent one and it generates perceptibly long latencies in response.
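The dynamic-loading guideline above boils down to deferring the require to the rare code path.  A minimal sketch, in which the file export_functions.php and the function export_articles_as_csv() are hypothetical examples:

```php
<?php
// Keep rarely-executed bulk code out of the main script, so the common
// path never pays its read-and-compile cost.  The file name and function
// below are hypothetical examples.
if ( isset( $_GET['export'] ) ) {          // rare path, say <10% of requests
    require_once 'export_functions.php';   // read and compiled only when needed
    export_articles_as_csv();
}
// ... main-path code continues here, unaffected by the bulk code above ...
```

On an suPHP service without an opcode cache this saves both the physical read and the compile of the deferred file on every main-path request.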

To put this all in context, I did an strace of my blog index page script on my laptop. An analysis of this gives the following broad times:

  • 113 ms – PHP interpreter image activation.  This involves a lot of I/O, but in this case it was entirely cached.  Given that the PHP environment is reloaded on every request on an suPHP LAMP stack, this is typical.  Of course, you wouldn’t have this overhead if your server was using mod_php or mod_fastcgi, but this typically isn’t the case on a shared service.
  • 23 ms – read and compile the index.php script.  As I discuss in a later article, I glob all of the script components making up the index script into a single file some 700 lines long to minimise the physical I/Os needed to read in the script.  Without such globbing, the physical I/Os needed to load these files could easily add half a second or more delay on the server.  This overhead would normally be removed in a LAMP stack which used a PHP opcode cache such as APC, but again this typically isn’t the case on a shared service.
  • 5 ms – the SQL transactions to load the blog config and the articles on the index page.
  • 1 ms – processing the page.
  • 1 ms – template overhead for preparing the page output.
  • 3 ms – output of the page.

So the actual execution of PHP opcodes accounts for only a few percent of the total script execution time, and what you do here in the way of optimisation is largely irrelevant.  The two big delays that you do have control over are (i) the number of physical I/Os needed to run your script, which is largely determined by the number of files that have to be read, and (ii) the PHP interpreter start-up overheads, which you can only avoid if you use HTML caching as I describe in my next article.
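The file-combining step I mention above can be done once at deployment time with a trivial build script.  This is only a sketch of the idea, not the actual tool I use; the component file names in the usage example are hypothetical, and the tag-stripping regex assumes each component starts with a plain <?php tag.

```php
<?php
// One-off build step: concatenate a script's component files into a single
// file, so that serving a request needs one physical read rather than a dozen.
function combine_scripts( array $components, $target )
{
    $combined = "<?php\n";
    foreach ( $components as $file ) {
        // Strip each component's opening tag before appending its source.
        $source    = file_get_contents( $file );
        $combined .= preg_replace( '/^<\?php\s*/', '', $source, 1 ) . "\n";
    }
    file_put_contents( $target, $combined );
    return strlen( $combined );
}
```

A one-off call such as combine_scripts( array( 'config.php', 'templates.php', 'article.php' ), 'index.php' ) then leaves a single file for the interpreter to read and compile.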

An anecdote about algorithmic optimisation

Over a decade ago, my project team was implementing a system for one of the major oil companies, and over the course of the project my main client had become a good friend.  On one visit to his site, I found him battling with a large and expensive non-linear optimisation system that his company had funded.  They had just flown in a consultant from the developer, who had spent a week getting the runtime down by about 25% from ~2 days of computer time per run for the benchmark test case.  This was still far too long to be usable, and they were considering cancelling the entire programme.  He knew that I had a reputation in my company for being an optimisation guru, so by way of a personal favour, he asked me to spend one Saturday afternoon seeing if we could speed the code up.

I did a quick execution profile and saw that the application was spending over 95% of its time in a single loop in one module.  When I looked at the source code, I could see that the developer had “improved” a sort of a huge in-memory array by adding short-circuits to its inner-most loop.  The code was a bubble sort, possibly one of the slowest sorts documented.  So I ripped out all the crap to slim the inner loop down from ~100 lines to ~20 lines.  I then added the extra 5 lines needed to convert it from a bubble sort to a shell sort.  Why?  It was a quick and simple ‘peephole’ change, and because the underlying data structures were FORTRAN sparse arrays, it wasn’t easy to replace the custom code with a call to a standard library sort routine.  However, this moved the runtime for this sort from O(N²) to better than O(N^1.5), dropping the total runtime for the full test case to less than 45 minutes.  Another execution profile revealed that the code was now I/O bound in another loop in another module.  One more code change to replace an expensive I/O operation, and the runtime was down to less than two minutes.  I reckoned that we could have done another round to break the one-minute barrier, but we decided that going from days to two minutes in four hours’ work was enough, and went for some celebratory beers instead.  After all this, changing a few dozen lines out of some 300,000 dropped the runtime from days to minutes, and all for the price of a few beers.  Algorithmic optimisation wins every time.
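The conversion was cheap because a bubble sort is effectively the gap-1 case of a shell sort: wrap the exchange pass in a shrinking-gap loop and you are done.  The original was FORTRAN over sparse arrays, so this minimal PHP sketch is illustrative only; with the simple halving gap sequence used here, typical behaviour is around O(N^1.5) rather than the bubble sort’s O(N²).

```php
<?php
// A bubble sort runs the exchange pass with a fixed gap of 1; a shell sort
// runs the same pass over a shrinking gap sequence.  That is why the
// conversion only needed a handful of extra lines.
function shell_sort( array $a )
{
    $n = count( $a );
    for ( $gap = (int) ( $n / 2 ); $gap >= 1; $gap = (int) ( $gap / 2 ) ) {
        // With $gap fixed at 1, this inner part IS the bubble-style pass.
        for ( $i = $gap; $i < $n; $i++ ) {
            for ( $j = $i; $j >= $gap && $a[$j - $gap] > $a[$j]; $j -= $gap ) {
                $tmp          = $a[$j];
                $a[$j]        = $a[$j - $gap];
                $a[$j - $gap] = $tmp;
            }
        }
    }
    return $a;
}
```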

The Benchmarks

Health warning: these benchmarks are ‘as-is’ quick-and-dirties.  I haven’t implemented any security countermeasures, so access-protect them or remove them as soon as you’ve completed your benchmarking.  So often when I have been asked to review projects with performance problems, I have found that no one in the development team had a basic grasp of the relative performance of the different sub-systems that contribute to the end-to-end response that users experience.  As a result they wasted a lot of effort “optimising” components that are quite frankly irrelevant to overall performance, while missing the key ones entirely.  These benchmarks aren’t intended to answer questions like “is system A 5% faster than system B?”, but rather “is component X 1, 10, or 100 times faster than component Y?”

  • Compilation Benchmark.
    <?php
    header('Content-Type: text/plain; charset=iso-8859-1');
    $cycles = (int) $_GET['cycles'];
    $repeats = (int) $_GET['repeats'];
    $codeTemplate = preg_replace( '/^.*?class \w+ /s', '', file_get_contents( "$_GET[file]" ), 1);
    $nLines = count( explode( "\n", $codeTemplate ) ); 
    $totalTime = 0.0;
    for( $i = 0; $i < $cycles; $i++ ) {
       $code = '';
       for( $j = 0; $j < $repeats; $j++ ) $code .= "class test_{$i}_{$j} " . $codeTemplate;
       $start = microtime( true );
       eval( $code );
       $totalTime += ( microtime( true ) - $start );
       unset( $code );
    }
    $linesPerSec = (int) ($cycles*$repeats*$nLines / $totalTime);
    echo "$cycles\t$repeats\t$totalTime\t$linesPerSec\n";
  • Execution Benchmark.
    <?php
    header('Content-Type: text/plain; charset=iso-8859-1');
    $cycles = (int) $_GET['cycles'];
    $repeats = (int) $_GET['repeats'];
    $code = '$i = 0; $j = 0; $t = 0; while ( $i < $n ): $a = "";';
    for( $j = 0; $j < $repeats; $j++ ) $code .= '$a .= "c"; $j++;';
    $code .= '$t += strlen( $a ); unset( $a ); $i++; endwhile; return $t;';
    $instructions = 4 + $cycles * (6 + $repeats * 2);
    $benchmarkFunc = create_function( '$n', $code );
    $start = microtime( true );
    $checkCount = $benchmarkFunc( $cycles );
    $totalTime = ( microtime( true ) - $start );
    echo "$cycles\t$repeats\t$checkCount\t$totalTime\t", (int) ($instructions / $totalTime), "\n";
  • File Access Benchmark.
    <?php
    header('Content-Type: text/plain; charset=iso-8859-1');
    $path = $_GET['path'];
    $content = array();
    $totalSize = 0;
    if( $handle = opendir( $path ) ) {
       echo "Path: $path\nDirectory handle: $handle\nResults:";
       $start = microtime( true );
       while( false !== ( $file = readdir($handle ) ) ) {
          if ( $file !== '.' && $file !== '..' && strpos( $file, '.' ) !== false ) {  // skip the '.' and '..' entries
             $content[$file] = file_get_contents( "$path/$file" );
             $totalSize += strlen( $content[$file] );
          }
       }
       $totalTime = microtime( true ) - $start ;
       closedir($handle);
       $filesPerSec = count( $content ) / $totalTime;
       echo count( $content ), "\t$totalSize\t$totalTime\t$filesPerSec\n";
    }
  • MySQL Benchmark.
    <?php
    header('Content-Type: text/plain; charset=iso-8859-1');
    $n = (int) $_GET['repeats'];
    echo "SELECT COUNT($_GET[col]) AS count FROM $_GET[table]\n";
    $start = microtime( true );
    $db = new mysqli( $_GET['host'], $_GET['acct'], $_GET['pwd'], $_GET['db'] );
    $rs = $db->query( "SELECT COUNT($_GET[col]) AS count FROM $_GET[table]" );
    $count = $rs->fetch_assoc(); $rs->close();
    $max = $count['count']-1; echo "max id = $max\n";
    for( $i = 0; $i < $n; $i++ ) {
       $id = rand ( 1 , $max ); 
       $sql =  "SELECT * FROM $_GET[table] WHERE $_GET[col]=$id\n";
       $rs = $db->query( $sql );
       $row = $rs->fetch_assoc(); $rs->close();
    }
    $totalTime = ( microtime( true ) - $start );
    echo "$n\t$totalTime\t", ($n+1)/$totalTime, "\n";
