More on optimising PHP applications in a Webfusion shared service

This article is a follow-up to Performance in a Webfusion shared service and guidelines for optimising PHP applications.  I decided to have a look at general postings on PHP performance optimisation, and having used Google to wander around the Internet and blogosphere, I was quite frankly stunned by some of the absolute rubbish that I found out there.  I stand by what I said in that previous article, but given some of the advice floating around and the total lack of quantitative measures underpinning it, I felt that I just had to do some quick benchmarks to refute the more ludicrous claims.  Examples abound, but the post 50+ PHP optimisation tips revisited is one of the better ones within this misleading school.  The blog post “Micro” Optimizations That Matter is far closer to my view.

Let me quantify my views with some ballpark figures.  I need to emphasise that the figures below are indicative.  I have included the benchmarks as attachments to this article, just in case you want to validate them on your own service.

I generated these figures by running the scripts below on my Webfusion shared service when it was lightly loaded.  My 2-year-old laptop, running Ubuntu with its own LAMP stack, is ~50% faster and my development server is ~100% faster.  Since my shared service runs on an 8-core Intel E5410 Xeon server, this slowness is down to contention with other users, which occurs even on a fairly idle server.  As the load and the contention on the server increase, the latency per request grows and the effective throughput collapses.  (See my article Performance Collapse in Systems where I discuss this in more depth.)  The net result is that these ratios will vary somewhat with load.

However, in rough terms, if the file system cache isn’t primed and I take the time needed by the PHP interpreter to read in a single PHP source file as a baseline, then in that same time it can compile ~10,000 lines of source code, execute ~500,000 PHP instructions, or issue ~20 MySQL statements.  If another request hits the same code path within a few minutes, then the file system cache will probably be primed, the read becomes far cheaper, and the equivalent figures drop to roughly 150 lines of source code, ~10,000 PHP instructions, and ~0.5 MySQL statements.
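If you want a feel for how these ratios work out on your own service, a sketch along the following lines is enough.  To be clear, this is an illustration rather than one of the attached benchmark scripts, and the MySQL connection details are placeholders:

```php
<?php
// Illustrative order-of-magnitude comparison (not one of the attached
// benchmark scripts).  The MySQL credentials below are placeholders.

$iterations = 100000;

// 1. Cost of executing trivial PHP statements
$t0  = microtime(true);
$sum = 0;
for ($i = 0; $i < $iterations; $i++) {
    $sum += $i;                          // one simple statement per pass
}
$perStatement = (microtime(true) - $t0) / $iterations;

// 2. Cost of reading a PHP source file from the file system
$t0 = microtime(true);
file_get_contents(__FILE__);             // may hit or miss the file system cache
$fileRead = microtime(true) - $t0;

// 3. Cost of a trivial MySQL round trip
$db = new mysqli('localhost', 'user', 'password', 'test');
$t0 = microtime(true);
$db->query('SELECT 1');
$mysqlCall = microtime(true) - $t0;

printf("statement %.2e s, file read %.2e s, MySQL call %.2e s\n",
       $perStatement, $fileRead, $mysqlCall);
printf("file read ~ %.0f statements, MySQL call ~ %.0f statements\n",
       $fileRead / $perStatement, $mysqlCall / $perStatement);
```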

Of course PHP statements can compile to multiple PHP bytecode instructions; code density in a source file can vary; and you can die on a single SQL statement if you haven’t got the correct indexes in place or are using a dumb execution plan.  But it is surprising how the weak law of large numbers smooths much of this variation out.  The main point that I want to emphasise here is a comparison of orders of magnitude, not the odd 50% error due to variations in code density and the like, or the subtle differences between one system and another.  From these data, I can suggest some simple guidelines:

To put this all in context, I did an strace of my blog index page script on my laptop. An analysis of this gives the following broad times:

So the actual execution of PHP opcodes accounts for only a few percent of the total script execution time, and what you do here in the way of optimisation is largely irrelevant.  The two big delays that you do have control over are (i) the number of physical I/Os needed to run your script, which is largely determined by the number of files that have to be read, and (ii) the PHP interpreter start-up overheads, which you can only avoid by using HTML caching, as I describe in my next article.
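One quick way of seeing how exposed a given page is to the first of these is to log how many source files each request actually pulls in; on a cold file system cache every one of them is a potential physical read.  A minimal sketch, assuming you have a front controller or common include where you can hook it:

```php
<?php
// Log the number of source files this request loaded -- each one is a
// potential physical read when the file system cache isn't primed.
register_shutdown_function(function () {
    $uri = isset($_SERVER['REQUEST_URI']) ? $_SERVER['REQUEST_URI'] : '(cli)';
    error_log(sprintf('%s included %d files', $uri, count(get_included_files())));
});
```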

An anecdote about algorithmic optimisation

Over a decade ago, my project team was implementing a system for one of the major oil companies, and over the course of the project my main client had become a good friend.  On one visit to his site, I found him battling with a large and expensive non-linear optimisation system that his company had funded.  They had just flown in a consultant from the developer, who had spent a week getting the runtime down by about 25% from ~2 days of computer time per run for the benchmark test case.  This was still far too long to be usable and they were considering cancelling the entire programme.  He knew that I had a reputation in my company for being an optimisation guru, so by way of a personal favour, he asked me to spend a Saturday afternoon seeing if I could speed the code up.

I did a quick execution profile and saw that the application was spending over 95% of the time in a single loop in one module.  When I looked at the source code, I could see that the developer had “improved” the sort of a huge in-memory array by adding short-circuits to its inner-most loop.  The code was a bubble sort, possibly one of the slowest sorts documented.  So I ripped out all the crap to slim the inner loop down from ~100 lines to ~20 lines, and then added the extra five lines needed to convert it from a bubble sort to a shell sort.  Why?  It was a quick and simple ‘peephole’ change, and because the underlying data structures were FORTRAN sparse arrays, it wasn’t easy to replace the custom code with a call to a standard library sort routine.  However, this moved the runtime for the sort from O(N²) to better than O(N^(3/2)), dropping the total runtime for the full test case down to less than 45 minutes.  Another execution profile revealed that the code was now I/O bound in another loop in another module.  One more code change to replace an expensive I/O operation, and the runtime was down to less than two minutes.  I reckoned that we could have done another round to break the one-minute barrier, but we decided that going from days to two minutes in four hours’ work was enough, and went for some celebratory beers instead.  After all this, changing a few dozen lines out of some 300,000 dropped the runtime from days to minutes, and all for the price of a few beers.  Algorithmic optimisation wins every time.
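The actual change was in FORTRAN against those sparse arrays, but the essence of the conversion is tiny.  Purely as an illustration, a PHP sketch of a shell sort looks something like this; the only real difference from a bubble sort is the shrinking gap sequence:

```php
<?php
// Shell sort sketch: an exchange/insertion sort run over a shrinking gap
// sequence.  The large gaps move items long distances cheaply; the final
// gap of 1 is the familiar bubble-style pass over nearly-sorted data.
function shell_sort(array $a)
{
    $n = count($a);
    for ($gap = (int)($n / 2); $gap > 0; $gap = (int)($gap / 2)) {
        for ($i = $gap; $i < $n; $i++) {
            $tmp = $a[$i];
            // Shift larger gap-spaced elements up until $tmp fits
            for ($j = $i; $j >= $gap && $a[$j - $gap] > $tmp; $j -= $gap) {
                $a[$j] = $a[$j - $gap];
            }
            $a[$j] = $tmp;
        }
    }
    return $a;
}
```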

The Benchmarks

Health warning: these benchmarks are ‘as-is’ quick and dirties.  I haven’t implemented any security countermeasures, so access-protect them or remove them as soon as you’ve completed your benchmarking.

So often when I have been asked to review projects with performance problems, I have found that no one in the development team had a basic grasp of the relative performance of the different sub-systems that contribute to the end-to-end response that users experience.  As a result they wasted a lot of effort “optimising” components that are quite frankly irrelevant to overall performance, while missing the key ones entirely.  These benchmarks aren’t intended to answer questions like “is system A 5% faster than system B?”, but rather “is component X 1, 10, or 100 times faster than component Y?”
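If you just want the shape of the harness without downloading the attachments, something along these lines is all it takes to get those order-of-magnitude answers (again, a sketch rather than one of the attached scripts):

```php
<?php
// Quick-and-dirty timing helper: good enough for "1, 10 or 100 times?"
// comparisons, and nothing more precise than that.
function bench($label, $fn, $repeats = 10000)
{
    $t0 = microtime(true);
    for ($i = 0; $i < $repeats; $i++) {
        $fn();
    }
    $elapsed = microtime(true) - $t0;
    printf("%-20s %10.2f us per iteration\n", $label, 1e6 * $elapsed / $repeats);
}

// Example: compare a pure in-memory operation against a file system stat.
bench('string concat', function () { $s = 'abc' . 'def'; });
bench('file stat',     function () { clearstatcache(); stat(__FILE__); });
```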