This paper describes a macroscopic characteristic of complex systems that can result in performance collapse in seemingly random and undiagnosable ways. In recent months I have had to troubleshoot Severity 1 failures on a worryingly large number of systems and services that we operate for our customers with service penalties (let alone remediation costs) reaching into seven possibly eight figures. Yet our technical teams seem to have a general ignorance of these issues, perhaps primarily because this topic is poorly documented in industry and academic treatments. Nonetheless, recognising the characteristics of this mode of failure is a key factor in both designing in its avoidance in the first place and also in the remediation of a failed system, hence my decision to take time out to write this paper.
Like it or not, a branch of mathematics is core to understanding such performance of computer systems and that is Queuing Theory. You must understand some basic aspects of this to understand how such systems behave. I happened to cover this in my university course, and then worked on a project in my early career modelling and analysing Army Command and Control Systems. Now you may think it somewhat bizarre that I mention such 1970s CCSs which involve hundreds of people talking on a bunch of radio channels when discussing the macroscopic behaviour of complex systems that could involve the interaction of thousands of computers over shared WAN infrastructure. However whilst our complex Service-Oriented Architectures (SOA) today might have time constants perhaps three orders of magnitude shorter, in terms of the models which underpin their mathematics and their macroscopic behaviours, these the systems all largely exhibit the same characteristics.
The next section of this paper provides the bare bones overview of queuing theory that you need to understand, but for those who want to delve deeper I have included a good reference as a starting point (i) for web accessible reference material for those interested, including for example a reference to Jean Yves Le Boudec’s Performance Evaluate course notes — heavy going but fascinating if you are a sad type like me. I then cover some of the real world considerations that you will need to consider in looking at major IT systems, and finally I introduce my twelve rules for attacking the analysis and remediation of such service failures.
Queuing Theory Basics
Queuing theory is the mathematical analysis of systems designed to receive and to process a stream of requests. Central to such analysis is the realisation that we can classify and mathematically analyse many systems which involve people or requests arriving in a system, their being held or queued for a period until they can be processed, their being processed, and finally leaving the system. It applies to banks, traffic jams, and computer systems. Key characteristics include the nature and characteristics of:
- The arrival process. Is it determinate or random? If random then what is the nature of the distribution? A key parameter here is the mean arrival rate λ.
- The service process. Again, is it determinate or random? If random than what is the nature of the distribution? A key parameter here is the mean service rate μ.
- The relative system load or traffic density ρ = λ/μ.
- The queuing mechanism itself. Is it single line, or parallel lines? Is it random? Are there fast tracks?
- The average number of requests queued N is derived from a mathematics of these interactions, and the mean service time T for a request is then derived from the simple and obvious relationship N = Tλ .
There are two cases where the mathematics is straight forward and produces a simple formula for N
- The so called D/D/1 case in which the arrival and service processes are both determinate (and in this case a fixed constant) and the queue is a single line — one in, one out — so N = 0 as long as ρ < 1. However if ρ > 1 then the queue will just grow indefinitely.
- The so called M/M/1 case in which the arrival and service processes are both Markov and the queue is a single line. A Markov process or distribution is one in which the separate arrivals are truly independent and have an equal probability of occurring in any time interval. This could loosely be described as pure random and it gives a negative exponential distribution for the time between events. Here N is a function of ρ alone: N = ρ/(1-ρ) as long as ρ < 1. Clearly if ρ ≥ 1 then there is no solution for N and the queue will just grow indefinitely.
To be honest, the mathematical analysis where λ and μ have more complex distributions or where there are multiple queues, priority schemes, etc. rapidly stalls in a complex mess and you need to resort to the sorts of simulation techniques that I used on my CCS project to perform detailed system characterisations. Even so, queuing systems still have three broadly characteristics with reference to this M/M/1 case:
- The traffic density ρ = λ/μ is the key parameter which determines the average queue length N. As ρ increases, so does the queue length in a characteristic knee bend shape, so that as ρ passes the knee both N and T rapidly increase.
- Real world system designs attempt to drive the queue characteristics towards the determinate case (that is zero length queues when ρ<1) by adding multiple lanes, fast tracks etc. These all have the effect of accentuating the knee and moving it up from the ρ ≈ 0.8 towards ρ = 1.
- The more uniform and less variable the distribution of λ and μ the sharper the knee. On the other hand a clustered and irregular the distribution of λ and μ will straighten the knee, and for a given ρ the queue length N grows faster. Arrival bursts cause a rapid build up in the queue, and the rate at which this decays back towards steady state is again determined by ρ.
The vast majority of real-world queuing systems have this same macroscopic characteristic: the traffic density vs. mean service time curve has some form of knee bend. Below this as the traffic density increases, the number of requests queued N and thus the mean service time T slowly but steadily increases. After passing the knee N and T rapidly explode. How sharp the knee is and where it is actually located on the ρ range can be improved by having multiple queues, priority schemes. It is degraded by bursty distributions. If you can keep ρ to the left of the knee the system works fine, but as soon as it moves to the right of the knee service times rapidly become unacceptable, and so the system behaviour is incredible sensitive to small changes ρ around this knee.
Real World Considerations
Physical queuing systems either simple ones such as a ticket desk or complex compound systems such as a highway traffic grid are easy to observe, and we regularly witness the seemly bizarre behaviours as the equivalent ρ or λ parameters flex over time: the line outside the girls’ toilet at the theatre; the time to go down the freeway and the impact that small things like school holiday or a car breaking down on the opposite carriageway can make. We are used to delays in this physical world.
In a modern computer application / system, you can find hundreds of servers and thousands of PCs running many thousands of processes in aggregate, all in communication in a modern computer network and all exhibiting such queuing characteristics, and response to the vagaries of demand. When such systems are correctly designed, the macro characteristics of the system as a whole should be stable and the overall behaviour should be understandable as an aggregate, in the same way as a physical system. Such good systems are also usually stably self-limiting. For a good example, the designers of IBM mainframe CICS often had engineering or applied mathematics degrees and understood this stuff (or at least enough of them did). CICS applications typically exhibit the sort of behaviour where as the overall usage of the system increases, so does ρ, and internal queues start to build in the system, and hence also the mean service time. As the system response slows, this limits the user population’s ability to load the system and a stalemate is reached. This is just the same as the rides at the amusement park: there is only a fixed amount of ride-time capacity. The more people at the park, the more time they end up standing in line.
However the more complex SOAs that we implement today have indirect and often hidden shared constraints which cause coupling between the various service components: memory and I/O in each system; the LAN on each site; the WAN linking the enterprise together. Worse, as I describe later, they also respond to queue build-up in such a way as to decrease the system capacity. The net effect of this is that the mathematics of chaos theory seems appropriate to characterise such systems.
- Today we have been lulled into complacency by Moore’s Law type growth in capacity, and we think that throwing a supposedly near infinite resource at an application means that ignore capacity planning issues. We forget that sub-systems, however complex, can still reach capacity constraints. When any one subsystem does reach such constraints, then unless the system as a whole is self limiting queues will grow infinitely and the performance of all dependent sub-systems will collapse.
- This is because when any individual subsystem does move into an operating regime where its queues are growing in an unbounded fashion, it can’t actually let them grow indefinitely. It has to manage the queue overload (explicitly or by some default) either by dropping requests, by rejecting them, or by passing a stop request to any upstream clients. This may work reasonably well in an environment which implements assured delivery of requests, end-to-tend flow control and a coherent time-out strategy.
- In the case of complex distributed systems and Service Oriented Architectures (SOA), you need to understand the interplay of stateful and stateless SOA systems and overall transactional integrity on the overall system architecture. If no flow control mechanisms exist between the subsystems to meter inbound demand then the overloaded system ultimately has to adopt a strategy of dumping inbound requests. This means in turn that the requesting sub-systems cannot assume assured delivery of service requests and must implement some form of time-out and retry logic. Thought also has to be given to ensure that transactions are correctly classified as idempotent or at-most-once. Get any of this wrong and your system could start to fail in weird and subtle ways — but only following the overload, which means that if you haven’t predicted and created this overload scenario in system test then you will be executing untested code paths and could thereby expose further errors.
- When this happens, the correct action is to identify the true bottleneck and analyse its characteristics. Whatever you do until you drive the traffic density ρ for this subsystem materially below its knee point, you will only change the nature of the failure mode and not remove it. You can do this by understanding the nature of the demand on the system and restructuring the wider system to reduce the mean arrival rate λ or by performing localised scaling to increase the mean service rate μ. Do anything else, and you will only fool yourself. Yes, you may be seeing weird errors relating to time-out, resource leakage on untested code paths, and other artefacts from he overload. Yes, they shouldn’t be there in a properly designed system, but they are also consequential failures and not the root cause. Find and remove the root cause and you will fix the service. Fixing the consquentials is a ‘nice to have’.
- Also be very careful about getting into hunt the knee games. I have often come across incident response teams which waste valuable time tweaking system parameters such as queue sizes, number of queues, queue priorities, etc. because for a very particular load threshold this moves the system from unstable to stable operation. This is usually a waste of time: the point is that the system is operating in the region of the knee, and this is not a good place for it to be.
- Once sub-systems start retrying service requests, this can result in a step increase in service requests, which in turn increases both λ and therefore ρ for the overloaded system. As a result otherwise transitory overloads can then form a latching overload state.
- You also need to be aware that applications which include retry logic may well (correctly from their point of view) mask out underlying subsystems that are failing, so the system remains stable in the sense that the transactions that complete do so correctly and that transactions that do fail do so with the expected failure codes. However if the dominant characteristic of the failure mode is that the overall mean service time T has increased by significant factors, and well beyond any decrease in the overall mean arrival rate λ, then this is a solid indication that, somewhere in the mesh of sub-systems that comprise the whole system, at least one sub-system has gone into queue overload.
My experience across the industry when I talk to application and system developers from across our industry is that they show a woeful lack of understanding of any of this. Too many have been brought up in the world of the GUI, rapid development and infinite processing resource at their finger tips. So for the remainder of this section let me give some real-life examples that I have come across:
- In a major manufacturing customer, we added some additional overnight processing to their SAP system. Despite having carried out an end-to-end capacity calculation, the system overloaded in test. Ethernet bandwidth had been identified as a potential capacity point, and the Sun Starfire domain that was overloaded had had an additional Gbit Ethernet card added to support this extra load. This system performance still collapsed. It transpired that the Ethernet device driver for the version of Solaris used in production was single threaded, and therefore the Ethernet bandwidth was actually not limited by the aggregate capacity of the hardware or network, but by the number of packets per second that the single SPARC processor could process, hence the bottleneck.
- On a separate incident in the same application suite the system used a large Oracle RDBMS. The Oracle RDBMS uses a dynamic execution plan optimiser, and during normal running the evolving pattern of load caused Oracle to switch execution plans on a frequently used query, with an order of magnitude increase in run-time.
In both cases the externally observable symptoms had no direct correlation to the sub-system overload eventually identified.
- In another example, a major service outage occurred on a service for a major client. The system had been correctly capacity sized for the predicted load. However a minor change was requested by the client to move a query to perform a wild-card search for all new opportunities in your area today onto the entry screen, and this resulted in a step change in the frequency of this transaction. Unfortunately the wild card search returned a results block that was larger than a preconfigured Unix threshold parameter for memory based queuing of requests in the Tuxedo TPM, and therefore these requests were spooled to disk. Because the throughput of disk spooled requests was approximately 1000 times less than that of memory based ones, the mean service rate of the query engine collapsed, and thus the back end system could only process perhaps 50% of peak demand, with mean service times increasing from the normal few seconds to tens of minutes.The upstream client systems had a time-out that was set to enable recovery from local site overload, but this was shorter than the query engine time-outs. As a result the edge systems were abandoning over 80% of the transactions in process (without clearing them down) before their being returned to ultimate user, so the user experience was not only that perhaps only 10% of requests were completing successfully and then with a response of minutes rather than seconds.
In this case, increasing the Unix parameter to avoid disk spooling and rebalancing the Tuxedo queue structures to avoid transaction discards for normal processing peaks restored the service.
- In the case of a managed print service failure, the primary culprit was the last minute addition to replace the use of the Windows send message functionality because of a known Denial of Service vulnerability here. The workaround involved adding a hidden poll loop to the user’s web interface, which was entirely satisfactory from a functional point of view, but it unfortunately resulted in a fifty-fold increase in load on the back-end cluster, with most resource usage scaling accordingly. Unfortunately the system sizing calculation was done before this change.By the time that 25% of the final user population have been migrated to the system, file-writes to the SAN were running at some 200 files per second. Now the performance of SANs for large volumes of small file writes is similar to that of direct attached disk, because such writes ultimately do need to be written through to the appropriate SAN RAID set. However by the 25% user population the system was only just about coping, except when an hourly I/O hog batch process kicked off on another application system (which also happened to have its SAN device mapped onto the same SAN RAID set). As a result the system collapsed on the hour, every hour for no discernible reason. This one was fixed by moving a batch of temporary session files into a temporary folder on the C drive and implementing a lazy write strategy (the files were often read and then overwritten with the same content) to drop the write load to SAN dropped by a couple of orders of magnitude.
- However when the system was then cranked up to full load in the labs, another bottleneck appeared at an equivalent load to 30% of the user population in the ISS and a SOA worker subsystem. This was attacked by adding processing logic to bypass processing for over 95% of the polls which could not return a result anyway.
These two changes combined impacted some 6 source modules and perhaps 100 lines of code and configuration change, but the net result was to decrease the overall load on the system by over an order of magnitude. If we hadn’t rigorously focused on eliminating the root cause, we could also have made dozens of other tweaks amounting to perhaps a thousand of lines of change for another odd 5% here and there — and for absolutely no point: a factor of 10 is good enough, and the smaller the change, the easier it is to regression test.
- In a recent desktop deployment, the PC build process had a particularly time consuming and inefficient script for resolving the address of the local “file and print” server which also acted as a local virus database stager. The deployment teams developed a workaround to speed up deployment, and that was to refresh the IP configuration whilst the PC was hanging in this script (which had a nice side-effect of crashing the script). The only problem was that in such circumstances the script error handling failed-safe to point at the central virus management server, and as a result over half the PCs in the estate pointed to this server. The consequence of this is that when the users logged in following a virus database update half the PCs initiated their update over the WAN not the LAN and this caused a lot of local WAN tails to max out in capacity. Hence the symptoms related to network overload.
Again based on my experiences of across our many systems, some other specific examples of subsystems than can overload include:
- Internal computer subsystems may be the actual constraint. Physical examples here include the I/O sub-systems, and individual hot I/O devices such as Hard Disk Drives (HDD) and Network Interface Cards (NIC). The basic system I/O counters giving rates and queue lengths can help locate the problem here.
- As a general consequence of the rapid increases in processor power, the CPU processing capacity itself is only rarely the limit, because networks, I/O, etc. tend to max out first. Yet this still does occur, particularly in CPU intensive applications such as database engines. Traditionally these would have been viewed as largely I/O intensive but the development of large memory caches and increasing use of client side scripting technologies that tend to generate dynamic SQL at run-time which needs to be compiled on per transaction basis has swung the balance towards processing power. One standard mechanism for mitigating the impact of queuing is to scale out the system capacity by creating many service processes (or threads which in this case are just lightweight processes). Multiple threads might significantly collapse end-to-end delay by helping to pipeline and to parallel up processing. However, ultimately an n-processor machine can only truly have n threads active in compute mode, so there is a turning point after having too many concurrent threads that are waiting to compute actually degrades performance. System CPU, Context Switching and Paging counters can help point these out, as well as some of the more esoteric and vendor specific counters which highlight cross CPU contention for semaphores and shared memory.
- Distributed systems which share processing over a number of systems often need to use Distributed Lock Management (DLM) to ensure consistency and update integrity across the overall system. DLM relies on synchronous RPC hand-shakes between systems. Hence end-to-end communications latency constrains EnQ / DeQ rates and creates a throttle that is not directly observable through Windows performance counters, since the Windows OS family does not implement a proper DLM as a core service. DLM constrained systems are particularly hard to diagnose in the Windows world.
- Database management systems (DBMS) often implement internal and proprietary DLMs to mange access to shared resources such database tables. Deadlock scenarios and ‘hot’ tables which causing lock cascades can act as a classic bottleneck on database systems. The approach to diagnosing these is common to all DBMS. Unfortunately the toolset tend to be proprietary and vendor specific.
- Storage management systems such as OS file systems and DBMS use a variety of techniques to handle growth in files, tables, indices, etc. involving caches and caching, buckets, maps, overflow areas, etc. to minimise the number of physical I/Os to do logical I/Os. However, if the operational staff are not carrying out the correct routine maintenance to defragment file systems, reorganise indexes, etc. then the ratio of logical to physical I/Os will slowly degrade over time. As a result even if the underlying logical transaction rate is constant, the effect service rate μ of the file system or RDBMS will slowly degrade over time leading to a hidden increase in ρ.
- Also watch out for unobserved applications changes. A good example here is a routine change to the Anti virus subsystem which results in a file type being moved into the category that merits AV scanning. Boy, does this slow down your logical I/O performance.
- As with the Tuxedo example above, other internal OS constraints or parameters in such system thresholds may trigger secondary constraints, for example poor memory management may cause excessive paging which then manifests itself as HDD I/O overload on the devices which contain the system pagefiles.
Two final factors which relate not so much to the systems themselves but to the user load and what operations itself does:
- You may have intrinsic variability or uncertainly in λ. Classic batch and transactional systems lent themselves to straight forward volumetric study. For example, a system might be designed to support 300 call handlers each with a standard working pattern. This might give statistically predictable utilization of the system, and a 95 percentile load that might only be a factor of 50%, say, over average loading.One of the major design challenges of portal technology which provides self service to an entire enterprise or consumer community is that the maximum potential population is, say, 20,000 with an average usage of 10 minutes per day. External factors might lead to large pulses in loading (You need to submit your expense claim by the 15th to get in this month’s pay.) As a result the 95 percentile in such scenarios might be a factor 5 over average load, the 99 percentile a factor of 15 over the average. Such variability has dire consequences for system stability if your architecture isn’t intrinsically robust to overload. In this last case even if you’ve oversized the system by a factor ten, this intrinsic variability will still mean that the system will regularly overload and die.
- You should consider application and business changes to flatten demand and reduce the variability of λ. For example, why do you think that mortgage companies now ask you what day of the month you want to pay your mortgage? Is it for your benefit? No, it enables them to spread processing demand over the month.
- If your system melts down in overload, you must implement a demand-side surge suppression function. (It’s better to display a put-off for a percentage of users, e.g. “System is busy; please try later” than let the system die altogether.) The threshold needs to be agreed between the business and the IT service provider based on a X percentile analysis which is based on a trade-off between the oversizing factor and the frequency and duration of it being in operation: a trade-off between needing over-capacity and impact on the user-experience.
- You need to beware of the Heisenberg Effect (or the false common derivation thereof) — that is the act of observing affects the experiment itself. To locate hotspots, you need to instrument. If you are not careful, then the act of instrumenting itself can change the behaviour of the system. A good example here is that the monitoring data has to be written somewhere. So you need to be careful if you are investigating the I/O sub-systems for example. I’ve also known of live production systems crashing because of log files filling the system disk.
Through these twenty odd points I have tried to elaborate a range of factors each one of which could account for say a 20% swing in ρ — a swing which enough to move a system from having perfectly acceptable response to one that is almost dead as far as the user experience is concerned. No matter what our executives or the customer tells you, this can and does regularly happen. Ideally the development teams should be experienced enough to avoid many of the pitfalls by design or by proper user-load testing. We should also have some defence in depth by having an effective operational capacity planning function to monitor live systems and detect potential overload before failure (but that’s a subject for another whitepaper). However both of these goals are hardly foolproof, so we will still have many systems and situations where we first become aware of the potential performance collapse when the system is already dead and the users screaming. So I lastly want to focus on are some basic ground rules for understanding how to detect and to diagnose such failures when they do occur.
Ellison’s Twelve Rules
- If Tnow >> Tnormal then your system probably has a hotspot queuing overload.
- Make sure that you have the system correctly baselined and instrumented before starting the search. With tedious regularity, I’ve found support teams distracted by the fact that panic changes were made to try to correct things, or that identical systems had in fact different patch levels or configuration changes. This doesn’t necessarily mean that you need to get them identical — just don’t assume that they are identical unless you’ve checked thoroughly.
- Focus first on finding the true point of overload. Then start working back systematically to the root cause.
- Don’t get sidetracked trying polish tiny lumps on the knee in the ρ : N curve, or hunting down secondary errors. As I pointed out in some of my examples, you can get all sorts of weird secondary errors. They are secondary. Unless they point to a clear primary overload note them and ignore them for now.
- Whatever the (irrational) pressure from your own executives and your customer don’t make “try it and see” changes to the system. Messing around with tuning parameters can often shift the shape of the overload and tweak the failure symptoms. Don’t. The only thing that will really fix this is finding the overloaded subsystem and targeting its ρ.
- Follow proper release management and change control in everything you do. How many times have I heard It was an emergency, so we had to do something quickly? No, we had to get the live operation back as soon as practical. Doing anything which doesn’t achieve this is at best a total waste of time; it could make thing worse.
- When you have identified the point of overload and tracked back to the root cause, look carefully at the business factors which are driving the mean arrival rate λ and work out what is the sensible maximum for λ over the next X months . Unless the design team are just plain sloppy and have not done their basic capacity planning, my experience is that the usual underlying reason for the overload is that the subsystem is under a load that it was never designed for. Often the customer loading was not correctly forecast or some other problem led to a small last minute change that missed its proper impact analysis. If you are going to do a quick fix you need to make sure that the fix will work until you can implement a properly planned development programme. One possible solution may be for the customer to change upstream business processes to drop λ.
- Work out the minimum impact change to get traffic density ρ to be no more than 0.8 for the X month window and implement it. The immediate objective is to recover a stable service. The least risk way of doing this is by minimal change. Leave everything else alone.
- Don’t assume. Verify. You will usually find that at least one assumption made along the way was in fact false, and that this false assumption played a key part in creating the error conditions. I am sorry to say the answer here is straight forward: never trust received assumptions as a given fact. Always check or seek collaborative evidence.
- If you need to make multiple changes, then try to implement the changes only one change at one time. Make sure that wherever possible you have a backup plan. I realise that things are usually very urgent, but only very rarely do two subsystems go critical at the same time. If you make too many changes at once, what happens if one counteracts another?
- Be prepared to iterate the above process – for complex failures. The time that you often need to iterate is when you can’t get back to the true root cause because some upstream service is overloading and masking it. Then you need to clear down the first because you to can’t see the wood for the trees.
- Make sure that you do a full Root Cause Analysis (RCA) to document all the findings and lessons learnt. Then make sure that you agree a programme to sentence them whilst the pain is still in the minds of the sponsoring executives. As part of the diagnosis and RCA you will undoubtedly uncover all sorts of things that might need fixing (e.g. time-outs and invalid failure modes). You may well have found out underlying architectural flaws. These all need to be either acknowledged as known bugs that are not going to be fixed or by scheduling the work to fix them in a timely manner.
This is an original work by Terry Ellison