Tuesday, March 13, 2007

Response Time Distributions, IIS Log Files, and the question of the Missing Events

Over the last year I've been involved in a number of investigations attempting to find bottlenecks in systems consisting of clients, web service hosts, and databases (usually containing application logic in addition to data). The details of each system's implementation is not especially important to this discussion because what I want to do here is just relate one of my recent experiences regarding measurement of response times. You might like to check if you get similar behaviour.

Firstly, let's just describe the typical situation I find myself looking it. It's very generic, I'm sure the same thing will apply to you. I usually have some client systems accessing a service layer hosted in IIS6 talking to a database server (usually with significant embedded application logic).

The Problem
The problem (or my lesson in this case) is how to interpret the numbers you get from the IIS log files on the web service hosts. These files give you HTTP request duration and the arrival time of the request. Now what happens when you naively plot a distribution of the the duration (ie request execution time)? You might expect a nice symmetrical peak centered on some value, or you might get something such as the following.

I found that I was consistently getting this sort of shape across different types of requests. This isn't really all that unexpected. Each URI in the log file corresponds to a web service against which a number of web methods may be called. The web methods may have significantly different response times so the graph of the service call is really just a summation of all the individual web method calls. And in fact we've now implemented duration timing on each individual web method call, and in fact we get a much simplified distribution centred round one peak.

Anyhow, while I was looking into the double peak I decided to look at the arrival rate and compare that to the duration of the call. Theoretically you should get a Poisson distribution for random, uncorrelated events and on those occasions when there were many simultaneous arrivals you'd also expect the response time to slow down (although whether this was linear/non-linear is another question).

So, I looked for a period of time during which we have fairly constant activity and chose 2 hours in the middle of the day.

You can see that there's a nice distribution with the expected shape centred on 4 arrivals/second. Of course the IIS log files only record data when a request actually arrives. Looking at the bar graph you'd therefore naively expect about 300 one second intervals over the two hour period during which no requests arrived.

And this is where I got a surprising result.

So, it appears that the IIS HTTP arrival time is not accurate. I think that what I'm seeing here is that IIS is already queuing the requests up for processing - presumably because processing downstream is taking too long.

If you get anything like this, have seen it before, or have a bit more knowledge of what is going on I'd love to hear about it.


web2print guy said...

Same problem. Include any file upload and you get an even more skewed picture. I stopped relying on IIS logging for app profiling. What I do is an in-memory log that is flushed into a file on a schedule. The whole thing runs on a cluster and any given session can be spread accross any number of machines. An attempt to reconcile the app log with ISS by time proved to be inaccurate and I had to rely on other request parameters like the query string and user IP.

Then I just stopped using IIS logs for profiling and use the app logs instead. It's a bit of an overhead on the resources, but it's worth it. The only trouble is it's hard to track other resources that are served by IIS directly.


Bohdan Szymanik said...

Thanks Max, we've done the same - we're now using the times we obtain through an async call out to an event auditing database. We record the arrival time and processing completed time.