Stress testing IIS
As promised in my first post, this post is about stress testing IIS.
The idea behind IIS stress testing is to bombard the server with many simultaneous requests of different kinds so that CPU usage goes to 100%. This helps detect memory leaks, access violations and deadlocks.
There is a fine line between stress testing and reliability testing. In stress testing, the response to each individual request doesn't matter, as long as we verify that a few of the responses are as expected. As I said, the main aim of stress testing is to keep the CPU pegged without worrying too much about the responses. In reliability testing we also hit the server to keep it busy, but not as hard, and we validate each and every response coming back from the server to make sure it is as expected.
Coming back to stress: while the machine is being bombarded with requests, we make a few simple requests called Stress Verification Tests (SVTs) every 15 minutes and validate them to make sure the server is responding as expected. We also run a script in parallel that fetches various performance counters, such as memory usage and handle count, and validates that they are within an acceptable range.
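To make the two checks above a bit more concrete, here is a minimal sketch in Python. It is not our actual harness: the names SvtCase, svt_passes and counters_out_of_range, the counter names and the limits are all made up for illustration, and fetching the response or the counters is left out on purpose.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class SvtCase:
    """One Stress Verification Test: a simple request with a known good answer.

    Hypothetical structure; the path and expected values are illustrative.
    """
    path: str
    expected_status: int
    expected_substring: str


def svt_passes(case: SvtCase, status: int, body: str) -> bool:
    """Return True if the response matches what the SVT expects."""
    return status == case.expected_status and case.expected_substring in body


def counters_out_of_range(
    sample: Dict[str, float],
    limits: Dict[str, Tuple[float, float]],
) -> List[str]:
    """Return the names of counters that are missing or outside [low, high]."""
    violations = []
    for name, (low, high) in limits.items():
        value = sample.get(name)
        if value is None or not (low <= value <= high):
            violations.append(name)
    return violations


if __name__ == "__main__":
    case = SvtCase("/svt/hello.htm", 200, "hello")
    print(svt_passes(case, 200, "<html>hello</html>"))

    # Assumed limits, purely for illustration.
    limits = {"private_bytes_mb": (0, 500), "handle_count": (0, 2000)}
    print(counters_out_of_range({"private_bytes_mb": 120, "handle_count": 2500}, limits))
```

In a real setup the status and body would come from an HTTP client hitting the server every 15 minutes, and the sample would come from whatever collects the performance counters. Keeping the checks as plain functions makes them easy to run on their own.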
We have set up the server and its multiple physically separate clients on the same rack so that we get enough throughput. To begin with, we create multiple sites, applications and app pools with a variety of settings on the server. Then we start our script to monitor performance counters. We make a single pass of SVTs before starting stress on the server to make sure the configuration is good. We then start the clients, which use WCAT (available for download on iis.net). WCAT is the Web Capacity Analysis Tool, which bombards the server using multiple threads. I will try to write about WCAT in a separate post.
We have multiple types of stress runs: some run overnight and some run for 3 weeks. When a run doesn't hit any failure for its whole duration, we consider it passed. Most failures are caught in one of three ways: an SVT fails, there is an access violation (AV) in the worker process, or a performance counter goes outside its expected range.
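The pass/fail rule can be written down in a few lines. The sketch below is only an illustration of the rule as described here: the FailureKind and Failure types and the function names are hypothetical, not part of our tooling.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable, List


class FailureKind(Enum):
    """The three ways a failure usually gets caught during a run."""
    SVT_FAILED = "svt_failed"
    ACCESS_VIOLATION = "access_violation"
    COUNTER_OUT_OF_RANGE = "counter_out_of_range"


@dataclass(frozen=True)
class Failure:
    kind: FailureKind
    detail: str  # free-form description, e.g. which SVT or which counter


def run_passed(failures: Iterable[Failure]) -> bool:
    """A run passes only if no failure was recorded for its whole duration."""
    return len(list(failures)) == 0


def summarize(failures: Iterable[Failure]) -> Dict[str, int]:
    """Count recorded failures per kind, for a quick look at what went wrong."""
    return dict(Counter(f.kind.value for f in failures))


if __name__ == "__main__":
    recorded: List[Failure] = [
        Failure(FailureKind.SVT_FAILED, "hypothetical SVT timed out"),
        Failure(FailureKind.COUNTER_OUT_OF_RANGE, "hypothetical handle count too high"),
    ]
    print(run_passed(recorded))
    print(summarize(recorded))
```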
SVT failures can be quite challenging to interpret, because we might have overstressed the server to the point where there is no way it can respond to the SVT request. Such failures are false alarms, and there is not much anyone can do about them.
For nightly stress runs, we have 100+ servers, and each server has one or more clients. Managing all these machines is a big challenge. We have a nice logging mechanism that logs the status of these runs to a database, with a web application as the UI. We also log a few interesting performance counters to the database and view graphs of them in the web application, which uses Silverlight for charting. Worker processes run under a debugger with some special debugger extensions that log failures directly to the database. Everything is automated!
Why do we need 100+ runs every night? The stress client uses different kinds of requests for different runs. Some runs use CGI requests, some use plain HTM requests, some use ASP, some use ASP.NET, and some use a combination of these. If one run that uses HTM requests fails, it does not follow that all runs will fail. Different machines have different hardware configurations, so one might fail while others pass. We also vary settings in the IIS config, which may give us different results. For example, if FREB is enabled, the server may get more stressed because the worker process is doing more work per request.
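To see how quickly the number of runs grows, here is a small Python sketch that crosses request types with one config setting. The labels and the build_run_matrix helper are made up for illustration; our real matrix also varies hardware and many more IIS settings.

```python
from itertools import product
from typing import Dict, List, Sequence

# Illustrative labels only: the request types mentioned above, plus a mixed run.
REQUEST_MIXES = ["cgi", "htm", "asp", "aspnet", "mixed"]
# One example of a config setting that changes how hard the server works.
FREB_SETTINGS = [False, True]


def build_run_matrix(
    request_mixes: Sequence[str],
    freb_settings: Sequence[bool],
) -> List[Dict[str, object]]:
    """Return one run description per (request mix, FREB setting) pair."""
    return [
        {"request_mix": mix, "freb_enabled": freb}
        for mix, freb in product(request_mixes, freb_settings)
    ]


if __name__ == "__main__":
    for run in build_run_matrix(REQUEST_MIXES, FREB_SETTINGS):
        print(run)
```

Five request mixes and a single on/off setting already give 10 runs. Add a few more settings and hardware configurations and you get to 100+ without much effort.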
Stress testing is overwhelming, but I hope readers of this post got an idea of what it is and the challenges we face. In the next post, I will write more about WCAT.