What the “Failed Requests” counter in ARR really means
While troubleshooting an intermittent performance issue recently, the question came up: what does the “Failed Requests” counter in the Monitoring and Management feature of Application Request Routing (ARR) actually mean?
For example, does a failed health check cause the counter to climb? What about 404 or 500 status codes? What happens if all nodes are unhealthy and a 502.4 error is thrown?
I didn’t know the answer to all of these, so I set out to find out.
Microsoft’s Answer
First, here is what the official documentation says:
Failed Requests
Displays the number of requests that failed, including requests that are a result of a connection error or a status code that matches a live traffic failure code.
That’s fairly helpful, but you have to read between the lines to know for sure.
Types of Errors
I would categorize errors into three categories for sites handled by ARR:
- those that fail in ARR
- those that hard fail on the web server
- those that soft fail (very slow, or timeout)
Obviously a slow page that still works won’t trip any counters, so I won’t worry about that.
ARR itself will serve up one of two different types of errors:
- 502.3 – timeout on the page requested. Default timeout is 30 seconds.
- 502.4 – no web servers are available to take your request. i.e. the health test failed on all nodes, or all nodes have been manually disabled.
Additionally, the web server can send back a response that is passed right through to the user. Four common status codes are:
- 200 –> Success. Everything is good
- 302 –> Found. Basically a temporary redirect.
- 404 –> Page not found. That says it all.
- 500 –> Server error. Usually a code/application related error.
ARR’s “Failed Requests” Counter
After testing, here is what I concluded, which lines up with Microsoft’s documentation:
| Scenario | Effect on the “Failed Requests” counter |
| --- | --- |
| 502.3 Timeout | This will increment the “Failed Requests” counter. |
| 502.4 No healthy servers available | This will not increment any counter, including “Current Requests”. It’s a failure before it gets to ARR’s Monitoring and Management stats. I tested both with all servers manually disabled and with all servers marked as unhealthy by the health test. The results were the same. |
| Health test | The “URL Test” will not change Failed Requests (or any requests for that matter). |
| 500 status code from web node | This will increment the “Failed Requests” counter using the default settings. (more on this below) |
| 404 status code from web node | This will not increment the “Failed Requests” counter with the default settings. (more on this below) |
Going back to Microsoft’s documentation, they say this: “or a status code that matches a live traffic failure code.”
In ARR’s Health Test, there are 2 types of tests. The first is the URL test, and the second is the Live Traffic Test.
The URL test will make a call to your server at specified intervals and mark a server as unhealthy if it doesn’t receive a valid response. It will bring it online again after it receives a successful status.
The Live Traffic Test watches the traffic on the way through and can mark a server as unhealthy when it sees too many bad responses within a set timeframe.
By default, neither is set. I highly recommend always setting the URL Test. However, be careful with the Live Traffic Test, because it’s possible for someone to find a page that is throwing a 500 error, hit it aggressively, and take all of your servers out of rotation. It’s an easy DoS attack. Additionally, if you have the Live Traffic Test enabled but don’t have a URL Test set, then ARR won’t know when to mark the server as healthy again, since a server that is out of rotation has no live traffic to check. So, 2 rules of thumb: A) don’t use the Live Traffic Test unless you are sure you need it, and B) if you do use the Live Traffic Test, never use it without also using the URL Test.
The Live Traffic Test is disabled by default since the Failover period (seconds) is 0. With that at zero, the live test doesn’t mark a server as unhealthy. However, it is still used for the Failed Requests counter.
Notice that Failure Codes defaults to 500-. That means that all status codes 500 and higher are considered failures.
For testing, I dropped that to 400- and hit non-existent pages (404 status code) a few times. The Failed Requests counter climbed exactly along with my tests.
Interestingly enough, if Failure Codes is set to 999-, a 502.3 error still increments the counter. So a 502.3 error on the ARR node is an exception and will always increment the Failed Requests counter.
So, the Failure Codes value determines which status codes from the web server will increment the Failed Requests counter. By default it’s status codes 500 and greater.
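To make the range idea concrete, here is a minimal Python sketch of how a value like 500- could be matched against a status code. This is only an illustration, not ARR’s actual code. The function name `matches_failure_codes` is hypothetical, and the closed-range and comma-separated forms are my assumption; the only form that appears in my tests above is the open-ended one (500-, 400-, 999-).

```python
def matches_failure_codes(status: int, spec: str = "500-") -> bool:
    """Return True if `status` falls inside any range listed in `spec`.

    Assumed syntax (only the open-ended "500-" form comes from the article):
      "500-"     -> 500 and higher
      "400-404"  -> 400 through 404 inclusive
      "503"      -> exactly 503
    Several entries may be separated by commas.
    """
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas or an empty spec
        if "-" in part:
            low, high = part.split("-", 1)
            lower = int(low)
            # An entry such as "500-" has no upper bound.
            upper = int(high) if high.strip() else None
            if status >= lower and (upper is None or status <= upper):
                return True
        elif status == int(part):
            return True
    return False


if __name__ == "__main__":
    for code in (200, 302, 404, 500):
        print(code, matches_failure_codes(code))          # default "500-"
    print(404, matches_failure_codes(404, "400-"))        # lowered threshold
```

With the default value, a 404 does not match and a 500 does; with 400-, the 404 matches too, which is the behavior I saw in the counter.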
Conclusion
Microsoft says it like so:
Displays the number of requests that failed, including requests that are a result of a connection error or a status code that matches a live traffic failure code.
To give a more verbose answer, I’ll conclude with the following:
The Failed Requests counter in ARR will increment if there is a connection or page timeout while ARR waits on the web server (502.3), or if the web server returns a status code that matches the Failure Codes value in the Live Traffic Test (500 and higher by default). Health tests and “no servers available” errors (502.4) do not update the counters.
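The same conclusion can be written down as a small decision function. This is a rough Python model of the behavior I observed, not ARR’s implementation. The names `Outcome`, `increments_failed_requests`, and `failure_floor` are hypothetical, and the Failure Codes setting is simplified to a single lower bound (500 for the default 500-).

```python
from enum import Enum, auto


class Outcome(Enum):
    ARR_TIMEOUT_502_3 = auto()     # ARR timed out waiting on the web server
    ARR_NO_SERVERS_502_4 = auto()  # no healthy or enabled servers available
    URL_HEALTH_TEST = auto()       # ARR's own URL Test probe
    WEB_SERVER_RESPONSE = auto()   # response passed through from a web node


def increments_failed_requests(outcome, status=None, failure_floor=500):
    """Model of whether the Failed Requests counter goes up.

    `failure_floor` stands in for an open-ended Failure Codes value,
    e.g. 500 for "500-" or 400 for "400-" (a simplification).
    """
    if outcome is Outcome.ARR_TIMEOUT_502_3:
        # Always counted, regardless of the Failure Codes setting.
        return True
    if outcome in (Outcome.ARR_NO_SERVERS_502_4, Outcome.URL_HEALTH_TEST):
        # Neither one touched the counters in my tests.
        return False
    # A response from the web node counts only if it matches Failure Codes.
    return status is not None and status >= failure_floor


if __name__ == "__main__":
    print(increments_failed_requests(Outcome.WEB_SERVER_RESPONSE, 500))
    print(increments_failed_requests(Outcome.WEB_SERVER_RESPONSE, 404))
    print(increments_failed_requests(Outcome.ARR_TIMEOUT_502_3, failure_floor=999))
```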