Overview of Application Request Routing’s Health Check features

A recent question about the ARR health check features prompted this overview. Hopefully it is helpful to those of you setting up new ARR deployments.

URL Test Feature

The URL Test feature sends a request to a specified URL and checks the following conditions. If any condition fails, the server that failed the test is taken offline.

  • A response was received within the configured timeout period
  • The HTTP status meets the configured acceptable status codes
  • The body of the response contains the specified text configured in the response match.
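The three conditions above can be sketched as a single pass/fail check. This is only an illustration of the logic, not ARR's implementation: the function names are hypothetical, an empty response match is assumed to skip the body check, and an open-ended range such as "500-" is assumed to run to 599.

```python
def status_matches(code: int, spec: str) -> bool:
    """Return True if code falls inside a spec such as "200-399", "500-" or "200"."""
    low, _, high = spec.partition("-")
    low_code = int(low)
    if "-" not in spec:
        high_code = low_code          # single status code
    elif high == "":
        high_code = 599               # assumption: open-ended range stops at 599
    else:
        high_code = int(high)
    return low_code <= code <= high_code


def url_test_passes(status: int, body: str, elapsed_seconds: float,
                    timeout_seconds: float = 30.0,
                    status_code_match: str = "200-399",
                    response_match: str = "") -> bool:
    """Apply the three URL test conditions to one observed response."""
    # 1. A response was received within the configured timeout period.
    if elapsed_seconds > timeout_seconds:
        return False
    # 2. The HTTP status is within the acceptable status codes.
    if not status_matches(status, status_code_match):
        return False
    # 3. The body contains the response match text (case-sensitive).
    #    Assumption: an empty response match disables this condition.
    if response_match and response_match not in body:
        return False
    return True
```

For example, url_test_passes(200, "Healthy", 0.2, response_match="Healthy") passes, while the same call with a 500 status, a body of "healthy", or an elapsed time of 31 seconds fails.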

If the URL is set to the FQDN of the ARR server then the tests will be performed against all servers configured in the farm. 

Servers can be brought back online either manually or automatically, when the next successful URL test run determines that the server is healthy again.

Because this feature is limited to a single URL, it is recommended to create a smoke test page that represents the overall health of the server. ARR can be further configured to look for specific words in the response entity body, so that the results of the smoke test are taken into consideration when determining health.
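One way to build such a smoke test page is to run several small checks and collapse them into one status code and one body string that the response match can look for. The sketch below is a hypothetical helper, not part of ARR; the check names and the "FAILED" wording are assumptions. The failure text deliberately avoids the word "Unhealthy" so that a match on "Healthy" cannot succeed by accident.

```python
from typing import Callable


def smoke_test_body(checks: dict[str, Callable[[], bool]]) -> tuple[int, str]:
    """Run every check and return (HTTP status, response body) for the page."""
    failed = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            # A check that raises is treated the same as a check that fails.
            ok = False
        if not ok:
            failed.append(name)
    if failed:
        # Both the status code and the body signal failure to the URL test.
        return 503, "FAILED: " + ", ".join(failed)
    return 200, "Healthy"
```

The page handler would call this with whatever checks matter for the application (database connectivity, disk space, and so on) and write the returned status and body to the response.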

The feature is disabled if there is no URL present in the dialog.

UI Setting: URL
Configuration Attribute: <attribute name="url" type="string"/>
Description: URL of the content you want to perform the check against. Since the test is designed to read the contents of the response, a text file is sufficient, for example http://(server name or FQDN of ARR server)/healthCheck.txt

UI Setting: Interval
Configuration Attribute: <attribute name="interval" type="timeSpan" defaultValue="00:00:30" validationType="timeSpanRange" validationParameter="0,86400,1"/>
Description: How often the test is performed.

UI Setting: Time-out (seconds)
Configuration Attribute: <attribute name="timeout" type="timeSpan" defaultValue="00:00:30" validationType="timeSpanRange" validationParameter="1,86400,1"/>
Description: Configurable timeout period in which a valid response must be received.

UI Setting: Acceptable status codes
Configuration Attribute: <attribute name="statusCodeMatch" type="string" defaultValue="200-399"/>
Description: Range of acceptable HTTP status codes that signify success.

UI Setting: Response Match
Configuration Attribute: <attribute name="responseMatch" type="string" caseSensitive="true"/>
Description: String that must be contained in the body of a successful response. If the string “Healthy” is entered here, the response body must contain this string.

Live Traffic Test (enabled by default)

The Live Traffic test marks a server unhealthy when all of the following conditions are met.

  • The HTTP response falls within the failure code range configured by liveTrafficFailureCodes
  • The number of such failures reaches the count configured by maxLiveTrafficFailures
  • Those failures occur within the time period configured by liveTrafficFailurePeriod
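The combination of these conditions behaves like a sliding-window failure counter. The sketch below illustrates that idea under stated assumptions and does not show how ARR counts internally: the class and parameter names are hypothetical, failure codes are simplified to "at or above 500" (matching the "500-" default), and reaching the configured count is assumed to be enough to mark the server unhealthy.

```python
from collections import deque


class LiveTrafficMonitor:
    """Tracks failures of live requests for one server inside a time window."""

    def __init__(self, max_failures: int, period_seconds: float,
                 failure_floor: int = 500):
        self.max_failures = max_failures      # stands in for maxLiveTrafficFailures
        self.period_seconds = period_seconds  # stands in for liveTrafficFailurePeriod
        self.failure_floor = failure_floor    # simplification of the "500-" range
        self.failure_times = deque()
        self.healthy = True

    def record(self, status: int, now: float) -> bool:
        """Record one live response and return the server's health afterwards."""
        if self.period_seconds <= 0:
            return self.healthy               # a period of 0 disables the test
        if status >= self.failure_floor:
            self.failure_times.append(now)
        # Forget failures that are older than the failover period.
        while self.failure_times and now - self.failure_times[0] > self.period_seconds:
            self.failure_times.popleft()
        # Assumption: reaching the configured count marks the server unhealthy.
        if len(self.failure_times) >= self.max_failures:
            self.healthy = False
        # Nothing here sets healthy back to True: live traffic only marks
        # servers unhealthy.
        return self.healthy
```

With max_failures=3 and period_seconds=10, three 5xx responses at seconds 0, 1 and 3 mark the server unhealthy, while three failures at seconds 0, 8 and 20 do not, because no 10-second window contains all three.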

To disable the feature set the Failover period to 0.

UI Setting: Failure Codes
Configuration Attribute: <attribute name="liveTrafficFailureCodes" type="string" defaultValue="500-"/>
Description: Range of HTTP status codes that signify failure.

UI Setting: Maximum Live Traffic Failures
Configuration Attribute: <attribute name="maxLiveTrafficFailures" type="uint" defaultValue="10" validationType="integerRange" validationParameter="1,4294967295"/>
Description: Specifies the maximum number of failures that are allowed during the failover period.

UI Setting: Failover period (seconds)
Configuration Attribute: <attribute name="liveTrafficFailurePeriod" type="timeSpan" defaultValue="00:00:00" validationType="timeSpanRange" validationParameter="0,86400,1"/>
Description: Specifies the failover period in seconds. To disable the Live Traffic test, set this value to 0.

UI Setting: Minimum Servers
Configuration Attribute: <attribute name="minServers" type="uint" defaultValue="0"/>
Description: Set as a percentage of healthy servers. When the number of healthy servers drops below this number, the following takes place.

  • An event is raised and logged to the Event Viewer.
  • Requests are routed to all servers regardless of their health status, except for servers that were taken offline manually.
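The fallback behavior can be sketched as a function that picks which servers may receive requests. This is a simplified illustration with hypothetical names: it treats the threshold as a plain number compared against the count of healthy servers (the attribute is typed as uint), and it leaves out the event logging.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    healthy: bool = True            # as decided by the health tests
    manually_offline: bool = False  # taken offline by an administrator


def routable_servers(servers: list[Server], min_servers: int) -> list[Server]:
    """Return the servers that requests may be routed to."""
    # Manually offline servers never receive traffic.
    available = [s for s in servers if not s.manually_offline]
    healthy = [s for s in available if s.healthy]
    if len(healthy) < min_servers:
        # Below the minimum: ignore health status and use every available server.
        return available
    return healthy
```

With one healthy server, one unhealthy server and one manually offline server, a threshold of 0 routes only to the healthy server, while a threshold of 2 routes to both the healthy and the unhealthy server.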

What happens if all Servers are offline?

The user will receive the following error in the browser:

[image: browser error page]

Failed Request Tracing logs will record the error and its reason:

[image: Failed Request Tracing log entry]

Summary

When setting up your environment, it's important to note that the URL Test feature can mark a server both unhealthy and healthy, while the Live Traffic test can only mark a server as unhealthy.

The reason for this is that the URL test has no user impact, so there is no risk to the user experience. If you experience transient outages that you want to recover from via the health checking features, it is recommended to use a combination of both the Live Traffic test and the URL test. If you want failed servers to stay offline until an administrator takes action, use only the Live Traffic test.


1 Comment

  • I have a question regarding the health check.

    I have ARR set up with a 2-node farm. The health check fires every 15 seconds and requests the "/home" page from each node on port 8080 using the FQDN of the site.

    Every 15 seconds, my application's log file records a SocketException, "connection reset by peer".

    Is there a reason that ARR makes a request to the application server, then once it gets the 200 OK response it breaks the connection before it gets the whole response?

    This is filling my log files up with the same error message and stack trace to about 100MB per day, when previously they were about 150k each day.

    Thanks,
    Brad
