Definitions of common terms and explanation of CopperEgg alerting

Timeout: This is reported when a CopperEgg test station sends an HTTP request (as defined in the probe definition) and no response is returned within 10 seconds. The CopperEgg 10-second timeout period is not currently configurable.

A timeout on a test station affects both the uptime score and the Health score.

A single test station reporting a timeout does not cause the alert to fire. Each time the alert fires, at least 3 independently operating test stations, in 3 locations and on 3 cloud providers around the world, have timed out waiting for a response to their requests during the same time intervals.
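As a rough illustration of the rule above, the following Python sketch checks whether the timeouts in one interval come from at least 3 distinct stations, locations, and cloud providers. The `ProbeResult` type, its field names, and the function name are hypothetical and are not part of any CopperEgg API; the sketch only restates the rule as described here.

```python
from typing import List, NamedTuple


class ProbeResult(NamedTuple):
    """One probe outcome for one interval (hypothetical structure)."""
    station: str
    location: str
    provider: str
    timed_out: bool


def enough_independent_timeouts(results: List[ProbeResult], minimum: int = 3) -> bool:
    """True when the timeouts span at least `minimum` distinct stations,
    locations, and cloud providers."""
    timed_out = [r for r in results if r.timed_out]
    stations = {r.station for r in timed_out}
    locations = {r.location for r in timed_out}
    providers = {r.provider for r in timed_out}
    return min(len(stations), len(locations), len(providers)) >= minimum
```

With this check, a timeout reported by only one station never satisfies the condition, no matter how often that station times out.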

 

These same test stations run tests against thousands of web servers and ports, including CopperEgg's own web servers and Google.com. The data for CopperEgg services and Google is gathered every 15 seconds.

About definitions and nuances:

A 10-second timeout when waiting for an HTTP response IS arbitrary; there is no standard that says 'if 10 seconds pass without a response, stop waiting and call it a timeout.' It is entirely possible that all of the responses WERE returned, for example, 12 seconds after the request. Unfortunately, as mentioned above, the 10-second timeout is not configurable.
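To make the nuance concrete, here is a minimal Python sketch of how a fixed cutoff classifies responses. The function name and the choice to treat a response arriving exactly at the limit as a timeout are assumptions for illustration; only the 10-second limit comes from this article.

```python
from typing import Optional

TIMEOUT_SECONDS = 10.0  # fixed limit described in this article


def classify_response(elapsed_seconds: Optional[float]) -> str:
    """Classify a probe by how long the response took.

    `None` means no response ever arrived. Treating a response that arrives
    exactly at the limit as a timeout is an assumption of this sketch.
    """
    if elapsed_seconds is None or elapsed_seconds >= TIMEOUT_SECONDS:
        return "timeout"
    return "ok"
```

Under this rule, a server that answers after 12 seconds is recorded the same way as a server that never answers.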

 

One other nuance here is the alert definition:

 

Consider the “Service Down” alert as an example:

Average uptime = 0% for at least 5 minutes tested from a minimum of 3 probing stations.

CopperEgg alert processing will proceed as follows:

- Periodically, the previous 5 minutes' worth of samples is analyzed.

- For the alert definition, an average is calculated from the data samples from all test stations.

- If the average is 0 over the previous 5 minutes, the alert is triggered, unless it has already been triggered.
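The steps above can be sketched in Python as follows. This is an illustrative approximation, not CopperEgg's actual implementation: the function name, the dictionary layout (station name mapped to that station's uptime samples for the previous 5 minutes, each 0 or 100), and the way the 3-station minimum is enforced are all assumptions.

```python
from typing import Dict, List


def service_down_should_fire(
    samples_by_station: Dict[str, List[float]],
    already_triggered: bool,
    min_stations: int = 3,
) -> bool:
    """Decide whether a 'Service Down' alert should fire for one 5-minute window.

    Each sample is an uptime percentage for one probe: 0 (timeout) or 100.
    """
    if already_triggered:
        return False  # the alert only fires if it is not already triggered
    if len(samples_by_station) < min_stations:
        return False  # not enough probing stations reported in the window
    # Pool the samples from all test stations and average them.
    pooled = [s for samples in samples_by_station.values() for s in samples]
    if not pooled:
        return False
    return sum(pooled) / len(pooled) == 0
```

Because the samples are averaged, a single successful probe anywhere in the window keeps the average above 0 and prevents the alert from firing.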

  

If your primary concern is uptime reporting, another approach, which is the one we use at CopperEgg, is the following:

- Test probes at 15 second intervals, from all test stations

- Modify the 'Service Down' alert to the following:

  - Alert me when maximum %uptime is less than 5% for at least 5 minutes
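A hedged Python sketch of the modified condition follows. It assumes that "maximum %uptime" means the highest per-station uptime percentage over the 5-minute window; the article does not define the aggregation precisely, and the function names and data layout here are hypothetical.

```python
from typing import Dict, List


def station_uptime_percent(samples: List[bool]) -> float:
    """Uptime for one station: the share of probes answered in time, as a percent."""
    return 100.0 * sum(samples) / len(samples)


def max_uptime_alert(
    samples_by_station: Dict[str, List[bool]],
    threshold_percent: float = 5.0,
) -> bool:
    """True when even the best-performing station is below the threshold."""
    per_station = [
        station_uptime_percent(samples)
        for samples in samples_by_station.values()
        if samples  # skip stations with no samples in the window
    ]
    return bool(per_station) and max(per_station) < threshold_percent
```

Under this reading, one station that can still reach the service is enough to keep the alert from firing.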

What will this achieve?

With these settings, tests are conducted every 15 seconds from many stations, as opposed to one check every 5 minutes from a minimum of 3 stations.

For services to be inaccessible for 5 minutes straight from all locations, 20 * 4 = 80 samples would ALL (or nearly all) have to time out for Service Down to trigger, as compared to 5 * 4 = 20 samples if only 3 monitoring stations are selected.
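The sample counts can be worked out with a short Python sketch. The function names are hypothetical, and the station count is left as a parameter because the figures above do not state which station counts they assume; the only fixed inputs are the 5-minute window and the 15-second interval.

```python
def samples_per_station(window_minutes: int = 5, interval_seconds: int = 15) -> int:
    """Number of probes one station completes within the alert window."""
    return (window_minutes * 60) // interval_seconds


def total_samples(stations: int, window_minutes: int = 5, interval_seconds: int = 15) -> int:
    """Total probes across all stations within the alert window."""
    return stations * samples_per_station(window_minutes, interval_seconds)
```

For example, `samples_per_station()` gives 20 (5 minutes at 4 probes per minute), and `total_samples(4)` gives 80.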

 
