Reducing Complexity in Systems Monitoring

In this ControlUp “how-to” article we take a look at the Stress Level feature. Stress Level addresses the inherent complexity of performance monitoring by combining several performance metrics into one intuitive health index.

In a complex IT system, such as a VDI environment with hundreds of machines, proper monitoring helps ensure smooth operations. However, seeing the big picture and quickly locating and solving issues remains a significant challenge. Take a basic scenario of CPU usage reaching 70%, for example. This could mean that the machine is overloaded, but it may also be a sign of efficient utilization. Administrators need to check several performance metrics in order to understand the severity of an issue. Understanding the bigger picture, especially when time is limited, requires combining different performance metrics into a single one that we, at ControlUp, call the Stress Level.

In this article, we will elaborate on our approach and share best practices regarding how to keep an eye on dozens of different metrics and gauges simultaneously.

What Is ControlUp’s Stress Level?

Stress Level is the aggregation of multiple metrics into one readable statistic. To reduce the complexity of performance monitoring, monitoring systems often offer a way to combine multiple metrics into one. You can see in the example below that different metrics contribute to the Stress Level according to set thresholds and weights.

Here we can see metrics on certain machines, but through Stress Level you can easily gauge users and processes as well.

How It Works

When analyzing monitoring metrics, you want to know that a computer (or user or process) is in trouble without having to look at a specific metric. The Stress Level is like the bottom line of your credit card statement: an unexpectedly high total prompts you to investigate which item weighed most heavily. Each metric that is counted can contribute points to the total Stress Level of a monitored resource. If a computer has just one metric contributing two points of stress, its Stress Level totals two points and appears as a medium Stress Level in yellow, as seen on the above dashboard.
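The point-summing idea can be sketched in a few lines of Python. This is an illustration only, not ControlUp's implementation: the metric names, thresholds, and the point bands in `stress_label` are assumptions we chose for the example. The only detail taken from the text is that a total of two points reads as a medium Stress Level.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricRule:
    """Hypothetical rule: a metric adds `points` once it exceeds `threshold`."""
    name: str
    threshold: float
    points: int


def stress_points(readings, rules):
    """Sum the points of every rule whose metric reading exceeds its threshold."""
    total = 0
    for rule in rules:
        value = readings.get(rule.name)
        if value is not None and value > rule.threshold:
            total += rule.points
    return total


def stress_label(points):
    """Map a point total to a label. The bands are assumed for illustration."""
    if points >= 4:
        return "high"
    if points >= 2:
        return "medium"
    return "low"


# Example rules and readings (all values are made up).
rules = [
    MetricRule("cpu_percent", threshold=90, points=2),
    MetricRule("memory_percent", threshold=85, points=2),
]
readings = {"cpu_percent": 95, "memory_percent": 40}

total = stress_points(readings, rules)
label = stress_label(total)
```

Here only the CPU rule fires, so the machine ends up with two points and a medium label, matching the single-metric case described above.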

As shown below, metrics translate into weight points that ultimately determine the overall Stress Level, in this case of a single process.

Stress Level Customization

Each IT system has its own purpose, SLA and capability. As shown in the example above, performance is relative. Monitoring needs to be flexible and configurable, especially when looking at an aggregated Stress Level metric. Therefore, it is important to be able to select which metrics contribute to a Stress Level and to set the contribution of each one individually. In the example below, we’ve customized the level of contribution according to the number of user sessions per machine. If the metric surpasses 60, it is marked in red and contributes two Stress Level points, and this setting applies to all machines in the selected folder.
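To make the idea of per-folder customization concrete, here is a minimal sketch of how such settings could be represented as data. The folder name `VDI-Pool-A`, the yellow band at 40 sessions, and the function name `evaluate` are hypothetical; only the rule "above 60 sessions is red and worth two points" comes from the example in the text.

```python
# Hypothetical per-folder settings: for each metric, a list of
# (threshold, color, points) bands ordered from most to least severe.
FOLDER_RULES = {
    "VDI-Pool-A": {
        "user_sessions": [
            (60, "red", 2),     # from the article's example
            (40, "yellow", 1),  # assumed intermediate band
        ],
    },
}


def evaluate(folder, metric, value, rules=FOLDER_RULES):
    """Return (color, points) for the most severe band the value exceeds."""
    for threshold, color, points in rules.get(folder, {}).get(metric, []):
        if value > threshold:
            return color, points
    # No band exceeded, or no rule configured for this folder and metric.
    return "green", 0


busy = evaluate("VDI-Pool-A", "user_sessions", 61)
borderline = evaluate("VDI-Pool-A", "user_sessions", 60)
quiet = evaluate("VDI-Pool-A", "user_sessions", 10)
```

Keeping the bands in a per-folder table means one group of machines can have stricter limits than another without changing the evaluation logic.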

Is It Possible to Receive an Alert Based on a Single Metric?

Cases may arise where you want to force a Stress Level alert because a specific critical metric has crossed its limit. For example, if there is less than 50 MB of free space on your system drive, ControlUp allows you to give that specific metric a weight high enough to dominate the overall Stress Level.
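One way to picture this is a single rule whose weight is large enough to reach the highest stress band on its own. The sketch below assumes a hypothetical cutoff of four points for a high Stress Level; the 50 MB limit is the one from the example above, and the function names are ours.

```python
# Assumed number of points at which a resource counts as highly stressed.
HIGH_STRESS_POINTS = 4


def free_space_points(free_mb, critical_mb=50, critical_points=HIGH_STRESS_POINTS):
    """Contribute enough points to force a high Stress Level when space is critical."""
    return critical_points if free_mb < critical_mb else 0


def total_points(other_points, free_mb):
    """Combine points from all other metrics with the free-space override."""
    return other_points + free_space_points(free_mb)


# A machine that is otherwise healthy but has only 49 MB free.
forced = total_points(other_points=0, free_mb=49)
healthy = total_points(other_points=0, free_mb=2048)
```

Because the critical rule alone reaches the assumed cutoff, the alert fires regardless of how the other metrics look.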

Time Sensitivity – Acute vs Chronic Issues

In addition, as shown in the configuration window below, you can set an alert according to the duration of the event in order to reduce the number of false alerts. For example, in the case described above, it is imperative to know as quickly as possible when a user session count passes 60. However, when dealing with high CPU usage, it is more effective to send alerts based on average usage rather than on specific, one-time events. While the “current value” parameter is better suited to detecting spikes, “average in history” is more appropriate for detecting longer-lasting anomalies within a system.
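The difference between the two parameters can be illustrated with a small rolling window. This is a sketch under our own assumptions: the class `MetricWindow`, the window size of five samples, the mode strings, and the 90% threshold are all made up for the example and do not describe how ControlUp stores history.

```python
from collections import deque


class MetricWindow:
    """Keep the most recent `size` samples of one metric."""

    def __init__(self, size):
        self.samples = deque(maxlen=size)

    def add(self, value):
        self.samples.append(value)

    def current(self):
        return self.samples[-1]

    def average(self):
        return sum(self.samples) / len(self.samples)


def exceeds(window, threshold, mode):
    """Compare either the latest sample or the window average to the threshold."""
    value = window.current() if mode == "current" else window.average()
    return value > threshold


# A single CPU spike: the latest sample is high, the average is not.
spike = MetricWindow(size=5)
for sample in (20, 25, 30, 22, 98):
    spike.add(sample)

# Sustained load: every recent sample is high.
sustained = MetricWindow(size=5)
for sample in (92, 95, 91, 96, 94):
    sustained.add(sample)

spike_current = exceeds(spike, 90, "current")
spike_average = exceeds(spike, 90, "average")
sustained_average = exceeds(sustained, 90, "average")
```

With these samples the spike trips the "current" check but not the "average" check, while the sustained load trips the average check as well. That is why the average is the better basis for a CPU alert and the current value is the better basis for a session-count alert.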

See the video below to learn how to configure your ControlUp Stress Level:

Triggers and Taking Action

Once you’ve configured the appropriate Stress Level settings for your environment, it is important to make your alerts as effective as possible. To distinguish between elevated Stress Levels that deserve a second look and true Stress Level emergencies, you need to use different types of alerts. In ControlUp, the former can be sent by e-mail, while the latter can be delivered as push notifications through the Mobile App.
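The routing idea amounts to a small lookup from severity to delivery channel. In the sketch below, mapping "medium" to the second-look case and "high" to the emergency case is our assumption, as are the label and channel strings; the snippet does not call any ControlUp API.

```python
# Assumed mapping from Stress Level label to alert channel.
ALERT_ROUTES = {
    "medium": "email",  # deserves a second look
    "high": "push",     # emergency, delivered to the mobile app
}


def alert_channel(level):
    """Return the channel for a Stress Level label, or None if no alert is needed."""
    return ALERT_ROUTES.get(level)


channels = [alert_channel(level) for level in ("low", "medium", "high")]
```

Keeping the noisier, lower-severity alerts on e-mail helps ensure that a push notification is always worth an immediate look.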

For further information on triggers and alerts, please check the following:
ControlUp Trigger Events
Mobile Notifications