“Noisy neighbor” is a term used to describe a cloud computing infrastructure co-tenant that monopolizes bandwidth, disk I/O, CPU and other resources, and can negatively affect other users’ cloud performance.
A few days ago, I had a conversation with a coworker about virtual desktops (that’s a thing that happens pretty often here at ControlUp because we ❤️ VDI). I mentioned the term “noisy neighbor” and they had no idea what I was talking about.
It’s funny how we become myopic in our worlds; I’ve used the term “noisy neighbor” for decades and had thought that it was standard-issue terminology in the VDI community. But after this conversation, I decided to ask around, and I was surprised to learn that this term is actually somewhat specific to the VMware community, though some other VDI vendors may use the term on occasion. In fact, I couldn’t find a single person with a VMware background who didn’t know the term.
So, what, exactly, is a noisy neighbor? Let’s take a look, then I’ll explain how to identify and troubleshoot them if you encounter them.
A noisy neighbor is a single virtual machine (VM) on a physical host server that over-consumes a host’s resources, which then leaves the other VMs on the server with a resource deficit. Over the years, all the major hypervisors have come up with ways to mitigate the effects of noisy neighbors, but these techniques are imperfect and often fail due to improper implementation.
Unfortunately, the first indication that you have a noisy neighbor is often that one or more virtual desktop users complain about poor performance. In ControlUp, this issue can show up as increasing values in the CPU Ready and/or Processor Queue Length columns.
As with all things regarding system tuning and performance, there are nuances that you should be aware of and take into account. In basic terms, Processor Queue Length is the number of threads sitting in the server’s processor queue, waiting to be executed by the CPU. CPU Ready is the time that a VM is ready to run but is waiting for the hypervisor to schedule it on a physical CPU. A general rule of thumb is that when CPU Ready reaches 5%, you should be on alert; when it reaches 10%, users will notice problems and start to complain about the responsiveness of their applications.
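If you read CPU Ready from raw vSphere performance counters instead of a percentage column, the value arrives as a summation in milliseconds per sampling interval. The Python sketch below shows the commonly used conversion to a percentage and applies the 5% and 10% rule of thumb from above. The function names are made up for illustration, the 20-second default assumes vSphere's real-time sampling interval, and dividing by the vCPU count (to get a per-vCPU average) is an assumption you may want to drop.

```python
def cpu_ready_percent(ready_ms: float, interval_s: float = 20.0, vcpus: int = 1) -> float:
    """Convert a CPU Ready summation (ms) into a percentage of the interval.

    Assumption: the summation covers all vCPUs, so we divide by the vCPU
    count to get an average per vCPU.
    """
    if interval_s <= 0 or vcpus <= 0:
        raise ValueError("interval_s and vcpus must be positive")
    return ready_ms * 100.0 / (interval_s * 1000.0 * vcpus)


def classify_cpu_ready(percent: float) -> str:
    """Apply the rule of thumb: 5% means be on alert, 10% means users notice."""
    if percent >= 10.0:
        return "critical"
    if percent >= 5.0:
        return "warning"
    return "ok"


if __name__ == "__main__":
    # Hypothetical sample: 1,000 ms of ready time in a 20-second interval.
    pct = cpu_ready_percent(1000)
    print(f"{pct:.1f}% -> {classify_cpu_ready(pct)}")
```

With a single vCPU, 1,000 ms of ready time in a 20-second interval works out to 5%, which is right at the alert threshold.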
UX Score indicators rising above normal can also indicate that there is a noisy neighbor present. Since noisy neighbors show up more often in CPU metrics, however, I’m going to focus on those.
Often, a system administrator with a background in administering physical desktops or servers will try to troubleshoot the issue with additional hardware, but this won’t necessarily solve the problem in a virtual environment. Instead, the administrator should attempt to identify the noisy neighbor and its root cause, so it can be dealt with. As it happens, ControlUp makes it easy to do this.
The CPU Hog
The first time I encountered a noisy neighbor issue was in the early days of vSphere, and I actually inflicted it myself. I wanted to see how much performance I could get from a VM, so I created one that had eight vCPUs and ran a CPU benchmarking process on it. What I didn’t know was that there were other users on the system trying to get work done, and my VM was consuming every spare CPU cycle. Needless to say, this was adversely affecting the performance of the other VMs. The good news is that CPU hogs are usually the easiest type of noisy neighbor to identify and prevent.
The process I’ve used in ControlUp to identify a noisy neighbor that is overusing CPU has three steps. First, I identify the host that the noisy neighbor is on. Next, I group the VMs by host and expand that host to see how the noisy neighbor is affecting the other VMs on it. Last, I identify the process or application on the noisy neighbor that is using the CPU cycles and terminate it if necessary.
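The same group-by-host logic can be sketched outside the console. The Python below is only an illustration: the sample records, field names, and the `find_suspects` helper are hypothetical, and the heuristic (on any host where a VM has reached the 5% CPU Ready alert level, suspect the VM with the highest CPU usage) is a simplification of what you would judge by eye in the grid.

```python
from collections import defaultdict


def find_suspects(samples, ready_alert=5.0):
    """Return {host: vm_name} for hosts that show CPU contention.

    A host counts as contended when any of its VMs has a CPU Ready
    percentage at or above ready_alert. The suspect is the VM with the
    highest CPU usage on that host.
    """
    by_host = defaultdict(list)
    for sample in samples:
        by_host[sample["host"]].append(sample)

    suspects = {}
    for host, vms in by_host.items():
        if any(vm["cpu_ready_pct"] >= ready_alert for vm in vms):
            busiest = max(vms, key=lambda vm: vm["cpu_pct"])
            suspects[host] = busiest["name"]
    return suspects


if __name__ == "__main__":
    # Hypothetical metrics, not taken from a real environment.
    samples = [
        {"host": "esx01", "name": "vdi-01", "cpu_pct": 22, "cpu_ready_pct": 7.5},
        {"host": "esx01", "name": "vdi-02", "cpu_pct": 18, "cpu_ready_pct": 6.1},
        {"host": "esx01", "name": "vdi-07", "cpu_pct": 91, "cpu_ready_pct": 11.0},
        {"host": "esx02", "name": "vdi-08", "cpu_pct": 35, "cpu_ready_pct": 0.4},
    ]
    print(find_suspects(samples))
```

A flagged VM is only a starting point; you still need to look at which process on it is consuming the cycles before deciding what to do.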
In the example below, I deployed seven virtual desktops on a small server that only had four cores, eight threads, and 64 GB of RAM. I simulated various desktop loads on six of the virtual desktops and a heavy load on the remaining one.
Once I identified a desktop that I expected was being adversely affected by a noisy neighbor, I needed to see all the other VMs running on the host. To do this, I selected the datacenter in my organization tree, selected the Machines tab in the navigation bar, and then sorted the machines by host by dragging the Host Name column header to the Group by area.
To monitor the desktops on the host, I selected Detailed View – CPU from the Column Preset drop-down menu in the ControlUp Console. I then looked at the CPU, Processor Queue Length, and CPU Ready columns. The first two columns (CPU and Processor Queue Length) come from the Windows OS, while the other column (CPU Ready) comes from the hypervisor.
Without a noisy neighbor and the desktops doing minimal work, the columns had nominal values.
With a noisy neighbor present and desktops doing a normal workload, on the other hand, the values in these two columns increased significantly and the noisy neighbor had high CPU Ready values.
Once you’ve identified a noisy neighbor problem, you have many options for how to deal with it. If the workload that is causing the resource contention is legitimate and critical, you can use vMotion to move it (or other machines) from one host to another. If the noisy neighbor is legitimate, but not time-sensitive, you can use ControlUp to set its priority. To do this, right-click the process that is over-consuming resources, then right-click Processes and select Set Process Priority or Start CPU Throttling.
The Set Process Priority option will allow you to adjust the priority in a range from Idle to Real-Time.
The Start CPU Throttling option will set a limit for the CPU consumption of the process.
After I set CPU throttling to 27% for the process on the noisy VM, its CPU usage dropped from ~90% to ~58% and gave the other desktops enough cycles to operate to their full capacity.
ControlUp has a script action called Set VM resource allocation level that will change the resource allocation (CPU, HDD, memory) of a given vSphere VM. By default, the allocation is increased by one ‘SharesLevel’ for CPU, HDD, memory, or all three of these resources.
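To get a feel for what moving up one shares level means, here is a rough Python model. It does not reproduce the ControlUp script; it only assumes vSphere's predefined CPU share values (500, 1000, and 2000 per vCPU for low, normal, and high) and the simplification that, when a host is fully contended, CPU time is divided in proportion to shares. The helper names are hypothetical, and reservations, limits, and actual demand are ignored.

```python
# vSphere's predefined CPU share values per vCPU for each level.
CPU_SHARES_PER_VCPU = {"low": 500, "normal": 1000, "high": 2000}
LEVELS = ["low", "normal", "high"]


def raise_level(level):
    """Step a shares level up by one, staying at 'high' if already there."""
    index = LEVELS.index(level)
    return LEVELS[min(index + 1, len(LEVELS) - 1)]


def cpu_fractions(vms):
    """Fraction of contended CPU each VM is entitled to.

    vms maps a VM name to a (level, vcpus) tuple. This is a simplified
    model: it ignores reservations, limits, and idle VMs.
    """
    shares = {
        name: CPU_SHARES_PER_VCPU[level] * vcpus
        for name, (level, vcpus) in vms.items()
    }
    total = sum(shares.values())
    return {name: value / total for name, value in shares.items()}


if __name__ == "__main__":
    # Two hypothetical 2-vCPU desktops competing on one host.
    vms = {"vdi-01": ("normal", 2), "vdi-02": ("normal", 2)}
    print(cpu_fractions(vms))

    # Raise vdi-01 by one level and compare.
    level, vcpus = vms["vdi-01"]
    vms["vdi-01"] = (raise_level(level), vcpus)
    print(cpu_fractions(vms))
```

In this model, raising one of two equal desktops from normal to high shifts the split from an even half each to roughly two thirds versus one third, but only while the host is actually under contention.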
If you only need to set the CPU priority of a VM, you can use one of our Set VM CPU scripts.
Or, of course, if the workload is non-critical or non-legitimate, you can always use ControlUp to terminate it by clicking Processes and selecting End Process or Kill Process. The End Process function will try to end the process gracefully, while Kill Process will terminate it forcibly.
After I killed the process that was overusing CPU resources, I saw that the other desktops on the host had returned to normal levels of activity and that the CPU Ready value was minimal.
Over the past two decades, hypervisors have gotten much better at detecting and eliminating noisy neighbors, but we still see them on occasion. As such, we must be able to quickly identify and mitigate them to minimize disruption to other desktops residing on the same host. Thanks to ControlUp full-stack monitoring, identifying and dealing with noisy neighbors is straightforward and can be done entirely from within the ControlUp Console. If you find that you have recurring issues with noisy neighbors, you could set up a trigger to run a script such as Set VM resource allocation level to self-heal the problem.