Reduce Costs and Optimize Azure Virtual Desktop

One of the major drawbacks of Azure, or any cloud-based deployment, is the way you get billed. The more cloud services you use, the more you get charged. This can get especially costly if users remain logged on to your servers even when they are idle. But with ControlUp, Azure cost management can be simplified. You can get a handle on timeouts, logging users off when they’re not working, so you can reduce costs and optimize your Azure Virtual Desktop environments.

 

I was in the middle of an environment migration. The new design had everything: 100% redundancy, the latest and fastest hardware, and it was designed with security and performance in mind. It was cutting edge. People were going to love it!

This is what’s known as “foreshadowing.”

Reader, they did not love it.

As we migrated users from their “old” farm, we started to receive complaints. Some people were complaining that when they were working, their app would disappear. Another complaint we heard was that people would come back from lunch and their application would be closed.

The users talked about it being a jarring experience. Applications would just disappear—POOF!—in front of their eyes. Or they would return from lunch and their app would simply be gone. Relaunching it wouldn’t restore their previous session and they’d have to start their entire workflow over again. This was especially infuriating because they had information in their apps that was lost when the app disappeared.

It turned out that, as part of a security requirement, all users had to adjust to new, aggressive timeouts: 30 minutes idle, 30 minutes disconnected, then log off. This was the cause of the complaints. The apps team and the migrated users had been made aware of the new timeouts and had signed off on them, but until you experience time limits like these, it’s hard to guess if they actually work for you. Once they experienced them, they quickly saw that this configuration did not let them work efficiently.

The solution?

Extend the timeouts.

They were extended to 60 minutes idle, and 30 minutes disconnected.

This helped most people, but some app teams were still struggling because their workflows took them away from their workstations for more than an hour and they were still losing their work. For these teams, timeouts were extended to two hours idle and two hours disconnected. This satisfied the app teams, and complaints from our users dropped.

This is the balancing act admins go through when trying to simultaneously ensure a good digital employee experience, security, and user productivity. When applied to resources in the cloud, it gets way more complicated, and potentially prohibitive.

 

The Trouble with Timeouts

Timeouts in a multi-session machine like Windows Server or Windows 10 Enterprise for Virtual Desktops (Win10 EVD) have two distinct phases with individual timers attached to each. The states user sessions can have are “Active” and “Disconnected.”

But what about Idle? How do you determine that?

Monitoring programs, like ControlUp, determine if a session is Idle by looking at the timer attached to the Active session state. If the timer is 30 minutes or greater, ControlUp considers the session to be “Idle” and will show it as such in our historical monitoring application, ControlUp Insights. With Insights, you can see the “flow” of session states over the course of time.
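To make the rule concrete, here is a minimal sketch of that classification. It is not ControlUp’s code; the function name and the way the timer is passed in are my own assumptions, and only the 30-minute threshold comes from the description above.

```python
from datetime import timedelta

# Threshold quoted in the text; everything else here is illustrative.
IDLE_THRESHOLD = timedelta(minutes=30)


def display_state(session_state, state_timer):
    """Map a raw session state and its timer to the state shown to an admin.

    session_state: "Active" or "Disconnected" (the two states Windows reports).
    state_timer:   the timer attached to that session state.
    """
    if session_state == "Active" and state_timer >= IDLE_THRESHOLD:
        return "Idle"
    return session_state


print(display_state("Active", timedelta(minutes=12)))
print(display_state("Active", timedelta(minutes=45)))
print(display_state("Disconnected", timedelta(hours=3)))
```

“Idle” only ever replaces “Active” here; a disconnected session stays disconnected no matter how long its timer has been running.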

Organization with two-hour idle and two-hour disconnected timeouts over 24 hours.

 

How are these timers set? They are configured via Group Policy or by setting their equivalent registry values. For Active sessions, the timeout value is set with this policy, “Set time limit for active but idle Remote Desktop Services sessions.”

 

The “Set time limit for active but idle Remote Desktop Services sessions” policy.

 

Once the “Idle session limit” is exceeded, the session can be logged off or moved into a disconnected state. Moving a session to a disconnected state stops the application or desktop from being remoted from the server to the client, but keeps everything in the session running. A user can reconnect to a disconnected session and find their applications still running.

 

Set time limit for disconnected sessions in AVD

 

If you stick to applying your timeouts via the group policy editor, you can see you don’t have granular choices. For instance, 45 minutes, 90 minutes, and other values are not available in the GUI. If one-hour timeouts are insufficient, you’ll be forced to choose two hours.
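The registry route mentioned earlier is how you would get the in-between values. As a rough sketch: to my knowledge, these policies are backed by the MaxIdleTime and MaxDisconnectionTime values under HKLM\SOFTWARE\Policies\Microsoft\Windows NT\Terminal Services, stored in milliseconds. Treat the value names, the path, and the unit as assumptions to verify against Microsoft’s documentation; the helper name below is my own.

```python
# Assumption: the timeout registry values are stored in milliseconds.
MS_PER_MINUTE = 60 * 1000


def minutes_to_policy_value(minutes):
    """Convert a timeout in minutes to the millisecond value assumed above."""
    if minutes < 0:
        raise ValueError("timeout must not be negative")
    return minutes * MS_PER_MINUTE


# Values the Group Policy editor drop-down does not offer.
for minutes in (45, 90):
    print(f"{minutes} minutes -> {minutes_to_policy_value(minutes)} ms")
```

The script only does the conversion; it deliberately does not touch the registry.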

 

Cloud Services Have a Different Mindset

The major cloud service providers have a different mechanism for billing. With Azure, AWS, IBM, and Oracle, you are charged with consumption-based billing: the more you consume of a cloud service, the more you pay. This is analogous to how an individual pays for utility services like gas or electricity.

 


 

The difference between utilities in your home and cloud services is that there are different properties within the cloud service that get charged at different rates. You might end up paying for network traffic, the amount and type of storage, or whether your cloud machine is “allocated” to a CPU. In addition, time becomes an important factor. It’s not enough to just “order” five virtual machines and think you are done with it. Consumption-based billing has a rate. Those five machines each have a rate at which they’ll be charged. “Pay as you go” charges for Microsoft Azure accrue at a per-second rate for the cloud resources you consume.
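Here is a small sketch of what per-second accrual means for those five machines. The hourly rate is a made-up number, not an Azure price, and the function name is my own.

```python
SECONDS_PER_HOUR = 3600


def pay_as_you_go_cost(hourly_rate, seconds_allocated, machines=1):
    """Cost of `machines` identical VMs, each allocated for `seconds_allocated`."""
    per_second_rate = hourly_rate / SECONDS_PER_HOUR
    return per_second_rate * seconds_allocated * machines


# Five machines at a hypothetical $0.50/hour, allocated for 10 h 30 min 15 s.
allocated = 10 * 3600 + 30 * 60 + 15
print(round(pay_as_you_go_cost(0.50, allocated, machines=5), 2))
```

Every second a machine stays allocated shows up in the total, which is why the time a session keeps a server alive matters so much in the rest of this post.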

Services like Azure Virtual Desktop (AVD) run on Microsoft Azure and require some workloads, such as multi-session Windows 10, to run on Azure. This version of Windows is called Windows 10 Enterprise for Virtual Desktops, or Win10 EVD for short. It’s an exciting feature of the AVD service because multi-session Windows 10 includes the full desktop-type user experience you would get with a physical computer.

To ensure your organization is productive, an admin will deploy as many virtual machines as are necessary for their users to use applications or desktops. If you have more machines than your users need, then you’re wasting money by consuming more cloud service than necessary. You can reduce the amount spent on cloud services by reducing consumption as much as possible.

In the case of Azure Virtual Desktop, Citrix Cloud, Horizon Cloud or other EUC solutions, the biggest expense comes from virtual machine-allocated CPUs. To reduce your spend, you only want machines allocated when they are being used. If they’re not in use, then you want them deallocated. Users are the largest cause of a machine being powered on and allocated.

With cloud services, we need to think about how we handle our users and our machines. Leaving servers online for longer than needed leads to unnecessary expenditures. To reduce the time machines are running, we need to revisit timeouts and the ways they impact our users. We need to understand a user’s session life cycle to see where we can optimize.

 

Session Lifecycle

Reviewing the typical organization’s session state life cycle over a 24-hour period, we can suss out some information.

 

Organization with two-hour idle and two-hour disconnected timeouts over 24 hours.

 

These sessions had two-hour idle and two-hour disconnected timeouts. If you broke down what an individual session life cycle looked like, you’d see something like this:

 

Individual session lifecycle

 

Users log on at the start of their shift and work until lunch, when their session goes idle. They come back from lunch and work for the afternoon. At this point, a session tends to end in one of two ways: either the user logs off, or they walk away and let their session idle until it exceeds all the timers and logs off on its own. If we were to look at a series of session life cycles on a multi-session server, we’d see something like this:

 

Server showing session lifecycles

 

Users don’t all log on at the same time. They don’t log off at the same time, either. Some users work longer than others, some start later in the day, and some, for whatever reason, stay active on the server longer than the rest.

 

The End-of-the-Day Problem


 

Jim Moyle, our good friend on the AVD team at Microsoft, has an excellent presentation on the end of user sessions and their impact on keeping machines online, and he gave me permission to talk about it in this blog. Thank you, Jim.

 

The “end-of-the-day” problem: how many users need to be logged off to shut down 50 percent of your machines?

 

Let’s start by visualizing the “end of the day” where users log off a multi-session machine.

 

Visualization of the “end of the day,” where users log off a multi-session machine.

 

Users log off randomly and at different times. Eventually, all users log off and the machine can be shut down. This is the end goal: stop getting charged for the machine by powering it off.

Now let’s imagine a scenario with 512 users spread across 32 servers. How many users would need to be logged off to power down 50 percent of the machines?

With 512 users spread across 32 servers, how many users would need to be logged off to power down 50 percent of the machines?

 

The answer: there’d need to be just 21 users left, or about 4 percent of the population.

To shut down 50 percent of the machines, you need 96 percent of the users to have logged off! But what if you have fewer than 16 users per server? Or what about 32 users? Or 8?
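If you want to sanity-check the 21-user figure yourself, here is a back-of-the-envelope sketch. It is not the method from Jim’s presentation; it assumes users log off in a uniformly random order and asks for the largest number of remaining users at which the expected number of empty servers is still at least half. The function names are my own.

```python
from math import comb


def expected_empty_servers(servers, users_per_server, users_left):
    """Expected number of empty servers when `users_left` users remain,
    assuming every set of remaining users is equally likely."""
    total = servers * users_per_server
    # Chance that none of the remaining users sits on one particular server.
    p_empty = comb(total - users_per_server, users_left) / comb(total, users_left)
    return servers * p_empty


def max_users_left(servers, users_per_server, target_fraction=0.5):
    """Largest number of remaining users for which the expected number of
    empty servers still reaches `target_fraction` of all servers."""
    total = servers * users_per_server
    target = servers * target_fraction
    for users_left in range(total, -1, -1):
        if expected_empty_servers(servers, users_per_server, users_left) >= target:
            return users_left
    return 0


left = max_users_left(servers=32, users_per_server=16)
print(left, f"{left / 512:.1%} of users still logged on")
```

Under these assumptions the 512-user, 32-server case lands on the same 21 users quoted above.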

Jim created a program to test these numbers and ran some simulations. The results:

 

As the number of users per server increases, you need a larger percentage of overall users logged off to free up machines to be shut down.

 


This observation led to a mathematical formula by Angela Yiang that describes this behavior:

 

Angela Yiang’s formula describing this behavior.
Thanks, Angela Yiang!

 

The biggest roadblock to shutting down a machine is an active user. As you increase user density, you need a larger percentage of the users to have logged off before you can power off a machine.
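You can reproduce the trend with a few lines of simulation. This is my own minimal sketch, not Jim’s program: it assumes 32 servers, a uniformly random logoff order, and a fixed seed so the run is repeatable. All names are hypothetical.

```python
import random


def users_left_when_half_empty(servers, users_per_server, rng):
    """Log users off in random order and return how many are still logged on
    at the moment half of the servers have become empty."""
    # One entry per user, holding the index of the server that user is on.
    sessions = [s for s in range(servers) for _ in range(users_per_server)]
    rng.shuffle(sessions)
    remaining = [users_per_server] * servers
    empty = 0
    for logged_off, server in enumerate(sessions, start=1):
        remaining[server] -= 1
        if remaining[server] == 0:
            empty += 1
            if empty * 2 >= servers:
                return len(sessions) - logged_off
    return 0


def mean_logged_off_fraction(servers, users_per_server, trials=200, seed=1):
    """Average share of users that had to log off before half the servers emptied."""
    rng = random.Random(seed)
    total_users = servers * users_per_server
    left = sum(
        users_left_when_half_empty(servers, users_per_server, rng)
        for _ in range(trials)
    )
    return 1 - left / (trials * total_users)


for density in (8, 16, 32):
    share = mean_logged_off_fraction(32, density)
    print(f"{density} users per server: {share:.1%} logged off")
```

The share of users who must log off climbs as density goes up, which is the same pattern the chart above shows.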

This brings us to one of the biggest challenges with cloud deployments.

 

How Can ControlUp Help?

To maximize savings, you want to shut down your machines. To shut down the machines, you need your users to be logged off, leaving the machines empty.

This simple requirement leads to a lot of complexity. ControlUp can help by tweaking one of the variables of the equation. Using ControlUp Automate, we can increase the probability of a single user being logged off.

 

Sliding timeouts

 

Timeouts are immutable values set by winlogon.exe when a session is created. Admins who configure their timeouts to maximize organizational productivity during business hours trade off having servers online for longer, servicing idle or disconnected sessions.

ControlUp can monitor both the session states and the duration of those states. When used with ControlUp’s triggers, we can define timeouts shorter than the timeouts defined by Group Policy. ControlUp’s triggers can also follow a schedule, so different timeouts apply at different times of day.

I’ll use an example to better illustrate what this means.

Imagine you’ve configured a two-hour idle, two-hour disconnected timeout with business hours ending at 5 p.m. If every user were to walk away from their computers at exactly 5 p.m., there would be a four-hour window when the user sessions would keep the server online. Only at 9 p.m. would the server be able to shut down and deallocate.

With Sliding Timeouts, we can define something like a one-hour idle, one-hour disconnect after 5 p.m.

This would mean all sessions would be off the server by 7 p.m., resulting in a two-hour savings in the cost of the servers! And these savings will only multiply with the number of servers in your organization.
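The arithmetic behind that example, as a small sketch. The date is arbitrary and the function name is my own; the timeout values are the ones from the example.

```python
from datetime import datetime, timedelta


def server_free_at(walk_away, idle_minutes, disconnected_minutes):
    """Time at which a session abandoned at `walk_away` is finally logged off:
    it first sits idle, then sits disconnected."""
    return walk_away + timedelta(minutes=idle_minutes + disconnected_minutes)


five_pm = datetime(2024, 1, 8, 17, 0)  # arbitrary date, 5 p.m.
standard = server_free_at(five_pm, idle_minutes=120, disconnected_minutes=120)
sliding = server_free_at(five_pm, idle_minutes=60, disconnected_minutes=60)

print(standard.strftime("%H:%M"), sliding.strftime("%H:%M"), standard - sliding)
```

This assumes everyone walks away at the same moment, which is exactly the perfect-world case.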

In a perfect world, all users would walk away at the same time, and this would work GREAT.

Unfortunately, we don’t live in a perfect world. Users sometimes work overtime or might accidentally make their session active again. Looking back at the previous example with a single machine with multiple users, a single user can keep a server online for a long period of time, with session timeouts compounding the cost of keeping the machine alive.

 

Server showing session lifecycles

 

Using ControlUp to define additional timeouts, we can slide the timeouts further, making them more aggressive the later it gets after the end of business hours.

For instance, we can have another set of more aggressive timeouts, like 30 minutes idle, 30 minutes disconnected after 6 p.m. To go even further, we can make the timeouts more aggressive still: 15 minutes idle, 15 minutes disconnected after 7 p.m.
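Here is a toy model of how those tiers play out for a single abandoned session. It is only a sketch of the idea, not how ControlUp’s triggers are implemented: the tier table, the minute-by-minute check, and the function names are my own assumptions, and the schedule simply resets at midnight.

```python
from datetime import datetime, timedelta

# (start hour, idle limit in minutes, disconnected limit in minutes).
# The limits mirror the examples in the text; the structure is illustrative.
TIERS = [(0, 120, 120), (17, 60, 60), (18, 30, 30), (19, 15, 15)]
ONE_MINUTE = timedelta(minutes=1)


def limits_at(moment):
    """Return the (idle, disconnected) limits in effect at `moment`."""
    idle, disconnected = TIERS[0][1], TIERS[0][2]
    for start_hour, tier_idle, tier_disconnected in TIERS:
        if moment.hour >= start_hour:
            idle, disconnected = tier_idle, tier_disconnected
    return idle, disconnected


def logoff_time(walk_away):
    """Step forward a minute at a time, applying whichever tier is current."""
    now = walk_away
    # Phase 1: the session sits idle until it meets the current idle limit.
    while now - walk_away < timedelta(minutes=limits_at(now)[0]):
        now += ONE_MINUTE
    disconnected_at = now
    # Phase 2: it then sits disconnected until it meets the current limit.
    while now - disconnected_at < timedelta(minutes=limits_at(now)[1]):
        now += ONE_MINUTE
    return now


day = datetime(2024, 1, 8)  # arbitrary date
for hour, minute in ((10, 0), (17, 0), (18, 45)):
    start = day.replace(hour=hour, minute=minute)
    print(start.strftime("%H:%M"), "->", logoff_time(start).strftime("%H:%M"))
```

In this model, a user who walks away at 5 p.m. is logged off at 6:30 p.m. instead of 9 p.m., because by 6 p.m. their 60 idle minutes already exceed the tighter limit that has just come into effect.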

Visualizing the sliding timeouts, our multi-session server looks like this:

 

Server showing session lifecycles with sliding timeouts

 

The white line is the point in time where the first sliding timeout comes into effect. Each hour that follows, the timeouts get reduced by half. This is visualized by the bars getting thinner. When overlaid on top of the server without sliding timeouts, we see something like this:

 

Server with sliding timeouts vs server with standard timeouts

 

In this scenario, the server is freed up to be powered off three and a half hours faster. Multiply that by the number of servers you have in your environment and the per-day savings can be substantial. Depending on your number and type of servers, thousands of dollars could be saved per day!
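To put a rough number on that, here is a sketch of the multiplication. The server count and hourly rate are invented for illustration and are not Azure prices; only the three and a half hours comes from the scenario above.

```python
def daily_savings(servers, hours_saved_per_server, hourly_rate):
    """Money saved per day if every server is deallocated earlier by the
    same number of hours."""
    return servers * hours_saved_per_server * hourly_rate


# Hypothetical fleet: 500 servers at $0.60 per hour, each freed 3.5 hours sooner.
print(round(daily_savings(500, 3.5, 0.60), 2))
```

Real fleets won’t free every server by the same amount, so treat the result as an upper bound for a back-of-the-envelope estimate.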

There is an adage in the smartphone market called “race to idle” where the processors in smartphones try to complete the work given to them as fast as possible, so they can stay in a deep idle state for longer periods of time. With the cloud, this adage can be tweaked to be “race to savings.”

With ControlUp and its ability to monitor down to the second, savings can be maximized, and resources optimized. Want to set this up in your environment? Just follow this step-by-step guide:

How To: Automate Your Microsoft Azure Virtual Desktops with ControlUp