IT troubleshooting is a dark art at the best of times. It involves information gathering, deducing operations based on outputs or results, and then formulating a hypothesis on the root cause and testing the hypothesis. It can be made very hard if the data you need to understand a process is opaque or missing. Fortunately, ControlUp brings visibility to the darkness and augments your ability to gather information. This article details how I used ControlUp to quickly diagnose and resolve a significant challenge: troubleshooting FSLogix corrupt user profiles in a virtual desktop environment.
A ticket came in that we were running out of disk space on the FSLogix file share. Investigation showed gigabytes of FSLogix user profile containers with “CORRUPT” in their file name. Each time a user logs on, a new “CORRUPT” profile would appear. Some of these were gigabytes in size, some hundreds of megabytes, but overall it was a LOT!
Searching through the FSLogix file share revealed the extent of the issue.
Where do you begin to tackle this issue? With ControlUp, of course!
I wrote a PowerShell script—converted to a script action within ControlUp—that will parse the FSLogix Profile.log file for the exact session from which this profile container was loaded. To execute the action within ControlUp, right-click on a user session, go to User Environment, and then FSLogix log parser.
Upon execution, a dialog displays the action’s results, enabling you to pinpoint precisely when FSLogix flagged the profile as corrupt.
By reading the entries around this line, I saw an “ERROR” reported, giving a hint to the cause:
ERROR:00000002 Creating new user profile disk (user’s registry hive was missing) (The system cannot find the file specified.)
This script action removes any noise by focusing just on the activities of this particular user, so by focusing on the timestamps in the log, I saw there was about an 8-second gap. This lies between the lines “DiskContext: mount returning after X milliseconds” and the next line flagged as ERROR “Creating new user profile disk…”
This 8-second gap implies that some kind of processing is happening outside of FSLogix that leads to the failure. In order to determine what was occurring during this time frame, I want to run Process Monitor (ProcMon) for a user login exhibiting this issue, as ProcMon will show everything happening on the machine during that time.
Running procmon during a user logon is easy when you have a multi-session machine like Windows Server or Windows 10/11 multi-session. But on a single-session client OS like the user I’m working with… It could be cumbersome and time-consuming to set up a machine where I could execute ProcMon without being logged on to the machine.
I’m thinking about how I would do this… I would logon to the machine, set up a scheduled task (as they can be executed remotely), learn the proper command-line parameters to give to procmon, test it, and then probably snapshot the machine. Then, I would contact the user, execute the scheduled task remotely, and have them log in.
That is a ton of work! It becomes more difficult if you want to do it on a machine managed by Citrix Provisioning (PVS) or a Horizon Instant Clone, as you would have to update the master image(s)!
OR… I could take the very simple route of running ProcMon remotely on a machine using a ControlUp script action, “ProcMon—Trace System Activity.” It’s super simple: select a machine (or multiple), right-click, and go under System health > ProcMon—Trace System Activity.
In the prompt that appears, I specified a duration longer than the expected logon duration, hit OK, and had the user log on.
The script action will output a dialog detailing where the log file will be saved and when ProcMon will be complete with its trace.
When I examined this ProcMon file, I had trouble finding anything that showed a failure. I surmised that it may be due to ProcMon’s filter driver altitude. A lower value will see activities on drivers above it, but not vice versa. I can get a list of filter drivers by running the ControlUp Script Action: Filter Driver Altitude.
To see where ProcMon is on the altitude list, I needed to run ProcMon on the machine first to get the driver installed. Once that was done, I ran the script action and got the following result:
The ProcMon driver is loaded above the FSLogix filter drivers. Activities managed by the FSLogix filter drivers (beginning with the letters frx) won’t be seen by Process Monitor.
I need to load the ProcMon filter driver below the FSLogix filter drivers. The challenge, again, is to do so without having to log on to a single session desktop first. Fortunately, ControlUp offers a solution: Run As. This feature is available in the ControlUp console by right-clicking on a machine, going to Processes, and selecting Run As.
This brings up a dialog where we can specify a command to execute. I copied procmon.exe to the machine and ran the following command:
C:\swinst\procmon.exe /altitude 42000 /accepteula /quiet /runtime 1
This sets the altitude of the ProcMon filter driver to 42,000 from its original value of 385,200.
After executing the command and re-running the Filter Driver Altitude script action, I confirmed that the ProcMon was now lower in altitude.
At this point, I ran the Procmon – Trace System Activity again while the user was logging on. Once the user completed their logon, I re-ran the FSLogix Log Parser script action to find the time frame of the “corrupt” error message. The same ~8-second gap existed again.
I found the time frames in the Procmon log, right-clicked on the time before the error time in the log, and chose Exclude Events Before… Then, I found a time after the 8-second gap, right-clicked on the time, and selected Exclude Events After…
In order to see failures in ProcMon more clearly, I went to the Filter menu, selected Highlight, and added a rule Result is not SUCCESS, then include. This will highlight all results that are not successful. It’s a neat little trick to bring visibility to “errors” in the trace.
A symptom of these user logons is that a Temp profile is created. This gives me some data to search for within ProcMon. I found the moment it made the Temp directory. An oddity here is that FSLogix set the profile value to C: UsersTemp. FSLogix doesn’t come up with this value; it gets it from somewhere else.
I searched for more instances of this path and found the moment it was created. ProcMon is so good for this. You can see the first arrow from the top is the User Profile Service checking if the path exists (NOT FOUND), and then it actually creates it! It’s so cool to see this.
Why did the User Profile Service think it needed to create a Temp directory?
A neat behavior of Windows apps/service is that they will read registry keys and values in response to stuff. I found the User Profile Service trying to read a registry value ProfileErrorAction. This registry value is an action that should be taken if Windows can’t load its profile. By default, Windows logs on the user with a temporary profile when Windows can’t load their user profile. This is exactly what is occurring, so Windows found it couldn’t load the profile, and at this point, it needed to make a decision about what to do about it. So the error occurred before this point.
When looking through the event log, I remembered that the User Profile Service logged an error:
Event ID 1521—”Windows cannot locate the server copy of your roaming profile and is attempting to log you on with your local profile. Changes to the profile will not be copied to the server when you log off. This error may be caused by network problems or insufficient security rights.
DETAIL – Access is denied.”
Denials typically happen on registry and file system activity. I enabled only File System activity in ProcMon and found some big red flags.
A path to a file share was getting denied.
When I had the user attempt to navigate to that share, they got the big, bold Windows cannot access error message:
When processes query the registry for values, those values end up in the Details column in ProcMon.
I added a filter to see where or if this path was being read. Sure enough, it was! There is a Roaming Profile Group Policy value in place:
What is the end result? I had a user who couldn’t access a file share on which the User Profile Service needs to create profiles. Since the User Profile Service couldn’t access the share, it marked the profile as temporary, preventing FSLogix from saving anything, including the NTUSER.DAT, when the user logs off. Upon the next login, FSLogix noticed the missing NTUSER.DAT, marked the profile as corrupt, and created a new profile container.
How did I solve this? I added the user to the appropriate group that grants access to this share. Now, the User Profile Service can successfully create the profile for that user, and FSLogix can create good, working profile containers.
ControlUp was built to recognize issues, assist in information gathering, and make remediation easy.
Check out how ControlUp can benefit your FSLogix environment today. Schedule a personalized demo today!