Troubleshooting is a dark art at the best of times. It involves information gathering, deducing operations based on outputs or results, and then some hypothesis on the root cause and testing the presumed. It can be made very hard if the data you need to understand a process is opaque or missing. Fortunately, ControlUp brings visibility to the darkness and augments your ability to gather information. This article goes over a challenge I faced when presented with a problem and how ControlUp was able to help solve it quickly.
A ticket came in that we were running out of disk space on the FSLogix file share. Investigation showed gigabytes of FSLogix user profile containers with “CORRUPT” in their file name. Each time a user would logon a new “CORRUPT” profile would appear. Some of these were gigabytes in size, some hundreds of megabytes, but overall it was a LOT!
Searching through the FSLogix file share revealed the extent of the issue.
Where do you begin to tackle this issue? With ControlUp of course!
I wrote a powershell script — converted to a script action within ControlUp — that will parse the FSLogix Profile.log file for the exact session where this profile container was loaded from. To execute the action within ControlUp, right-click on a user session and go to “User Environment” and then “FSLogix log parser”
A dialog will show you the execution of the action. Searching through the results of the action reveals the moment FSLogix had determined the profile was corrupt.
By reading the entries around this line I saw a “ERROR” reported giving a hint to the cause:
ERROR:00000002 Creating new user profile disk (user’s registry hive was missing) (The system cannot find the file specified.)
This script action removes any noise by focusing just on the activities of this particular user, so by focusing on the timestamps in the log I saw there was a ~8 second gap. This lies between the lines “DiskContext: mount returning after X milliseconds” and the next line flagged as ERROR “Creating new user profile disk…”
This 8 second gap implies some kind of processing is happening outside of FSLogix that leads to the failure. In order to determine what was occurring during this time frame I want to run Process Monitor (procmon) for a user logon exhibiting this issue as procmon will show everything happening on the machine during that time.
Running procmon during a user logon is easy when you have a multi-session machine like Windows Server or Windows 10/11 multi-session. But on a single-session client OS like the user I’m working with… It could be cumbersome and time consuming to set up a machine where I could execute procmon without being logged on to the machine.
Thinking about how I would do this… I would logon to the machine, setup a scheduled task (as they can be executed remotely), learn the proper command-line parameters to give to procmon, test it, and then probably snapshot the machine and then contact the user, execute the scheduled task remotely and have them logon.
That is a ton of work! And it becomes more difficult if you want to do it on a machine that is being managed by Citrix Provisioning (PVS) or a Horizon Instant Clone as you would have to update the master image(s)!
OR… I could take the very simple route of running procmon remotely on a machine using a ControlUp script action “Procmon – Trace System Activity”. Super simple, select a machine (or multiple), right-click and go under “System health > ProcMon – Trace System Activity”
In the prompt that appears I specified a duration longer than the expected logon duration and then hit “OK” and had the user logon.
The script action will output a dialog detailing where the log file will be saved and when Process Monitor will be complete with its trace.
When I examined this procmon file I was having trouble finding anything that showed a failure. I surmised that it may be due to procmon’s filter driver altitude. The way that a filter driver altitude works is a lower value will see activities on drivers above it, but not vice versa. By running the ControlUp Script Action “Filter Driver Altitude” I can get a list of the filter drivers.
To see where procmon is on the altitude list, I need to run procmon on the machine first to get the driver installed. Once that was done I ran the script action and got the following result:
(Procmon filter driver identified by the red arrow, FSLogix filter driver identified by the yellow arrow)
The Procmon driver is loaded above the FSLogix filter drivers. Activities managed by the FSLogix filter drivers (beginning with the letters frx) won’t be seen by Process Monitor.
I need to load the Procmon filter driver below the FSLogix filter drivers. The challenge, again, is to do so without having to logon to a single session desktop first. Fortunately, ControlUp offers a solution, “Run As”. This feature is available in the ControlUp Console by right-clicking on a machine, going to “Processes” and then “Run As”.
This brings up a dialog where we can specify a command to execute. I copied procmon.exe to the machine and ran the following command:
C:\swinst\procmon.exe /altitude 42000 /accepteula /quiet /runtime 1
This sets the altitude of the procmon filter driver to 42000 from its original value of 385200.
After executing the command and re-running the Filter Driver Altitude script action I confirmed procmon was now lower in altitude.
At this point I ran the “Procmon – Trace System Activity” again while the user was logging on. Once the user completed their logon I re-ran the “FSLogix Log Parser” script action to find the time frame of the “corrupt” error message. The same ~8 second gap existed again.
I found the time frames in the procmon log and right-clicked on the time before the error time in the log and chose “Exclude Events Before…” then found a time after the 8 second gap and right-clicked on the time and selected “Exclude Events After…”
In order to best see failures in Procmon, I went to the Filter menu, selected “Highlight” and added a rule “Result is not SUCCESS then include”. This will highlight all results that are not success. It’s a neat little trick to bring visibility to “errors” in the trace.
A symptom of these user logon’s is that a TEMP profile is created. This gives me some data to search for within Procmon. And I found the moment it created the Temp directory. An oddity here is FSLogix set the profile value to C:\Users\Temp. FSLogix doesn’t come up with this value, it gets it from somewhere else.
I searched up for more instances of this path. I found the moment it was created. Procmon is so good for this. You can see the first arrow from the top is the User Profile Service checking if the path exists (NOT FOUND) then it actually creates it! So cool to see this.
Why did the User Profile Service think it needed to create temp directory?
A neat behavior of Windows apps/service is they will read registry keys and values in response to stuff. I found the User Profile Service trying to read a registry value “ProfileErrorAction”. This registry value is an action that should be taken if Windows can’t load their profile. By default Windows logs on the user with a temporary profile when Windows can’t load their user profile. This is exactly what is occurring, so Windows found it can’t load the profile, and at this point needed to make a decision about what to do about it. So the error occurred before this point.
Something I recalled when looking through the event log is the User Profile Service logged an error — Event ID 1521 – “Windows cannot locate the server copy of your roaming profile and is attempting to log you on with your local profile. Changes to the profile will not be copied to the server when you log off. This error may be caused by network problems or insufficient security rights.
DETAIL – Access is denied.”
Denies typically happen on registry and file system activity. I enabled only ‘File System’ activity in procmon and found some big red flags.
A path to a file share was getting denied.
When I had the user attempt to navigate to that share they got the big bold “Windows cannot access” error message:
When processes query the registry for values, those values end up in the Details column in procmon.
I added a filter to see where or if this path was being read. Sure enough, it was! There is a Roaming Profile Group Policy value in place:
What is the end result? I had a user who couldn’t access a file share that the User Profile Service needs to create profiles. Since the User Profile Service couldn’t access the share it marked the profile as temporary — this prevents FSLogix from saving anything including the NTUSER.DAT when the user logs off. Upon next logon FSLogix noticed the missing NTUSER.DAT and marked the profile as corrupt and created a new profile container.
How did I solve this? I added the user to the appropriate group that grants access to this share. And now the User Profile Service can successfully create the profile for that user and FSLogix now creates good working profile containers.
ControlUp was built to recognize issues, assist in information gathering, and make remediation easy. Check out how ControlUp can benefit your FSLogix environment today, either take us for a test drive, download a 21 day free trial, or schedule a demo to learn more!