Windows x64 – All the Same Yet Very Different, Part 2: Kernel Memory, /3GB, PTEs, (Non-) Paged Pool
This is the second part of a mini-series on Windows x64, focusing on behind-the-scenes changes in the operating system. In the first article I explained key concepts of the x86 platform, namely that user-mode processes each get 4 GB of virtual address space, but can use only half of it, because 2 GB are used for the operating system kernel.
The Kernel – Another Beast Altogether
In order to understand why, we need to take a look at how kernel mode differs from user mode. Kernel mode basically is god mode. In kernel mode, or ring 0, you own the machine and can do everything. Like crashing the machine for good. Such crashes in ring 0 differ from “ordinary” program failures in that they cause the dreaded blue screen of death (BSOD) on Windows computers. Blue screens, by the way, are not random stops but rather more or less graceful shutdowns during which the system in most cases is still capable of accessing the hard disk and storing a memory dump file there.
Some trivia on that: You may have wondered why the page file needs to be at least as large as the physical RAM for the system to be able to create a full memory dump. This is because the file system driver code in memory may be corrupted or the driver may even be the cause of the crash. With the page file, the system has a location on hard disk guaranteed to be available. When a memory dump is created during a blue screen, the contents of the RAM are written directly to the sectors on disk occupied by the page file, bypassing the file system driver. During the next boot process the session manager subsystem (SMSS.EXE) extracts the dump (and copies it to %SYSTEMROOT%\MEMORY.DMP) before the page file is put to its proper use.
Causes of Death
To digress some more: Have you ever wondered what actually causes blue screens? Well, hardware failures usually don’t. If you have a severe hardware problem, the system is more likely to just freeze or reboot out of the blue. No, most blue screens are caused by software failures. To put it bluntly: some programmer’s carelessness is responsible for your overtime! Blue screens are caused by “misbehaving” kernel code that tries to do things it shouldn’t. Read: the current IRQL does not allow this type of access (IRQL_NOT_LESS_OR_EQUAL). Or: not checking some buffer size correctly and writing to a memory area occupied by another driver’s data or even code, which is similar to driving a tank through densely populated terrain.
Only the highest development standards and rigorous testing can harden device drivers enough to withstand even unlikely conditions or heavy load. That is the simple reason why most blue screens are not caused by Microsoft code but by some third-party driver.
What to do in case of a blue screen? You will want to know which driver caused it. In many cases, determining the culprit is amazingly simple. In others, getting to the root cause (and driver) can be extremely hard. If WinDbg does not point you in the right direction, a support call with Microsoft will probably help. However, they will need a memory dump file to analyze the crash.
You can configure several types of memory dumps to be created in case of a blue screen. Mini dumps are practically useless. Full memory dumps are too large for most purposes (try uploading an 8 GB dump file to Microsoft support – not that they mind; it just takes far too long). Kernel dumps are the way to go. This knowledge base article explains the configuration options.
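As a rough sketch of where this setting lives: the dump type is stored in the CrashDumpEnabled value under HKLM\SYSTEM\CurrentControlSet\Control\CrashControl. The helper names below are made up for illustration, and the mapping covers only the classic values (newer Windows versions add more). The registry read works on Windows only.

```python
import sys

# Classic CrashDumpEnabled values (newer Windows versions define more).
DUMP_TYPES = {
    0: "none",
    1: "complete memory dump",
    2: "kernel memory dump",
    3: "small memory dump (mini dump)",
}


def describe_dump_type(value):
    """Translate a CrashDumpEnabled value into a readable name."""
    return DUMP_TYPES.get(value, f"unknown ({value})")


def read_configured_dump_type():
    """Read the setting from the registry. Windows only."""
    import winreg  # imported here so the module still loads on other systems

    key_path = r"SYSTEM\CurrentControlSet\Control\CrashControl"
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, key_path) as key:
        value, _value_type = winreg.QueryValueEx(key, "CrashDumpEnabled")
    return describe_dump_type(value)


if __name__ == "__main__":
    if sys.platform == "win32":
        print(read_configured_dump_type())
    else:
        # No registry available: just show how a value would be decoded.
        print(describe_dump_type(2))
```

On a machine configured as recommended above, the value should decode to a kernel memory dump.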
The Tale of the Rings
Let’s now move back to the saga of the rings. An x86 processor has four protection rings: ring 0 to ring 3, with ring 3 being the least privileged. As mentioned above, the kernel operates in ring 0, which gives it full access to the hardware. Rings 1 and 2 are not used. User-mode applications execute in ring 3, each cosily encapsulated in the safety “bubble” of virtual memory. Interestingly, with the recent addition of hardware virtualization technology to the CPU, the x86 architecture now offers a new “ring -1” that can be used by hypervisors to control ring 0 hardware access.
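To make the rings a little more tangible: on x86, the low two bits of the selector loaded into the CS register hold the current privilege level. The minimal sketch below decodes that field. The two selector values are the ones commonly seen on 32-bit Windows and are included here as an assumption for illustration, not as something this article has verified.

```python
def privilege_level(selector):
    """Return the ring number encoded in the low two bits of a selector."""
    return selector & 0b11


# Assumed example values: code segment selectors commonly seen on 32-bit Windows.
KERNEL_CS = 0x08
USER_CS = 0x1B

for name, selector in (("kernel", KERNEL_CS), ("user", USER_CS)):
    print(f"{name} code segment {selector:#04x} runs in ring {privilege_level(selector)}")
```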
Why 2 GB Are Not Enough…
With all this background information passed on, we can move on to the inherent limitations built into the x86 architecture. As mentioned earlier, the kernel has only 2 GB of global memory available for itself that remains the same no matter what user-mode process is currently active. In those 2 GB it needs to keep track of every process and thread started, of every file opened, of every network connection, even of every single registry handle opened by applications. While one handle to a registry key or a file does not amount to much, hundreds of thousands of them can occupy a lot of memory. The more applications are running, the more handles and various other memory structures are needed. For that reason terminal servers, which typically have 50-60 active users, often run out of available kernel memory before other system resources are depleted.
The kernel’s memory consists of several areas, most notably “paged pool” (memory pages that can be swapped out), “non-paged pool” (pages that need to reside in RAM) and “System PTEs” (page table entries). Each of these areas has a fixed maximum size that is calculated during system startup. Determining both the current and maximum values for a given system is easy once you know how to do it:
…And How to Check
- Install the Debugging Tools for Windows
- Download and unzip LiveKd from Sysinternals
- If you installed the Debugging Tools into a different folder than \Program Files\Microsoft\Debugging Tools for Windows then you need to copy LiveKd into the installation directory of the Debugging Tools
- Run LiveKd.exe and enter a path to an existing empty directory to be used as symbol path.
LiveKd is an awesome tool that takes a memory dump snapshot of the running system. The dump file can then be analyzed with the console debugger kd.exe which is started by LiveKd.
Type !vm to generate a memory usage list. For our purposes, only a few lines of the output are required:
- Free System PTEs tells you how many PTEs are still available.
- NonPagedPool Usage tells you how much non-paged pool is in use.
- NonPagedPool Max tells you the total size of non-paged pool memory.
- PagedPool Usage tells you how much paged pool is in use.
- PagedPool Maximum tells you the size of paged pool memory.
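If you check these values regularly, a few lines of script can turn them into utilization percentages. The following is only a sketch: the sample text is a hypothetical excerpt written in the style of !vm output, not taken from a real machine. The exact layout may differ between debugger versions, so the regular expression is an assumption you may need to adjust.

```python
import re

# Hypothetical excerpt in the style of kd's !vm output. Real output contains
# many more lines, and the numbers here are invented for illustration.
SAMPLE = """\
Free System PTEs:     184645 (    738580 Kb)
NonPagedPool Usage:     6079 (     24316 Kb)
NonPagedPool Max:      65536 (    262144 Kb)
PagedPool Usage:        8689 (     34756 Kb)
PagedPool Maximum:     43008 (    172032 Kb)
"""

# Matches lines of the form "Name:  <pages> ( <kilobytes> Kb)".
LINE = re.compile(
    r"^\s*(?P<name>[^:]+):\s*(?P<pages>\d+)\s*\(\s*(?P<kb>\d+)\s*Kb\)",
    re.IGNORECASE,
)


def parse_vm(text):
    """Map each counter name to its size in kilobytes."""
    values = {}
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            values[match.group("name").strip()] = int(match.group("kb"))
    return values


def usage_percent(values, usage_key, max_key):
    """Return how much of a pool is in use, as a percentage of its maximum."""
    return 100.0 * values[usage_key] / values[max_key]


vm = parse_vm(SAMPLE)
nonpaged = usage_percent(vm, "NonPagedPool Usage", "NonPagedPool Max")
paged = usage_percent(vm, "PagedPool Usage", "PagedPool Maximum")
print(f"non-paged pool: {nonpaged:.1f}% used")
print(f"paged pool:     {paged:.1f}% used")
print(f"free system PTEs: {vm['Free System PTEs']} KB")
```

A pool that is close to 100% used, or a very low number of free system PTEs, is the kind of kernel memory shortage described above.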
Hint: The Debugging Tools have rather unusual commands. In order to quit kd, press CTRL+B followed by ENTER.
Now that we know that kernel memory is a scarce resource on a terminal server let us examine an optimization tip sometimes found in shady corners of the internet. According to some, using the /3GB switch in boot.ini increases available memory for applications. While that is technically true, let me phrase my short answer like this: NEVER, EVER use that switch on a terminal server!
The long answer comes here: The /3GB switch restricts kernel memory to 1 GB, essentially cutting down the kernel’s memory resources by 50%. This does indeed free up memory for applications (which live in user mode), but you should ask yourself two questions:
- Is that even necessary?
- At what cost?
Typically only large databases need more than 2 GB of virtual memory. If you have such an application, consider using /PAE (which helps only if the application uses AWE to reach the additional physical memory) or, even better, moving to an x64 system.
Terminal servers, on the other hand, typically run a myriad of applications in parallel, each of which has a relatively small memory footprint. In such a scenario, managing all the applications is far more costly in terms of memory consumption than on a single-app database server. On terminal servers, the memory needed for the management metadata typically is the first resource to become depleted. And that metadata is stored – in the kernel. It is therefore counterproductive to trim the kernel’s memory when you should expand it.
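The effect of the /3GB switch on the 32-bit address space can be worked out with simple arithmetic. The sketch below is a simplified model: it treats the split as a single boundary address and ignores the small reserved regions a real system keeps around that boundary. The function name is made up for illustration.

```python
GB = 1 << 30
ADDRESS_SPACE = 1 << 32  # 4 GB of virtual addresses on 32-bit x86


def split(user_gb):
    """Return (user_range, kernel_range, kernel_size) for a given user share."""
    boundary = user_gb * GB
    user_range = (0x00000000, boundary - 1)
    kernel_range = (boundary, ADDRESS_SPACE - 1)
    return user_range, kernel_range, ADDRESS_SPACE - boundary


for label, user_gb in (("default", 2), ("/3GB", 3)):
    user, kernel, kernel_size = split(user_gb)
    print(
        f"{label:8} user {user[0]:#010x}-{user[1]:#010x}  "
        f"kernel {kernel[0]:#010x}-{kernel[1]:#010x}  "
        f"({kernel_size // GB} GB for the kernel)"
    )
```

In this model the kernel range starts at 0x80000000 by default and at 0xC0000000 with /3GB, which is the 50% cut mentioned above.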
With Windows Server 2008 Microsoft introduced an interesting feature that should help use kernel memory more effectively: Dynamic Kernel Address Space. Server 2008 can reassign kernel memory between the different kernel memory areas, which might help W2k8-based terminal servers perform better than their predecessors.
This concludes the section on the 32-bit x86 architecture. In the next articles of this series, I will explain how x64 differs from x86 and present all those tiny little details one needs to know in order to really understand what is going on.