Since the beginning, Linux has mapped the kernel's memory into the address space of every running process. There are solid performance reasons for doing this, and the processor's memory-management unit can ordinarily be trusted to prevent user space from accessing that memory. More recently, though, some more subtle security issues related to this mapping have come to light, leading to the rapid development of a new patch set that ends this longstanding practice for the x86 architecture.
Some address-space history
On 32-bit systems, the address-space layout for a running process dedicated the bottom 3GB (0x00000000 to 0xbfffffff) to user space and the top 1GB (0xc0000000 to 0xffffffff) to the kernel. Each process saw its own memory in the bottom 3GB, while the kernel-space mapping was the same for all. On an x86-64 system, the user-space virtual address space goes from zero to 0x7fffffffffff (the bottom 47 bits), while kernel-space mappings are scattered in the range above 0xffff880000000000. While user space can, in some sense, see the address space reserved for the kernel, it has no actual access to that memory.
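The user-space half of that layout is easy to observe. This little program (written for this article; it is not from the kernel) prints a few of its own addresses, all of which will fall below the 0x7fffffffffff boundary on an x86-64 system:

    /* Print some user-space addresses; on x86-64 they all fall below
     * 0x7fffffffffff, while kernel addresses (0xffff880000000000 and
     * up) can never be dereferenced from here. */
    #include <stdio.h>
    #include <stdlib.h>

    int global_var;

    int main(void)
    {
        int stack_var;
        void *heap = malloc(1);

        printf("code:  %p\n", (void *)main);
        printf("data:  %p\n", (void *)&global_var);
        printf("heap:  %p\n", heap);
        printf("stack: %p\n", (void *)&stack_var);
        free(heap);
        return 0;
    }

The exact addresses will vary from run to run (user space has its own layout randomization), but all of them will be in the lower half of the address space.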
This mapping scheme has caused problems in the past. On 32-bit systems, it limits the total size of a process's address space to 3GB, for example. The kernel-side problems are arguably worse, in that the kernel can only directly access a bit less than 1GB of physical memory; using more memory than that required the implementation of a complicated "high memory" mechanism. 32-bit systems were never going to be great at using large amounts of memory (for a 20th-century value of "large"), but keeping the kernel mapped into user space made things worse.
Nonetheless, this mechanism persists for a simple reason: getting rid of it would make the system run considerably slower. Keeping the kernel permanently mapped eliminates the need to flush the processor's translation lookaside buffer (TLB) when switching between user and kernel space, and it allows the TLB entries for kernel space to never be flushed.
Flushing the TLB is an expensive operation for a couple of reasons: having to go to the page tables to repopulate the TLB hurts, but the act of performing the flush itself is slow enough that it can be the biggest part of the cost.
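To make that concrete: the classic way to flush the TLB on x86 is simply to write the CR3 register back to itself, which discards all non-global TLB entries. The sketch below (close in spirit to the kernel's native_flush_tlb(), but not the actual code) shows the idiom; after such a write, every memory access must walk the page tables again until the TLB has been repopulated:

    /* Reloading CR3 with its current value flushes all non-global
     * TLB entries; the page-table walks needed to refill the TLB
     * afterward are a large part of the cost. Ring-0 code only. */
    static inline void flush_tlb_nonglobal(void)
    {
        unsigned long cr3;

        asm volatile("mov %%cr3, %0" : "=r" (cr3));
        asm volatile("mov %0, %%cr3" : : "r" (cr3) : "memory");
    }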
Back in 2003, Ingo Molnar implemented a different mechanism, in which user space and kernel space each got a full 4GB address space and the processor would switch between them on every transition between user and kernel mode. The "4G/4G" mechanism solved problems for some users and was shipped by some distributors, but the associated performance cost ensured that it never found its way into the mainline kernel. Nobody has seriously proposed separating the two address spaces since.
Rethinking the shared address space
On contemporary 64-bit systems, the shared address space no longer constrains the amount of virtual memory that can be addressed, but there is another problem, this one related to security. An important technique for hardening the system is kernel address-space layout randomization (KASLR), which randomizes the placement of the kernel in the virtual address space at boot time. By denying an attacker the knowledge of where the kernel lives in memory, KASLR makes many types of attack considerably more difficult. As long as the actual location of the kernel does not leak to user space, attackers will be left groping in the dark.
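The randomization is easy to see on a running system. This fragment (written for this article) digs the kernel's _stext symbol out of /proc/kallsyms; on a KASLR-enabled kernel, the address printed will differ from one boot to the next, and unprivileged users will typically see only zeroes once kptr_restrict is set:

    /* Print the kernel's _stext address from /proc/kallsyms. */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char line[256];
        FILE *f = fopen("/proc/kallsyms", "r");

        if (!f)
            return 1;
        while (fgets(line, sizeof(line), f))
            if (strstr(line, " T _stext"))
                fputs(line, stdout);
        fclose(f);
        return 0;
    }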
The problem is that this information leaks out in many ways. Many of those leaks date back to simpler days, when kernel addresses were not seen as sensitive information; it even turns out that your editor introduced one such leak in 2003, when nobody was worried about exposing it. More recently, a concerted effort has been made to close off the direct leaks from the kernel, but none of that will be of much benefit if the hardware itself reveals the kernel's location. And that would appear to be exactly what is happening.
This paper from Daniel Gruss et al. [PDF] cites a number of hardware-based attacks on KASLR. These attacks use techniques like exploiting timing differences in fault handling, observing the behavior of prefetch instructions, or forcing faults using the Intel TSX (transactional memory) instructions. There are rumors circulating that other such channels exist but have not yet been disclosed. In all of these cases, the processor responds differently to a memory-access attempt depending on whether the target address is mapped in the page tables, regardless of whether the running process can actually access that location. These differences can be used to find where the kernel has been placed, all without making the kernel aware that an attack is underway.
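The flavor of these attacks can be conveyed with a deliberately simplified user-space sketch; it was written for this article (the paper's actual probes are more careful), and the probe address is a hypothetical kernel-space location. The idea is to touch an address the process has no right to access, catch the resulting SIGSEGV, and time the round trip; on affected processors, that timing differs depending on whether the address is mapped:

    #include <setjmp.h>
    #include <signal.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>

    static sigjmp_buf env;

    static void segv_handler(int sig)
    {
        siglongjmp(env, 1);
    }

    /* Time an access to an address we cannot read; the fault
     * latency can betray whether a kernel mapping exists there. */
    static uint64_t time_probe(const volatile char *addr)
    {
        uint64_t start = __rdtsc();

        if (!sigsetjmp(env, 1))
            (void)*addr;        /* always faults from user mode */
        return __rdtsc() - start;
    }

    int main(void)
    {
        signal(SIGSEGV, segv_handler);
        /* Hypothetical probe address in the kernel's range: */
        printf("%llu cycles\n", (unsigned long long)
               time_probe((const volatile char *)0xffff880000000000UL));
        return 0;
    }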
Fixing information leaks in the hardware is difficult and, in any case, deployed systems are likely to remain vulnerable. But there is a viable defense against these information leaks: removing the kernel's mappings from the page tables entirely while user space is running.
In other words, it would seem that the practice of mapping the kernel into user space needs to end in the interest of hardening the system.
KAISER
The paper linked above provided an implementation of separated address spaces for the x86-64 kernel; the authors called it "KAISER", which evidently stands for "kernel address isolation to have side-channels efficiently removed". This implementation was not suitable for inclusion into the mainline, but it was picked up and heavily modified by Dave Hansen; the resulting patch set (still called "KAISER") is in its third revision and seems likely to find its way upstream in a relatively short period of time.
Whereas current systems have a single set of page tables for each process, KAISER implements two. One set is essentially unchanged; it includes both kernel-space and user-space addresses, but it is only used when the system is running in kernel mode. The second "shadow" page table contains a copy of all of the user-space mappings, but leaves out the kernel side. Instead, there is a minimal set of kernel-space mappings that provides the information needed to handle system calls and interrupts, but no more.
Copying the page tables may sound inefficient, but the copying only happens at the top level of the page-table hierarchy (the page global directory, or PGD); all of the lower-level tables are shared, so the bulk of that data exists only once.
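The details are subject to change as the patches evolve but, as a sketch (with hypothetical function names; this is not the patch set's actual code), one natural arrangement is to allocate the two top-level tables as an adjacent pair, so that the low-level entry and exit code can find either one from the other with simple arithmetic:

    /* Keep the kernel PGD and the shadow PGD in a single order-1
     * (8KB) allocation, one page apart. */
    #include <linux/gfp.h>
    #include <asm/pgtable.h>

    static pgd_t *alloc_pgd_pair(void)
    {
        return (pgd_t *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, 1);
    }

    static pgd_t *shadow_pgd(pgd_t *pgd)
    {
        /* The user-mode copy lives one page above the kernel copy. */
        return (pgd_t *)((unsigned long)pgd + PAGE_SIZE);
    }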
Whenever a process is running in user mode, the shadow page tables will be active. The bulk of the kernel's address space will thus be completely hidden from the process, defeating the known hardware-based attacks. Whenever the system needs to switch to kernel mode, in response to a system call, an exception, or an interrupt, for example, a switch to the full page tables will be made. The code that manages the return to user space must then make the shadow page tables active again.
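The switch itself is a single write to the CR3 register. The real entry and exit paths are written in assembly, but a C rendering of the idea (a sketch only, assuming the paired-PGD layout described above) might look like:

    /* Enter the kernel on the full page tables; return to user
     * space on the shadow set. kernel_pgd_pa is the physical
     * address of the kernel's top-level page table. */
    static inline void switch_to_kernel_cr3(unsigned long kernel_pgd_pa)
    {
        asm volatile("mov %0, %%cr3" : : "r" (kernel_pgd_pa) : "memory");
    }

    static inline void switch_to_shadow_cr3(unsigned long kernel_pgd_pa)
    {
        asm volatile("mov %0, %%cr3"
                     : : "r" (kernel_pgd_pa + PAGE_SIZE) : "memory");
    }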
The defense provided by KAISER is not complete, in that a small amount of kernel information must still be present to manage the switch back to kernel mode. In the patch description, Hansen wrote:
The minimal kernel page tables try to map only what is needed to enter/exit the kernel such as the entry/exit functions, interrupt descriptors (IDT) and the kernel trampoline stacks. This minimal set of data can still reveal the kernel's ASLR base address. But, this minimal kernel data is all trusted, which makes it harder to exploit than data in the kernel direct map which contains loads of user-controlled data.
While the patch does not mention it, one could imagine that, if the presence of the remaining information turns out to give away the game, it could probably be located separately from the rest of the kernel at its own randomized address.
The performance concerns that drove the use of a single set of page tables have not gone away, of course. More recent processors offer some help, though, in the form of process-context identifiers (PCIDs). These identifiers tag entries in the TLB; lookups in the TLB will only succeed if the associated PCID matches that of the thread running in the processor at the time. Use of PCIDs eliminates the need to flush the TLB at context switches; that reduces the cost of switching page tables during system calls considerably. Happily, the kernel got support for PCIDs during the 4.14 development cycle.
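In concrete terms, when PCIDs are enabled (via the CR4.PCIDE bit), the low twelve bits of the value written to CR3 carry the PCID, and setting bit 63 asks the processor not to flush the TLB entries tagged with that PCID. A small sketch (written for this article) of how such a value is assembled:

    #include <stdint.h>

    #define CR3_PCID_MASK  0xfffULL      /* bits 0-11: the PCID */
    #define CR3_NOFLUSH    (1ULL << 63)  /* keep this PCID's TLB entries */

    /* Build a CR3 value that switches to the page tables at pgd_pa
     * without flushing the TLB entries tagged with pcid. */
    static inline uint64_t build_cr3(uint64_t pgd_pa, uint64_t pcid)
    {
        return pgd_pa | (pcid & CR3_PCID_MASK) | CR3_NOFLUSH;
    }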
Even so, there will be a performance penalty to pay when KAISER is in use:
KAISER will affect performance for anything that does system calls or interrupts: everything. Just the new instructions (CR3 manipulation) add a few hundred cycles to a syscall or interrupt. Most workloads that we have run show single-digit regressions. 5% is a good round number for what is typical. The worst we have seen is a roughly 30% regression on a loopback networking test that did a ton of syscalls and context switches.
Not that long ago, a security-related patch with that kind of performance penalty would not have even been considered for mainline inclusion. Times have changed, though, and most developers have realized that a hardened kernel is no longer optional. Even so, there will be options to enable or disable KAISER, perhaps even at run time, for those who are unwilling to take the performance hit.
All told, KAISER has the look of a patch set that has been put onto the fast track. It emerged nearly fully formed and has immediately seen a lot of attention from a number of core kernel developers. Linus Torvalds is clearly in support of the idea, though he has naturally pointed out a number of things that, in his opinion, could be improved. Nobody has talked publicly about time frames for merging this code, but 4.15 might not be entirely out of the question.