The world is made of pages

Is your memory actually isolated, or is the OS just lying to you? In this part, we tear down the Virtual Memory illusion. We’ll walk page tables by hand, dissect the hardware mechanics, and see exactly what happens when your pointer hits the metal.

1. Getting mentally prepared

What we are about to discuss in this part of the series can be confusing at first, so let’s start with an analogy for how it all fits together before we dive into the technical details.

Paging is a mechanism used by the OS and hardware to manage memory. It divides memory into equal-sized blocks (4 KB / 4096 bytes), called Pages on the virtual side and Frames on the physical side. The OS is responsible for setting up and managing those pages, along with the data structures the CPU needs to translate a page’s virtual address into a physical one. That’s all you need to know for now!

1.1 The spy analogy

Imagine you are a spy trying to decode a secret message hidden inside a massive library. The message is simple: “MEET AT DAWN.”

But you don’t have the message. You only have a slip of paper with a sequence of numbers for each word :

  1. First word : Row 5 -> Shelf 2 -> Book 10 -> Word 50
  2. Second word : Row 5 -> Shelf 3 -> Book 12 -> Word 20
  3. Third word : Row 5 -> Shelf 1 -> Book 15 -> Word 55

To find just that first word, you have to follow the chain:

  1. The Master Map (Page Directory): You walk into the library. You look at the Main Map for Row 5.
    • It points you to a specific aisle.
  2. The Aisle Index (Page Table): You walk down the aisle to Shelf 2.
    • On this shelf, there isn’t just one book; there are hundreds. You use the shelf’s label to locate Book 10.
  3. The Book (The Page): You pull Book 10 off the shelf and open it.
    • This book is exactly 1024 pages long.
  4. The Decoder (The Offset): You don’t read the whole book. You count to Word 50 from the beginning of that book.
    • You look at that exact spot, and there it is: “MEET” (The Data).

The “Aha!” Moment:

  • You need a full sequence of lookups just to find each word; a word here is represented by a virtual address in the computing context.
  • To read the full sentence (“MEET AT DAWN”), the CPU has to repeat this entire process for every single virtual address, billions of times a second (We will discuss some optimizations briefly).
  • The Virtual Address is just that slip of paper with the coordinates.
  • The Physical RAM is the library where the words are actually written.

1.2 The paging glossary

First, let’s cover some of the terminology used throughout this post. You don’t have to fully comprehend what each term stands for yet, as we will explain everything in detail.

  • Virtual Address (VA): The “fake” address used by software (Process X). It doesn’t exist on a memory chip; it is just a request.
  • Physical Address (PA): The “real” electrical location on the RAM stick.
  • Page: A fixed-size block of Virtual memory (usually 4 KB).
  • Frame: A fixed-size block of Physical memory (usually 4 KB). The goal of paging is simply to map a Page to a Frame.
  • Physical frame number (PFN): The sequential number representing the physical frame.
  • Page Table (The Structure): A generic term for any array of pointers used in translation. Whether it’s a Directory or a Table, they are all just 4 KB pages filled with entries (1024 4-byte entries in classic 32-bit paging, 512 8-byte entries in PAE and long mode).
  • Entry (PTE, PDE, etc.): A single row in a table. It contains the Physical Address of the next level (or the final frame) plus permission bits (Read/Write/Execute).
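
To make the page/offset split concrete, here is a tiny Python sketch (illustrative only; the CPU does this with bit masks in hardware):

```python
PAGE_SIZE = 4096  # 4 KB pages/frames

def split_va(va):
    """Split a virtual address into its page number and byte offset."""
    page_number = va // PAGE_SIZE   # which page the address lives in
    offset = va % PAGE_SIZE         # position inside that page
    return page_number, offset

# 0x12345678 sits in page 0x12345, at byte 0x678 inside it
print(split_va(0x12345678))  # (74565, 1656)
```

The page number is what the translation machinery maps to a frame; the offset is carried over unchanged.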

2. The mechanics (32-bit legacy)

The “Classic” 32-bit Paging was introduced with the Intel 80386. This is the simplest form of paging, and understanding it makes the complex modern versions easier to grasp.

To enable paging, the OS needs to set CR0.PG = 1. From that moment on, the Paging Unit intercepts every single memory access. You cannot bypass it. (Remember this ‘no bypass’ rule; we will discuss its implications later on.)

Now we dive into the structures used when paging is enabled. This is where the analogy we previously discussed comes into play. Try to map the analogy onto what we will be explaining; that should make it easier.

2.1 The two-level hierarchy

The first paging model uses a two-level table hierarchy (often called multi-level, or hierarchical, page tables). In this hierarchy, to translate a linear address to a physical one while paging is enabled, the MMU (specifically the paging unit) splits the 32-bit virtual address into three pieces (similar to what happens in segmentation):

32-bit virtual address

| Bits | Size | Total entries | Entry size (bytes) | Addressable space | Name | Description |
| --- | --- | --- | --- | --- | --- | --- |
| 22-31 | 10 | 1024 | 4 | 1024 * 4 MB = 4 GB | PD | Index into page directory |
| 12-21 | 10 | 1024 | 4 | 1024 * 4 KB = 4 MB | PT | Index into page table |
| 0-11 | 12 | 4096 (bytes) | 1 | - | Offset | Address position from frame base address |

These three pieces are used as indexes into the tables mentioned below :

  • Page directory — A table that contains 1024 4-byte entries; each entry is a pointer to a page table (the lower-level table).
  • Page table — Another table consisting of 1024 4-byte entries; each entry describes the 4 KB physical frame corresponding to the virtual address.

This hierarchy is also known as a multi-level page table.

2-Level page table

Why two tables instead of one?

You might be wondering: why two tables, why not just one? The goal is to reduce the amount of RAM required for per-process page tables:

  • Each process must have a page directory assigned to it.
  • If we combined it all into one table, it would need 1024 * 1024 (2^20) entries allocated contiguously so the CPU could walk it to find the required page. At 4 bytes per entry, that is 4 MB (2^20 * 4 bytes), and remember, this is per process.
  • Based on that, running 100 processes would require allocating 400 MB just for the tables, before any actual data is allocated in memory.
  • The two-level scheme saves memory by keeping only the page directory always present, then allocating page tables and memory pages as needed and updating the directory and tables accordingly.
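
The arithmetic above can be checked in a few lines of Python (the 16 MB process below is a made-up example to show the savings):

```python
ENTRY_SIZE = 4   # bytes per entry in classic 32-bit paging
ENTRIES = 1024   # entries per 4 KB table

# One flat table: 2^20 entries, contiguous, per process
flat_table = (2 ** 20) * ENTRY_SIZE
print(flat_table // 2 ** 20, "MB per process")              # 4 MB per process
print(100 * flat_table // 2 ** 20, "MB for 100 processes")  # 400 MB

# Two-level scheme, for a process that maps only 16 MB of memory:
# one page directory plus 4 page tables (each PT covers 4 MB)
sparse = (1 + 4) * ENTRIES * ENTRY_SIZE
print(sparse // 1024, "KB instead")  # 20 KB instead
```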

The page table entry (PTE)

Every entry in those tables is essentially a page table entry, and in most cases they all share the same structure. In the final table of the hierarchy, the entry is used to locate the actual physical frame corresponding to the virtual address being translated. Since the least significant 12-bit offset of the virtual address dictates the exact location within the frame, it would be wasteful to use the entire 32 bits to store only the frame address. Because of that, hardware designers used only the most significant 20 bits of the PTE to store the physical frame number (PFN), leaving the remaining 12 bits for flags, which we will discuss next.

32-bit Page table entry

| Bits | Size | Name | Description |
| --- | --- | --- | --- |
| 0 | 1 | Present | 1 = Page is in RAM. 0 = Page Not Present (Triggers #PF Page Fault if accessed). |
| 1 | 1 | R/W | 0 = Read Only. 1 = Read/Write. (Note: the WP bit in CR0 determines if Supervisor can write to Read-Only pages.) |
| 2 | 1 | U/S | 0 = Supervisor (Kernel) Only. 1 = User Mode Allowed. (This acts as the primary gate.) |
| 5 | 1 | Accessed | 1 = CPU sets this when software reads or writes to the page. Used by the OS for “Least Recently Used” (LRU) swapping algorithms. |
| 6 | 1 | Dirty | 1 = CPU sets this when software writes to the page. Used by the OS to know if a page needs to be saved to disk before swapping. |
| 12-31 | 20 | PFN | The upper bits of the Physical Address of the 4KB page. |

Note: I have intentionally omitted some of the flags here, to avoid complicating things, the final table will have all the flags for reference, and we will discuss each flag when necessary as we go through the series.
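
As a sketch, pulling the flags and the PFN out of a raw 32-bit entry is just bit masking; the field names below mirror the table above:

```python
def decode_pte(pte):
    """Extract the flag bits and PFN from a raw 32-bit page table entry."""
    return {
        "present":  bool(pte & (1 << 0)),
        "writable": bool(pte & (1 << 1)),
        "user":     bool(pte & (1 << 2)),
        "accessed": bool(pte & (1 << 5)),
        "dirty":    bool(pte & (1 << 6)),
        "pfn":      pte >> 12,  # top 20 bits
    }

# An entry pointing at PFN 0x9000 with Present + R/W set:
entry = (0x9000 << 12) | 0b11
print(decode_pte(entry))
```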

2.1.1 The Hardware talking to the OS

Before we explain how addresses are translated, we must answer a couple of questions: What happens if the MMU looks for a page and doesn’t find it? What if the permissions aren’t properly set? That’s where the flags in the PTE come into play.

The MMU is purely hardware; it cannot help itself. It only knows how to walk the tables. If a check fails along the way, it raises an exception that the OS must handle, because on its own it would just have a panic attack!

With every page table entry it runs some checks on the flags; if the entry passes those checks, it continues to the next step. Here are some examples:

  • Is the Present (P) bit 1? If 0, the CPU stops and throws a Page Fault (#PF).
  • Is the User/Supervisor (U/S) bit valid? If the code is running in User Mode (Ring 3) but this bit says “Kernel Only,” the CPU raises an access violation.
  • Is the Read/Write (R/W) bit valid? Does the access match the page’s permission, read-only versus read/write?

For example, if the Present bit is 0, the CPU will raise a #PF exception. This is the exact moment “the hardware talks to the OS.”

  1. The CPU pauses the currently executing thread.
  2. The hardware loads the address that caused the fault into a special control register called CR2.
  3. The CPU jumps to the OS’s “Page Fault Handler” (a specific interrupt service routine defined by the OS).

Now the OS takes over. It looks at CR2 to see which address failed, and checks its internal records (like the VAD in Windows) to decide the fate of the process:

  • The Good (Swapping/Demand Paging): “Oh, this address is valid, but I stored the data on the disk to save space.” The OS loads the data from the Page File into RAM, updates the Page Table to mark it “Present,” and tells the CPU to retry the instruction. The process never knows it happened.
  • The Bad (Access Violation): “This process never allocated memory at this address.” The OS terminates the process (Segmentation Fault or Crash).
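
The handler’s decision tree can be sketched in Python. Everything here is hypothetical: the `vads`, `swap`, and `page_table` dicts stand in for the OS’s real bookkeeping structures.

```python
# `vads` maps valid page numbers to backing info, `swap` holds pages evicted
# to disk, `page_table` maps page number -> raw entry. All made up for the sketch.
def handle_page_fault(cr2, vads, swap, page_table):
    page = cr2 >> 12                  # which page faulted
    if page not in vads:
        return "SEGFAULT"             # The Bad: access violation
    if page in swap:
        frame = swap.pop(page)        # The Good: demand paging, bring it back
        page_table[page] = frame | 1  # mark it Present (bit 0)
        return "RETRY"                # CPU re-runs the faulting instruction
    page_table[page] = 1              # valid but never touched: fresh zeroed frame
    return "RETRY"

page_table = {}
print(handle_page_fault(0x5000, {5: "anon"}, {5: 0x9000 << 12}, page_table))  # RETRY
print(handle_page_fault(0xDEAD000, {}, {}, {}))                               # SEGFAULT
```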

2.1.2 MMU Lookup & VA to PA

  1. Translating VA from MOV EAX, [0x12345678]
    • Virtual Address: 0x12345678
    • CR3 : 0x0000_1000 (Physical base of Directory)
      • On context switch the CR3 is loaded with the address of the base directory
        • In Windows the address is located in _EPROCESS→ _KPROCESS→DTB
        • In Linux the address is located in task_struct→mm→pgd
  2. Step 1: Extract the Indices (Binary Splitting)
    • Hex 0x12345678 → Binary 0001 0010 0011 0100 0101 0110 0111 1000
    • PD Index (Top 10 bits): 0001001000 = Index 72
    • PT Index (Middle 10 bits): 1101000101 = Index 837
    • Offset (Low 12 bits): 011001111000 = 0x678
  3. Step 2: The Directory Lookup
    • CPU goes to CR3 + (72 * 4 bytes).
    • Reads the PDE. Let’s say it points to PFN 0x3000.
    • After fetching the PDE, the CPU runs the flag checks described above before continuing.
  4. Step 3: The Table Lookup
    • CPU goes to (0x3000 << 12) + (837 * 4 bytes).
    • Reads the PTE. Let’s say it points to PFN 0x9000.
  5. Step 4: The Final Calculation
    • Physical Address = (Frame 0x9000 << 12) + 0x678
    • Result: 0x9000678
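
The whole walk can be replayed in Python; `memory` below is a stand-in dict holding just the two entries from the example:

```python
def translate(va, cr3, memory):
    """Walk the classic 32-bit two-level tables held in `memory`
    (a dict of physical address -> raw 32-bit entry)."""
    pd_index = (va >> 22) & 0x3FF   # top 10 bits
    pt_index = (va >> 12) & 0x3FF   # middle 10 bits
    offset   =  va        & 0xFFF   # low 12 bits

    pde = memory[cr3 + pd_index * 4]
    assert pde & 1, "PDE not present -> #PF"
    pt_base = (pde >> 12) << 12     # PFN -> physical base of page table

    pte = memory[pt_base + pt_index * 4]
    assert pte & 1, "PTE not present -> #PF"
    frame = (pte >> 12) << 12

    return frame + offset

# Reproduce the worked example: CR3 = 0x1000, PDE -> PFN 0x3000, PTE -> PFN 0x9000
memory = {
    0x1000 + 72 * 4:          (0x3000 << 12) | 1,  # PDE at index 72, Present set
    (0x3000 << 12) + 837 * 4: (0x9000 << 12) | 1,  # PTE at index 837, Present set
}
print(hex(translate(0x12345678, 0x1000, memory)))  # 0x9000678
```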

2.2 The translation lookaside buffer (TLB)

Walking the tables requires multiple memory accesses per translation; the more levels or tables, the more memory accesses are required. If the CPU had to perform this for every single instruction, it would be extremely slow. Modern performance relies on a piece of hardware known as the Translation Lookaside Buffer (TLB).

The optimization it brings:

  1. TLB Lookup: When you access a Virtual Address, the MMU first checks the TLB.
  2. TLB Hit: If the translation is found, the physical address is returned instantly. No RAM access is needed.
  3. TLB Miss: If not found, the MMU performs the slow “Page Table Walk” we described above. Once the physical frame is found, the result is stored in the TLB for next time.
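
A toy model of the hit/miss logic (a real TLB is a small, fixed-size associative cache in hardware, not a dict):

```python
class TLB:
    """Tiny model: cache page -> frame translations, fall back to a slow walk."""
    def __init__(self, walk):
        self.entries = {}
        self.walk = walk           # the slow page-table walk function
        self.hits = self.misses = 0

    def translate(self, va):
        page, offset = va >> 12, va & 0xFFF
        if page in self.entries:   # TLB hit: no RAM access needed
            self.hits += 1
        else:                      # TLB miss: walk the tables, cache the result
            self.misses += 1
            self.entries[page] = self.walk(page)
        return (self.entries[page] << 12) | offset

tlb = TLB(walk=lambda page: page + 0x100)  # stand-in for a real walk
tlb.translate(0x5123)
tlb.translate(0x5FFF)       # same page as before: hit
print(tlb.hits, tlb.misses)  # 1 1
```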

To every upside there may be a downside

Unfortunately, this speed comes with a catch. The TLB contains translations specific to the current process. When the OS performs a Context Switch (switching from Process A to Process B), it updates the CR3 register. Since Process B has completely different page tables, the current TLB entries are now wrong. To address this, the CPU must flush (empty) the entire TLB on every context switch, which slows processes down because every translation must be redone.

VPIDs and PCIDs to save us

Modern CPUs have solved this problem with identifiers known as

  • VPID — Virtual processor ID
  • PCID — Process context ID

Instead of flushing the entire TLB down the drain on every switch, the hardware tags each TLB entry with a specific ID (the PCID).

  • Without PCID: “This virtual address maps to this physical frame.” (Must flush on switch).
  • With PCID: “This virtual address for Process A maps to this physical frame.”

When the OS switches from Process A to Process B, it tells the CPU the new PCID. The CPU then simply ignores the TLB entries tagged “Process A” but keeps them in the cache. If the OS switches back to Process A shortly after, its translations are still there and valid! This massive optimization drastically reduces the cost of context switching.
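
Sketching the tagging idea: key each cached translation by (PCID, page), so a switch just changes the active tag. (Real hardware also bounds the number of live PCIDs; this toy model does not.)

```python
class TaggedTLB:
    """Toy PCID-tagged TLB: entries keyed by (pcid, page), no flush on switch."""
    def __init__(self):
        self.entries = {}
        self.pcid = 0

    def switch(self, pcid):
        self.pcid = pcid  # context switch: no flush needed

    def insert(self, page, frame):
        self.entries[(self.pcid, page)] = frame

    def lookup(self, page):
        return self.entries.get((self.pcid, page))  # None = TLB miss

tlb = TaggedTLB()
tlb.switch(1); tlb.insert(0x10, 0xAAA)  # Process A caches a translation
tlb.switch(2)                           # Process B: A's entries survive, but...
print(tlb.lookup(0x10))                 # None (miss, never a wrong translation)
tlb.switch(1)                           # back to A: still cached
print(hex(tlb.lookup(0x10)))            # 0xaaa
```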

3. Expanding the map

By the late 1990s, the “Classic” 32-bit paging model hit a hard ceiling. With a 32-bit address bus, the CPU could physically address a maximum of 4 GB of RAM (2^32). For database servers and enterprise mainframes, this was no longer enough.

3.1. Physical address extension (PAE) - 3 Level

Intel needed a way to jam more RAM into a 32-bit system without rewriting the entire architecture. Their solution was PAE.

The 36-bit Hack — The concept was simple: Keep the virtual addresses 32-bit (so software doesn’t break), but expand the physical address bus to 36 bits.

  • Result: 2^36 = 64 GB of addressable physical RAM.
  • The “Window” Effect : A single process could still only see 4 GB of virtual space at a time, but the OS could now juggle many heavy processes across a massive 64 GB physical playground.

The “Window” Concept

Think of the 64 GB of physical RAM as a large landscape, and the 32-bit address as a window: looking through this window (2^32), the field of view is partial; it can only expose 4 GB of the RAM at a time.

To address this, the OS moves the ‘window’ by changing the page directory entries to point to different, non-overlapping 4 GB chunks of the larger 64 GB of physical RAM. For example:

  • Process A thinks it owns memory 0x1000.
  • Process B thinks it owns memory 0x1000.
  • The MMU (Memory Management Unit) maps Process A’s 0x1000 to Physical Address 0x1_0000_1000 (above 4GB) and Process B’s 0x1000 to Physical Address 0x2_0000_1000 (way above 4GB).

The only caveat of this method is the CR3 register: it still holds a 32-bit physical address, so the top-level table (the PDPT) must reside within the first 4 GB of physical memory. The lower-level tables can live anywhere, since their addresses are stored in table entries that can now reference 36 bits of physical space.

virtual address breakdown

To make this possible, Intel introduced a new table named the Page directory pointer table, which consists of just 4 entries. The virtual address interpretation changed accordingly while remaining 32 bits wide. With this change, the total entries available in the PD and PT dropped from 1024 to 512, because the entry size grew from 4 bytes to 8 bytes.

32-bit virtual address

| Bits | Size | Total entries | Entry size (bytes) | Addressable space | Name | Description |
| --- | --- | --- | --- | --- | --- | --- |
| 30-31 | 2 | 4 | 8 | 4 * 1 GB = 4 GB | PDPT | Index of page directory pointer table entry |
| 21-29 | 9 | 512 | 8 | 512 * 2 MB = 1 GB | PD | Index of page directory entry |
| 12-20 | 9 | 512 | 8 | 512 * 4 KB = 2 MB | PT | Index of page table entry |
| 0-11 | 12 | 4096 (bytes) | 1 | - | Offset | Address position from frame base address |
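
The 2/9/9/12 split can be sanity-checked with a few shifts and masks:

```python
def split_pae(va):
    """Split a 32-bit virtual address under PAE: 2 + 9 + 9 + 12 bits."""
    return {
        "pdpt":   (va >> 30) & 0x3,    # top 2 bits
        "pd":     (va >> 21) & 0x1FF,  # next 9 bits
        "pt":     (va >> 12) & 0x1FF,  # next 9 bits
        "offset":  va        & 0xFFF,  # low 12 bits
    }

print(split_pae(0x12345678))
# {'pdpt': 0, 'pd': 145, 'pt': 325, 'offset': 1656}
```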

Page tables hierarchy with PAE

With PAE, a new page table has been added to the top of the hierarchy, making it three levels instead of two :

  • Page directory pointer table (PDPT) — A table that contains 4 8-byte entries; each one points to a page directory.
  • Page directory (PD) — A table that contains 512 8-byte entries; each entry is a pointer to a page table.
  • Page table (PT) — Another table consisting of 512 8-byte entries; each entry contains the number of a 4 KB physical frame along with some flags.

Anatomy of 8-byte PTE

The new table alone was not enough: a few extra bits were needed in the PFN to address more physical RAM. Because of that, entries were expanded to 8 bytes, leaving plenty of room for the larger frame number as well as other flags. The following figure depicts the expanded PTE:

In addition to the expanded PFN, the NX bit was introduced, adding another security check:

  • Bit 63 tells the CPU whether this page (or directory of pages) is executable, to prevent attacks that rely on executing data from writable sections.
| Bits | Size | Name | Description |
| --- | --- | --- | --- |
| 0 | 1 | Present | 1 = Page is in RAM. 0 = Page Not Present (Triggers #PF Page Fault if accessed). |
| 1 | 1 | R/W | 0 = Read Only. 1 = Read/Write. (Note: the WP bit in CR0 determines if Supervisor can write to Read-Only pages.) |
| 2 | 1 | U/S | 0 = Supervisor (Kernel) Only. 1 = User Mode Allowed. (This acts as the primary gate.) |
| 5 | 1 | Accessed | 1 = CPU sets this when software reads or writes to the page. Used by the OS for “Least Recently Used” (LRU) swapping algorithms. |
| 6 | 1 | Dirty | 1 = CPU sets this when software writes to the page. Used by the OS to know if a page needs to be saved to disk before swapping. |
| 12-35 | 24 | PFN | The upper bits of the Physical Address of the 4KB page. |
| 36-62 | 27 | Reserved | - |
| 63 | 1 | NX/XD | 1 = Instruction fetches are not allowed from this page. 0 = Execution allowed. (Acts as a veto.) |

Note that on modern systems and CPUs it’s no longer 36 bits: PAE can now address up to 40 bits of physical memory. It’s the same concept; the frame bits were simply expanded into the formerly reserved bits.

3.2 Page map level 4 (PML4) — Long mode

When x64 arrived, we needed to map a lot more memory. So a 4th table was added: the PML4. Mechanically, it works exactly like the tables below it. It adds 9 more bits of addressing, allowing a massive 256 TB of virtual address space.

64-bit virtual address

| Bits | Size | Total entries | Entry size (bytes) | Addressable space | Name | Description |
| --- | --- | --- | --- | --- | --- | --- |
| 39-47 | 9 | 512 | 8 | 512 * 512 GB = 256 TB | PML4 | Index of page map level 4 entry |
| 30-38 | 9 | 512 | 8 | 512 * 1 GB = 512 GB | PDPT | Index of page directory pointer table entry |
| 21-29 | 9 | 512 | 8 | 512 * 2 MB = 1 GB | PD | Index of page directory entry |
| 12-20 | 9 | 512 | 8 | 512 * 4 KB = 2 MB | PT | Index of page table entry |
| 0-11 | 12 | 4096 (bytes) | 1 | - | Offset | Address position from frame base address |
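
The 9/9/9/9/12 split, as a quick sketch:

```python
def split_x64(va):
    """Split a 48-bit virtual address in long mode: 9 + 9 + 9 + 9 + 12 bits."""
    return {
        "pml4":   (va >> 39) & 0x1FF,
        "pdpt":   (va >> 30) & 0x1FF,
        "pd":     (va >> 21) & 0x1FF,
        "pt":     (va >> 12) & 0x1FF,
        "offset":  va        & 0xFFF,
    }

# The highest user-space-ish address region:
print(split_x64(0x00007FFF12345678))
# {'pml4': 255, 'pdpt': 508, 'pd': 145, 'pt': 325, 'offset': 1656}
```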

Page table hierarchy in IA-32e (long mode)

The change to the virtual address was straightforward: bits 39-47 are used as an index into the PML4 table. Another change you may have noticed is that the PDPT grew to 512 possible entries.

PML4 Page table in 64-bit systems

Page table entries in x64

| Bit(s) | Size | Abbreviation | Name | Description |
| --- | --- | --- | --- | --- |
| 0 | 1 | P | Present | 1 = Page is in RAM. 0 = Page Not Present (Triggers #PF Page Fault if accessed). |
| 1 | 1 | R/W | Read / Write | 0 = Read Only. 1 = Read/Write. (Note: the WP bit in CR0 determines if Supervisor can write to Read-Only pages.) |
| 2 | 1 | U/S | User / Supervisor | 0 = Supervisor (Kernel) Only. 1 = User Mode Allowed. (This acts as the primary gate.) |
| 3 | 1 | PWT | Page Write Through | 1 = Write-through caching policy. |
| 4 | 1 | PCD | Page Cache Disable | 1 = The page is not cached. |
| 5 | 1 | A | Accessed | 1 = CPU sets this when software reads or writes to the page. Used by the OS for “Least Recently Used” (LRU) swapping algorithms. |
| 6 | 1 | D | Dirty | 1 = CPU sets this when software writes to the page. Used by the OS to know if a page needs to be saved to disk before swapping. |
| 7 | 1 | PAT | Page Attribute Table | In a PTE, indirectly selects the memory type (Cacheable, Uncacheable, Write-Combining) for this page. |
| 8 | 1 | G | Global | 1 = Prevents the TLB entry from being flushed when CR3 is reloaded (context switch). Critical for Kernel shared memory. |
| 9-11 | 3 | Avail | Available | Ignored by the CPU. Available for OS use. |
| 12-51 | 40 | PFN | Physical Frame Number | The upper bits of the Physical Address of the 4KB page. |
| 52-58 | 7 | Avail | Available | Ignored by the CPU. Available for OS use (often used for software markers like “Swapped Out”). |
| 59-62 | 4 | PK | Protection Key | (If enabled) used for Protection Keys for User-mode pages (PKU). Otherwise ignored/available. |
| 63 | 1 | XD / NX | Execute Disable | 1 = Instruction fetches are not allowed from this page. 0 = Execution allowed. (Acts as a veto.) |

The canonical addresses

You may wonder why the virtual address stops at the 47th bit; that is because of what is called a canonical address. Canonical addresses are a subset of the virtual addresses: a 64-bit processor can theoretically address up to 2^64 - 1 bytes (16 exabytes, not something you would ever need for day-to-day work), but a full 64-bit address bus is expensive to implement, so modern hardware typically implements only 48 bits, and more recently up to 57 bits, of the address.

Since the registers are still 64 bits wide, the CPU avoids discrepancies by enforcing one rule:

  • The unused upper bits must match the most significant implemented bit.

In the 48-bit implementation (the same concept applies to the 57-bit limit):

  • User space ranges from 0x0000000000000000 to 0x00007FFFFFFFFFFF
  • Kernel space ranges from 0xFFFF800000000000 to 0xFFFFFFFFFFFFFFFF

Sounds confusing, right? Let’s look at it from a binary perspective, which will make more sense.

As shown in the diagram, bits 0-47 define the address. Bits 48-63 must be identical to bit 47 (sign extension). If bit 47 is 0, all upper bits must be 0 (user space). If bit 47 is 1, all upper bits must be 1 (kernel space). Because of this, the hexadecimal representation of this example is 0x00007FFFFFFFFFFF.

So the next time you see an address starting with 0xFFFF you would know it’s a kernel space address, and if the address starts with 0x0000 you know it’s a user space address.
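
The sign-extension rule is easy to express in code; this checker assumes 48 implemented bits by default:

```python
def is_canonical(va, bits=48):
    """With `bits` implemented bits, the upper bits must equal bit (bits - 1)."""
    top = va >> (bits - 1)  # the sign bit plus everything above it
    # Either all those bits are 0 (user) or all are 1 (kernel)
    return top == 0 or top == (1 << (65 - bits)) - 1

print(is_canonical(0x00007FFFFFFFFFFF))  # True  (user space, bit 47 = 0)
print(is_canonical(0xFFFF800000000000))  # True  (kernel space, bit 47 = 1)
print(is_canonical(0x0000800000000000))  # False (bit 47 = 1 but upper bits 0)
```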

4. The benefits

4.1 Virtually contiguous allocations

One of the main benefits of paging in virtual memory is allocating “virtually” contiguous pages from a developer’s perspective. As depicted in the following diagram, each process has its own virtual address space, broken down into pages. From the software’s perspective these pages are contiguous in the address space, but in reality they are not; the operating system maps the virtual pages to physical frames that are not necessarily contiguous.

That doesn’t mean every page follows the previous one. For example, if you request a 4 KB block and then another 4 KB block, the two will not necessarily be contiguous. But if you request an 8 KB block in a single allocation, the OS must make its virtual range contiguous, since it could be used to hold structures or arrays that span the whole block.

4.2 Address space & isolation

An Address Space is simply a process’s private view of memory.

Think of it like a notebook.

  • Process A has a notebook. It writes “Password” on Page 1.
  • Process B has its own notebook. It writes “Hello” on Page 1.

Even though they both used “Page 1” (the same Virtual Address), they are writing in completely different physical notebooks. The OS ensures they never touch each other’s pages.

The Layout: Inside this private notebook, the OS organizes data into specific sections (Memory Areas) so the process knows where to find things:

  • Code (Text): Where the actual instructions live.
  • Stack: Where local variables go (grows automatically).
  • Heap: Where you manually allocate memory (like malloc).

Key Takeaway: To the process, its memory looks like one continuous, empty playground starting from 0x0000. In reality, the OS is just picking random empty Physical Frames to hold that data.

As shown in the diagram below, each process is entitled to a virtual address space of its own. A process cannot access another process’s memory directly, since its page tables map only its own pages; any memory access therefore either hits a mapped address or triggers an access violation.

4.3 Page sharing & Deduplication

Another notable benefit is the ability to share pages among running processes. Instead of loading the same library/DLL into each process, the operating system maps pages from two separate address spaces to the same frame. This lets libraries be shared among processes and reduces the duplication of code in memory. A few requirements must be met to share these pages safely without exposing processes to security risks.

Some of the security requirements to ensure that these pages can be shared safely are :

  • The code must be re-entrant, meaning it does not rely on any global or static variables. If there are shared resources, they must be synchronized using mechanisms such as mutexes and semaphores, and the code must be designed to be safely interrupted and resumed to avoid race-condition vulnerabilities.
  • Additionally, the operating system needs to set proper permissions to prevent one process from modifying these pages, which could otherwise lead to malicious or corrupted code executing inside other processes. A mechanism such as copy-on-write (COW) can be used to achieve this, which we will discuss further in detail later.

As depicted by the following diagram, page sharing is possible by linking the virtual page in both processes X and Y to the same physical frame, thus sharing the same portion of code and reducing the space needed to use a single library, hence the name shared libraries.

4.4 Resource Maximization

Another benefit is page swapping. In simple terms, the operating system stores the contents of a page on disk (swap out) whenever it runs out of physical memory; this process is known as swapping. Whenever the page is needed again, it is swapped back into memory (swap in).

In the following diagram, page C was initialized, and the memory was full; therefore, the operating system “swapped out” page B and stored it in the secondary storage.

Later, a process wanted to access page B. As it was no longer in memory, an exception occurred (a page fault), and the operating system realized it needed to swap the page back from storage into memory.

5. Conclusion

The big takeaway is that contiguity is only virtual. When your program allocates memory, the OS gives it a continuous range of virtual addresses. But in reality, the OS is scattering your data across random physical frames.

Here is a question that will break your brain. Remember the ‘no bypass’ rule, where the CPU can only speak virtual addresses once paging is turned on? Well, if the CPU needs page tables to translate addresses, but page tables live in memory, how does the OS modify a page table without already having a virtual address for it? And if that page table has a virtual address, wouldn’t that virtual address require a page table to translate it? Do you see the paradox?

References