Serious Memory Series : Part 0
Memory primitives
Primary memory is one of a computer’s core components. The objective of this series is to explain memory primitives and techniques for security topics; therefore, the spotlight will be on core concepts rather than on implementation details.
Table of contents
This part of the series reviews the relationship between microprocessors and memory, to get an idea of how microprocessors execute our instructions and of the techniques available to best utilize both microprocessors and memory.
Familiarity with the following topics will be helpful:
- x86 Assembly (Assembly in general)
- C Programming Basics
- Basic data structures and algorithms
1. What is memory?
Memory is a volatile storage unit made of millions of cells; each cell represents a computational bit (0/1). The CPU uses main memory to fetch program instructions and data for execution.
Two technologies are commonly used to manufacture memory units: static and dynamic random access memory (SRAM and DRAM). The key differences are speed and storage size: static RAM is much faster than its dynamic rival, but it is much more expensive and has very limited capacity. Because of that size limitation, SRAM is not used for main memory; instead, main memory utilizes DRAM. Below is a memory hierarchy organized from fastest to slowest.
2. CPU and Memory
Every program consists of instructions; an instruction is a basic command executed by the CPU, and grouping instructions forms functions. Programmers create these functions in a high-level language, and a compiler translates them into machine language. After compilation, the operating system loads the program into main memory; from then on, the operating system makes sure the program gets its turn to execute.
Once the program is running, the central processing unit (CPU) is responsible for executing its instructions. Since the instructions reside in main memory, the CPU needs fast access to memory to execute them; as a result, memory speed and the number of memory requests affect the overall performance.
2.1 CPU Registers
Registers are high-speed storage units whose objective is to make the operands required by the CPU instantly available; as previously explained, the faster the CPU can access its operands, the better it performs. There are different types of registers: some are user-accessible, while others are internal, accessible only to the CPU to perform its operations.
General-purpose registers (GPRs) - these registers store temporary data that the CPU operates on for faster execution; they are accessible to user programs.
Control registers (CR0 - CR4, plus CR8 on x86_64) - registers used to control CPU behavior, such as paging and addressing modes; they are accessible only to the kernel. (x86/x86_64)
Internal registers - registers used by the CPU to perform its operations, only the CPU has access to these registers.
- Program counter (PC) - Holds the address of the next instruction to be executed.
- Memory address register (MAR) - Holds memory address to read from or to write to.
- Memory buffer register (MBR) - Holds data read from memory or waiting to be written.
- Instruction register (IR) - Holds the opcode and operand addressing bits.
2.2 Instruction cycle
The CPU receives its instructions as binary machine code; each instruction encodes an operation code (opcode) together with its operands’ addressing information. To carry out an instruction’s intended function, the CPU goes through a sequence referred to as the instruction cycle.
- Fetching - Use the PC register to retrieve the next instruction into the IR.
- Decoding - Identify the function and its operands’ addressing from the opcode stored in IR.
- Execute - Send signals to the appropriate execution units to perform the instruction’s function.
- Write - Write back the operation results.
2.3 Performance gap
Over the years, microprocessors improved significantly as predicted by Moore’s law, while memory speeds improved far more slowly, creating a huge performance gap between CPU and memory.
How does the gap affect the overall performance?
As previously explained, microprocessors heavily rely on memory for storing and loading data and instructions; for that reason, the performance gap decreases the system’s throughput. Picture a racing car on a highway: if there are steep speed bumps all along the highway (analogous to the number of memory requests and their waiting time), the car’s potential is wasted. Utilizing the idle time and reducing the number of requests made to main memory are the objectives for bridging the gap.
3. Bridging the gap
There are several ways to bridge the gap. In this section we will review how caching, computer architecture design, and processing pipelines improve overall performance and help us bridge it.
3.1 Caching
A cache is a high-speed memory chip; it utilizes the power of static random access memory and therefore has very limited capacity. Caches hold a copy of data and/or instructions that a microprocessor frequently needs access to, allowing the reuse of previously retrieved or computed data. This reduces the number of requests made to main memory and therefore enhances overall performance. There are different levels of caches, as displayed in the memory hierarchy figure.
L1 caches are placed on the microprocessor chip itself, removing the time needed to cross chip boundaries; this results in higher performance and fewer clock cycles wasted retrieving instructions or data.
L2 and L3 caches are placed farther from the execution units (traditionally outside the microprocessor chip); they are progressively slower than the L1 cache but offer larger storage capacities, while still being faster than main memory.
3.2 Architecture design
Computer architecture design is another aspect that needs to be considered, as it significantly affects performance. The figure below represents a design named Modified Harvard, also known as Almost Von Neumann; it is a mixture of both the Von Neumann and Harvard architectures. We will quickly go through the pros and cons of both architectures to see how the hybrid improves performance.
3.2.1 Von-Neumann
This computer architecture design specifies a set of requirements to be fulfilled; we will mainly focus on the memory component. In this design, a single memory stores both the data and the instructions required by the CPU to execute, which brings its own pros and cons.
Pros
- Instructions are treated as data, making self-modifying code and Just-in-time compilers possible.
- Data space can be used to store instructions; there is no fixed division ratio between the two.
Cons
- Shared bus: instruction fetching and data operations cannot occur at the same time, a limitation known as the Von Neumann bottleneck.
3.2.2 Harvard
Unlike Von Neumann, this computer architecture design requires separate memory storage for data and instructions.
Pros
- Separate buses, therefore instruction fetching and data operations can occur simultaneously, resulting in a higher performance.
Cons
- Instructions cannot be treated as data, therefore self-modifying code and Just-in-time compilers are not possible.
- Data and instruction memories each have a fixed maximum capacity; one cannot borrow space from the other.
3.2.3 Modified Harvard (Almost Von-Neumann)
This architecture implements a hybrid memory structure, a mixture of both Von Neumann and Harvard. At the L1 cache level, it uses two separate units for data and instructions, gaining the simultaneity of data and instruction operations from the Harvard model and thus higher performance; for the levels below, it uses the mixed data-and-instructions model from Von Neumann, gaining its benefits and flexibility. Below is a figure that shows a simple implementation of the modified Harvard architecture.
3.3 Instruction cycle optimization
As previously seen in the CPU execution section, each instruction has to go through the instruction cycle. Waiting for each instruction to finish the full cycle before introducing another one is not optimal: it leaves some stages idle, waiting for the next instruction. Below is a diagram showing two instructions (A, B) executing one after the other.
Calculating how many time units four instructions would need to execute individually, one after another:
\[\begin{gather} n = \textrm{Number of instructions} \newline k = \textrm{Number of stages} \newline \tau = \textrm{Instruction cycle time} \end{gather}\]

\[\begin{gather} T = n \times k \times \tau \newline T = 4 \times 4 \times \tau = 16\tau \end{gather}\]

3.3.2 Pipelines
Pipelines decouple the instruction cycle into separate stages; instructions can therefore overlap, each entering a stage as soon as the prior instruction has left it. This is analogous to a car manufacturing line: the process is decoupled into stages, and once a car completes a stage, another car occupies it, keeping all stages busy all the time.
Below figure demonstrates a generic four stage pipeline with four instructions (A,B,C,D).
Pipeline time units calculation (no branching)
\[\begin{gather} T_k = [\ k + (n - 1)\ ]\ \tau \newline T_4 = [\ 4 + (4 - 1)\ ]\ \tau = 7\tau \end{gather}\]

Notice the difference between a microprocessor with a pipeline and one without: the pipelined microprocessor took nine time units fewer (7τ versus 16τ) to execute the same number of instructions. In practice, other factors such as hazards and pipeline bubbles reduce this gain.
4. Main memory
Now that we have a general idea of the relationship between the CPU and memory, we can explore memory with more confidence. First we will be introduced to the physical and logical separation of memory and its benefits; we will dig deeper in the following parts.
4.1 Physical memory
Physical memory is the DRAM chip placed inside the computer: a simple array of bytes that the CPU can directly access. Physical memory is invisible to programmers, as virtual memory represents it. It is logically divided into chunks referred to as page frames; virtual memory is divided into corresponding chunks referred to as virtual pages.
4.2 Logical/Virtual memory
Virtual memory is a representation of physical memory that makes it easier for a programmer to use memory without worrying about the details handled by the operating system. In modern operating systems, each process has its own space referred to as an address space; these spaces are divided into chunks referred to as virtual pages, which kernel data structures map to page frames in physical memory. Benefits include:
- Virtually contiguous allocations, as it does not matter where the frame is located as long as there is a mapping that tells us where this virtual page resides in physical memory.
- Separate space for each program, increasing security and isolation; a malicious process cannot access another program’s memory.
- Reduces fragmentation, since we can reuse physical frames wherever they are and map them to the virtual pages without having to worry about contiguity or data’s actual location.
- Maximization of memory usage by moving unused virtual pages to the hard disk to allow another process to execute, and restoring these pages when needed.
4.3 Memory management unit (MMU)
Program instructions use virtual addresses, but when it comes to fetching or storing data in memory, the CPU needs the actual physical address. The MMU exists for that reason: it is responsible for translating a virtual address into a physical address. One of its key benefits is preventing programs from accessing another address space; during the address translation, an access to an address the process has no permission for, or to a page not currently mapped into memory, raises an error referred to as a page fault.
4.3.1 Translation look-aside buffer (TLB)
Translation is quite a tedious process and creates overhead on every memory access. To minimize the overhead, a buffer referred to as the translation look-aside buffer (TLB) resides in the cache and is checked before a translation occurs. If the translation was previously done and is recent enough, it will be found in the buffer, resulting in a TLB hit; otherwise it is a TLB miss, and the translation process starts.
5. References
https://www.robots.ox.ac.uk/~dwm/Courses/2CO_2014/2CO-N2.pdf
https://en.wikibooks.org/wiki/A-level_Computing/AQA/
http://www.cs.uwm.edu/classes/cs315/Bacon/Lecture/HTML/ch05s06.html
https://en.wikipedia.org/wiki/Harvard_architecture
https://en.wikipedia.org/wiki/Von_Neumann_architecture
https://en.wikipedia.org/wiki/Modified_Harvard_architecture
http://ithare.com/modified-harvard-architecture-clarifying-confusion/#rabbitref-WikiModifiedHarvard
http://gec.di.uminho.pt/discip/minf/ac0102/1000gap_proc-mem_speed.pdf
http://users.utcluj.ro/~baruch/book_ssce/SSCE-Intel-Pipeline.pdf
https://en.wikipedia.org/wiki/Instruction_pipelining
https://en.wikipedia.org/wiki/Bubble_(computing)
https://www.javatpoint.com/os-translation-look-aside-buffer