Rui's Blog

Lecture 4: The memory hierarchy. Caches.

Lecture Summary

  • Execution times
  • Memory related issues
  • The memory hierarchy
  • Caches

Execution Times - Nomenclature

  • Wall Clock Time: Amount of time from the beginning to the end of a program
  • CPU Execution Time: Amount of time the CPU spends on your program; usually read with a profiling or timing tool
    • User Time: Time spent processing instructions compiled out of code generated by the user or in libraries that are directly called by user code
    • System Time: Time spent in support of the user’s program but in instructions that were not generated out of code written by the user (e.g., OS support for opening/reading a file, throwing an exception, etc.)
  • Clock cycle: The length of one period of the processor clock (e.g., a 1 GHz processor has a clock cycle of 1 nanosecond)
  • The CPU Performance Equation: CPU Execution Time = Instruction Count * Clock Cycles per Instruction (CPI) * Clock Cycle Time = Instruction Count * CPI / Clock Rate
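To see the difference between wall clock, user, and system time on a real run, here is a minimal sketch that assumes a POSIX system. It reads the wall clock with clock_gettime and the per-process user/system CPU times with getrusage. The helper names wall_seconds and cpu_seconds are my own, not from the lecture.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sys/resource.h>
#include <sys/time.h>
#include <time.h>

/* Wall clock time in seconds, read from a monotonic clock (POSIX). */
double wall_seconds(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc;
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* User and system CPU time used so far by this process, in seconds. */
void cpu_seconds(double *user, double *sys)
{
    struct rusage ru;
    int rc = getrusage(RUSAGE_SELF, &ru);
    assert(rc == 0);
    (void)rc;
    *user = (double)ru.ru_utime.tv_sec + (double)ru.ru_utime.tv_usec * 1e-6;
    *sys  = (double)ru.ru_stime.tv_sec + (double)ru.ru_stime.tv_usec * 1e-6;
}
```

Call both helpers before and after the region of interest and subtract. If the wall clock difference is much larger than user + system, the program was probably waiting (I/O, sleeping, or sharing the CPU with other processes).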
The SPEC CPU benchmark. CPI < 1: Multiple-issue is in play (more than one instruction completes per clock cycle). For the combinatorial optimization benchmark, there are probably a lot of pipeline stalls, which push the CPI up
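A quick way to get a feel for the CPU performance equation is to plug in numbers. The sketch below is just the formula written as a C function; the name cpu_time_seconds and the sample values are made up for illustration.

```c
#include <assert.h>

/* CPU Execution Time = Instruction Count * CPI / Clock Rate.
 * clock_rate_hz is in cycles per second, so the result is in seconds. */
double cpu_time_seconds(double instruction_count, double cpi,
                        double clock_rate_hz)
{
    assert(clock_rate_hz > 0.0);
    return instruction_count * cpi / clock_rate_hz;
}
```

For example, cpu_time_seconds(2e9, 1.5, 1e9) models 2 billion instructions at a CPI of 1.5 on a 1 GHz clock, which works out to 3 seconds. Dropping the CPI to 0.5 (multiple-issue) on the same clock gives 1 second.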

Memory & Cache

  • SRAM (Static Random Access Memory): Expensive but fast (short access time), bulky, transistor hog, needs no refresh
  • DRAM (Dynamic Random Access Memory): Cheap but slow, information stored as a charge in a capacitor, higher capacity per unit area, needs refresh every 10-100 ms, sensitive to disturbances
The memory hierarchy (the pyramid of tradeoffs):
  • A dedicated hardware asset called MMU (Memory Management Unit) is used to manage the hierarchy
  • Tradeoff:
    • DRAM off-chip: Main memory
    • SRAM on-chip: Cache
      • Caches form a hierarchy of their own: L1, L2, L3. L1 is faster and smaller than L2 and L3.
      • Different types of caches
        • Data caches: Feed the processor with the data manipulated during execution
        • Instruction caches: Store instructions
      • The ratio between cache size & main memory size: ~1:1000
Caches work because of the principle of locality: Programs tend to use data and instructions with addresses near or equal to those they have used recently.
  • Temporal locality: Recently referenced items are likely to be referenced again in the near future
    • Data references: For example, in the code snippet below, the variable sum gets referenced at each iteration
    • Instruction references: The loop is cycled through repeatedly
  • Spatial locality: Items with nearby addresses tend to come into use together
    • Data references: The elements in the array abc are accessed in succession (stride-1 reference pattern)
    • Instruction references: The instructions are referenced in sequence
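/* sum_array is an illustrative name; the notes show only the body */
int sum_array(const int *abc, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += abc[i];  /* stride-1 access; sum is reused every iteration */
    return sum;
}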
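To make the stride-1 pattern concrete, here is a small sketch that sums the same array with a configurable stride. Every element is still visited exactly once, so the result does not change; only the order of the memory accesses does. The function name sum_stride and its stride parameter are my own additions, not from the lecture.

```c
#include <assert.h>

/* Sum all n elements of abc, visiting them with the given stride.
 * stride == 1 walks consecutive addresses (good spatial locality);
 * a larger stride jumps through memory and tends to touch a new
 * cache line on almost every access. */
long sum_stride(const int *abc, int n, int stride)
{
    assert(stride > 0);
    long sum = 0;
    for (int start = 0; start < stride; start++)
        for (int i = start; i < n; i += stride)
            sum += abc[i];
    return sum;
}
```

Timing this on a large array with stride 1 versus a stride of a few hundred elements is a simple experiment for seeing spatial locality at work; the exact numbers depend on the machine.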

Case study: Adding the entries in an N-dimensional matrix (not covered in class)
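The case study itself is not reproduced in these notes. As a rough stand-in, here is a sketch that assumes the simplest case, a two-dimensional matrix stored row by row in one contiguous C array. Both functions compute the same sum and differ only in loop order. The names sum_row_major and sum_col_major are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Row-by-row traversal: the inner loop walks consecutive addresses
 * (stride-1), which matches how C lays out the array. */
long sum_row_major(const int *a, size_t rows, size_t cols)
{
    assert(a != NULL);
    long sum = 0;
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            sum += a[i * cols + j];
    return sum;
}

/* Column-by-column traversal: the inner loop jumps cols elements
 * on every step (stride-cols), so spatial locality is poor. */
long sum_col_major(const int *a, size_t rows, size_t cols)
{
    assert(a != NULL);
    long sum = 0;
    for (size_t j = 0; j < cols; j++)
        for (size_t i = 0; i < rows; i++)
            sum += a[i * cols + j];
    return sum;
}
```

For matrices that are much larger than the cache, the row-by-row version is expected to run faster, even though both do the same number of additions.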

Take-home message: Well-written programs leverage data/instruction locality (which brings the cache into play) for better performance