1. Introduction: The Memory Wall Problem

Imagine a Formula 1 car forced to refuel through a drinking straw. This is the challenge a modern CPU faces: while the processor executes instructions in fractions of a nanosecond, an access to main memory (DRAM) can take hundreds of cycles. This performance gap is called the "memory wall," and it threatens to stall computational progress.

[Figure: The "Memory Wall" performance gap — CPU speed vs. DRAM speed]

The solution is a sophisticated memory hierarchy built on caching. Its layers work in concert to deliver more than 95% of requested data within 1-2 cycles, masking memory latency and enabling modern computing. In this guide, we cover cache write policies, allocation policies, performance measurement, and non-blocking caches.

2. The Caching Pyramid: L1, L2, and L3

Caches are small, ultrafast static RAM (SRAM) units that act as temporary storage between the CPU and the slower main memory (DRAM). Their design exploits temporal locality (recently accessed data is likely to be reused) and spatial locality (data adjacent to a recent access is likely to be needed soon).
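To make spatial locality concrete, here is a minimal sketch of a toy direct-mapped cache model. The line size, line count, and access patterns are illustrative assumptions, not parameters of any real CPU; the point is that sequential accesses reuse the line fetched by each miss, while large strides never do.

```python
LINE_SIZE = 8      # words per cache line (assumption for illustration)
NUM_LINES = 64     # lines in the toy direct-mapped cache

def simulate(addresses):
    """Return the hit rate for a sequence of word addresses."""
    cache = [None] * NUM_LINES          # one tag stored per line
    hits = 0
    for addr in addresses:
        tag, index = divmod(addr // LINE_SIZE, NUM_LINES)
        if cache[index] == tag:
            hits += 1                   # reuse of a cached line -> hit
        else:
            cache[index] = tag          # miss: fetch the whole line
    return hits / len(addresses)

sequential = list(range(4096))              # walks each line word by word
strided    = list(range(0, 4096 * 8, 8))    # jumps a full line every access

print(f"sequential hit rate: {simulate(sequential):.2%}")   # 7 of every 8 hit
print(f"strided hit rate:    {simulate(strided):.2%}")      # every access misses
```

Sequential traversal pays one miss per line and then hits on the remaining seven words; the strided pattern touches a new line on every access and gets no benefit from the cache at all.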

[Figure: Modern CPU cache structure — L1, L2, and L3 caches above main memory (DRAM)]

| Cache level | Size | Latency | Location | Associativity |
|---|---|---|---|---|
| L1 Instruction | 32-64 KB | 1-3 cycles | Per-core | 4-8-way set-associative |
| L1 Data | 32-64 KB | 1-3 cycles | Per-core | 8-12-way set-associative |
| L2 | 256 KB-2 MB | 8-12 cycles | Per-core/shared | 16-way set-associative |
| L3 (LLC) | 16-128 MB | 30-50 cycles | Shared across cores | 16-32-way set-associative |

3. Write Policies:

  1. Write-Through (WT): Every write updates both the cache and main memory simultaneously. Write latency is high because the CPU must wait for the slower DRAM write to complete, and bandwidth usage rises because each write generates duplicate traffic. It is used in critical systems requiring strong consistency, e.g., financial databases and RAID controllers.
  2. Write-Back (WB): Data is initially written only to the cache; main memory is updated later, e.g., when the cache line is evicted. Modified lines are marked with a "dirty bit." Write latency is low because the CPU proceeds as soon as the cache write completes, but data can be lost on a power failure, and multi-core systems require sophisticated coherence protocols. It is employed for write-intensive workloads, e.g., video rendering and scientific simulations.
  3. Write-Around (WA): Writes go straight to main memory, bypassing the cache; the cache is filled only if the data is read later. This reduces cache pollution from one-time writes, but incurs a high read-miss penalty if the bypassed data is accessed soon after. It is used in logging systems and other workloads with low read-after-write locality.
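The traffic difference between write-through and write-back can be sketched with a toy model. The workload, line size, and the assumption that each dirty line is written back exactly once are all illustrative simplifications, not a real cache implementation:

```python
def write_through(writes):
    """Write-through: every store also goes to main memory."""
    return len(writes)                      # one DRAM write per store

def write_back(writes, line_size=8):
    """Write-back: DRAM is updated only when a dirty line is evicted.
    Simplifying assumption: each dirty line is written back exactly once."""
    dirty_lines = {addr // line_size for addr in writes}
    return len(dirty_lines)                 # one write-back per dirty line

# 1000 stores hammering the same few addresses (e.g., updating counters):
workload = [i % 16 for i in range(1000)]    # touches only two cache lines

print("write-through DRAM writes:", write_through(workload))   # 1000
print("write-back DRAM writes:   ", write_back(workload))      # 2
```

For this write-intensive pattern, write-back collapses a thousand stores into two memory write-backs, which is exactly why it suits the workloads listed above.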

4. Allocation Policies: Handling Write Misses

When a write targets data not in the cache, policies decide whether to fetch the block:

| Policy | Mechanism | Performance Impact |
|---|---|---|
| Write-Allocate | Loads the block into the cache, then updates it | Benefits read-after-write sequences; increases write latency |
| No-Write-Allocate | Writes directly to memory, skipping the cache | Faster for isolated writes; subsequent reads suffer misses |

Common Pairings:

  • Write-Back + Write-Allocate: maximizes performance for repeated writes to the same lines, as in CPU L1 caches.
  • Write-Through + No-Write-Allocate: avoids needlessly filling the cache with data that won't be re-read, as in I/O buffers.

5. Measuring Cache Performance:

Hit Rate and Miss Rate: The hit rate is defined as "cache hits divided by total cache accesses." Hit rates greater than 95% are typical for well-tuned L1 caches; L2/L3 might see 80-90%.

Hit Rate (HR) = Cache Hits / Total Cache Accesses
Miss Rate (MR) = 1 - HR = Cache Misses / Total Cache Accesses
Miss Penalty (MP): Additional cycles required to service a miss from the next level, e.g., L2 or DRAM. If an L1 miss costs 10 cycles to fetch from L2, then MP = 10 cycles.
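A short worked example of these definitions, using illustrative numbers consistent with the text (a well-tuned L1 above 95% and the 10-cycle L1-to-L2 miss penalty mentioned above):

```python
accesses = 10_000
hits = 9_650

hit_rate  = hits / accesses            # 0.965 -> a typical well-tuned L1
miss_rate = 1 - hit_rate               # 0.035
misses    = accesses - hits            # 350

# If each L1 miss costs 10 extra cycles to fetch from L2:
miss_penalty_cycles = 10
stall_cycles = misses * miss_penalty_cycles

print(f"hit rate: {hit_rate:.1%}, miss rate: {miss_rate:.1%}")
print(f"stall cycles spent on misses: {stall_cycles}")
```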

Average Memory Access Time (AMAT):

AMAT gives a single-number view of performance impact:

AMAT = L1 Hit Time + (L1 Miss Rate × L1 Miss Penalty)

Expanding to two levels:

AMAT = T_L1 + MR_L1 × (T_L2 + MR_L2 × T_DRAM)
  • T_L1 = L1 hit latency
  • MR_L1 = L1 miss rate
  • T_L2 = L2 hit latency
  • MR_L2 = L2 miss rate
  • T_DRAM = DRAM access time
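Plugging illustrative numbers into the two-level formula (the latencies and miss rates below are assumptions for the sake of the example, roughly in line with the table in Section 2):

```python
T_L1, MR_L1 = 2, 0.05      # 2-cycle L1 hit, 5% L1 miss rate
T_L2, MR_L2 = 10, 0.20     # 10-cycle L2 hit, 20% L2 miss rate
T_DRAM      = 100          # DRAM access time in cycles

amat = T_L1 + MR_L1 * (T_L2 + MR_L2 * T_DRAM)
print(f"AMAT = {amat:.2f} cycles")   # 2 + 0.05 * (10 + 0.20 * 100) = 3.5
```

Even with a 100-cycle DRAM access, the average cost per memory access stays near the L1 hit time, because the hierarchy filters out the vast majority of slow accesses.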

6. Blocking vs. Non-Blocking Caches:

Blocking Cache

On a cache miss, the entire processor pipeline stalls until the data is fetched from a lower memory level. Only a single outstanding miss is allowed; subsequent requests wait in a queue. An analogy is a toll booth where cars wait in line and only one vehicle is processed at a time.

Non-Blocking Cache

A non-blocking cache keeps serving hits while permitting several unresolved misses, using Miss Status Handling Registers (MSHRs) to track the pending requests. An analogy is a drive-thru with parallel ordering stations: new cars can place orders while earlier orders are still being prepared.
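The benefit of overlapping misses can be sketched with a toy latency model. The miss latency, the MSHR count, and the assumption that concurrent misses overlap perfectly are all illustrative simplifications:

```python
import math

MISS_LATENCY = 100   # cycles to service one miss (assumption for illustration)

def blocking_cycles(num_misses):
    """Blocking cache: one outstanding miss at a time, fully serialized."""
    return num_misses * MISS_LATENCY

def non_blocking_cycles(num_misses, mshrs):
    """Non-blocking cache: up to `mshrs` misses overlap; each batch of
    concurrent misses costs a single MISS_LATENCY (perfect overlap)."""
    return math.ceil(num_misses / mshrs) * MISS_LATENCY

print("blocking, 8 misses:            ", blocking_cycles(8))          # 800
print("non-blocking, 8 misses, 4 MSHRs:", non_blocking_cycles(8, 4))  # 200
```

With four MSHRs, eight misses complete in two overlapped batches instead of eight serialized waits, which is the essence of memory-level parallelism.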

7. Conclusion:

Mastering cache memory is essential for breaking through the memory wall. Optimize your hierarchy, write and allocation policies, and advanced features such as non-blocking caches to accelerate applications from scientific simulation to gaming.