Tree-PLRU based L1 Eviction Timer

When the CPU loads a value into the L1 cache, it usually needs to choose an existing entry to evict first. There are different eviction strategies to do this, but most modern desktop CPUs use a tree based "Pseudo Least Recently Used" algorithm, Tree-PLRU in short. A given cache set is organized in a tree structure and every node stores which side of the tree has been accessed farther in the past (least recently used). Choosing a good candidate for eviction is then as simple as following the arrows in the tree.

This also means that in the worst case, an element that has been accessed once can stay in the cache forever. This happens if the neighbor in the tree gets accessed everytime the element is about to be evicted. We can use this to test for an L1 access using low precision timers as follows:

Fill the cache set with known values
Trigger the code to test, which might or might not access the cache set
Repeatedly access the known values from step 1, but keep the unknown element in the cache by accessing its neighbor when needed

The repeated accesses in step 3 will either trigger an L1 cache hit or a cache miss, depending on the outcome of step 2. As an example for the speed difference, a load from L1 takes 4 cycles on my machine, while an L2 load takes 12 cycles.

Check out the animation on the right and compare what happens when there's a cache hit from the instrumented code, i.e. the unknown element in red marked with an X. The element to keep it alive is marked with a K. You will see that they always end up as neighbors even though the initial state of the cache is randomized. Note how everytime the arrows point to X, we access K instead to keep it in the cache.

Repeating this access pattern allows us to increase the time difference until we can measure it with our low precision timer.

Simplified tree:
Speed:

Tree-PLRU based L1 Eviction Timer

Cache access in instrumented code

No cache access in instrumented code