These are ongoing notes on cache coherence, updated regularly.
Modern processors maintain private caches per core to speed up memory accesses. However, these private caches can hold copies of the same memory location, and a problem arises when two processors can see different values for the same location.
Each core has private L1 and L2 caches, and all cores share an L3.
Cache coherence algorithms maintain two invariants [Jerger and Martonosi]: the Single-Writer/Multiple-Reader (SWMR) invariant (at any moment a cache line has either a single writer or any number of readers) and the Data-Value invariant (a read returns the value of the most recent write).
These invariants are implemented inside the coherence controllers, which sit alongside each cache and observe the core's ld/st instructions.
There are two main schemes to implement these invariants: snooping and directory-based coherence.
Snooping. TLDR: whenever a cache line changes its state, broadcast the change to all other cores.
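The broadcast idea can be sketched as a toy simulator; the class and method names here are made up for illustration, not a real API:

```python
# Toy snooping sketch: every cache watches a shared "bus" and drops its copy
# when another core broadcasts a write to the same line.

class SnoopingCache:
    def __init__(self, core_id, bus):
        self.core_id = core_id
        self.lines = {}            # address -> "S" (Shared) or "M" (Modified)
        self.bus = bus
        bus.append(self)           # attach this cache to the shared bus

    def read(self, addr):
        self.lines.setdefault(addr, "S")

    def write(self, addr):
        # Broadcast the state change to *all* other caches on the bus.
        for cache in self.bus:
            if cache is not self:
                cache.snoop_invalidate(addr)
        self.lines[addr] = "M"

    def snoop_invalidate(self, addr):
        self.lines.pop(addr, None)   # drop our now-stale copy

bus = []
c0, c1 = SnoopingCache(0, bus), SnoopingCache(1, bus)
c0.read(0x40); c1.read(0x40)   # both cores share line 0x40
c1.write(0x40)                 # c1's write invalidates c0's copy
print(c0.lines)                # {} -> c0 no longer holds 0x40
print(c1.lines)                # {64: 'M'}
```

Note how every cache sees every write: this is exactly the all-transactions traffic that makes the bus the bottleneck.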
Directory-based. TLDR: instead of broadcasting cache-state changes to all other cores, send messages only to the cores that contain that particular cache line.
Similar to a phonebook, the directory is a little piece of SRAM that records the sharers of a particular cache block. We no longer need to see all transactions. Previously the bus was the bottleneck, a centralization point; now all caches send messages to the coherence directory. There is no need to watch every transaction on the bus: caches communicate only when necessary.
The directory stores all tags that are present in the L1 caches. The set-mapping functions of the directory and the L1s must be the same.
Example: if we have 2 CPUs with 4-way L1 caches, then the directory must be 8-way associative. However, growing the directory's width like this is not sustainable in terms of power and latency.
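The sizing rule in the example above is just a multiplication; a quick sketch (the function name is mine, the rule is from the text):

```python
# With identical set-mapping functions, each directory set must be able to hold
# every tag that any L1 could hold for that set, so:
#   directory ways per set = number of cores x L1 ways per set
def directory_ways(num_cores: int, l1_ways: int) -> int:
    return num_cores * l1_ways

print(directory_ways(2, 4))    # 2 CPUs x 4-way L1 -> 8-way directory
print(directory_ways(64, 8))   # 64 cores x 8-way L1 -> 512 ways per set
```

The second line shows why the width does not scale: a 512-way lookup on every coherence message is hopeless for power and latency.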
Current hardware uses sparse directories. See the second illustration: basically, we take some bits from the tag while indexing the directory. In this way, we can build directories in a narrower fashion.
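The tag-bits-into-the-index trick can be shown with a small sketch; the line size, set counts, and bit widths below are assumed parameters for illustration only:

```python
# Sparse-directory indexing sketch: fold a couple of low tag bits into the
# directory index. The directory then has more sets and needs fewer ways per
# set (narrower), at the cost of possible conflict evictions.

LINE_BITS = 6          # assume 64-byte cache lines
L1_SET_BITS = 6        # assume 64 sets per L1 cache
EXTRA_TAG_BITS = 2     # tag bits borrowed into the directory index

def l1_index(addr):
    return (addr >> LINE_BITS) & ((1 << L1_SET_BITS) - 1)

def sparse_dir_index(addr):
    # Directory index = L1 index bits plus two tag bits, i.e. 4x more sets.
    return (addr >> LINE_BITS) & ((1 << (L1_SET_BITS + EXTRA_TAG_BITS)) - 1)

a, b = 0x1000, 0x2000
print(l1_index(a), l1_index(b))                   # same L1 set
print(sparse_dir_index(a), sparse_dir_index(b))   # different directory sets
```

Two addresses that collide in an L1 set can land in different directory sets, which is what lets each directory set be narrower than cores x L1-ways.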
A bus is a shared communication channel that connects multiple components, while interconnects are point-to-point communication channels.
To scale communication beyond a shared bus, modern processors implement two types of point-to-point interconnects: rings and meshes.
Now, instead of broadcasting to everyone, each node asks its neighbors.
The Exclusive state reduces traffic on the bus while accessing the cache line: if a core holds a line Exclusive, it does not need to broadcast to the others.
However, scaling this communication through buses is not feasible; to overcome that, there are point-to-point interconnects.
Example (MSI protocol). Assume CPU 0 wants to read X. First, the cache controller checks whether X is already in the cache. If not, it asks the directory; the directory marks X as Shared and sends it to CPU 0.
CPU 1 also reads X, so the directory now sees X(S): [C_0, C_1]. Now CPU 0 wants to write X. Before writing, it must invalidate the copy in CPU 1, so the directory sends an INV message to CPU 1. When CPU 1 has invalidated its copy, it sends back an ACK. When the directory receives this ACK, it sends an ACK to CPU 0, and CPU 0 updates its state to Modified.
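The MSI exchange above can be replayed as a toy simulation. The message names (GetS, GetM, INV, ACK) follow the text; the class and method names are made up for illustration:

```python
# Toy MSI directory protocol: the directory tracks sharers per address and
# invalidates other copies before granting write (Modified) permission.

class Directory:
    def __init__(self, caches):
        self.caches = caches            # core id -> Cache
        self.sharers = {}               # addr -> set of sharer core ids

    def get_s(self, addr, core):        # GetS: core wants to read addr
        self.sharers.setdefault(addr, set()).add(core)
        return "S"                      # grant Shared

    def get_m(self, addr, core):        # GetM: core wants to write addr
        for other in self.sharers.get(addr, set()) - {core}:
            self.caches[other].invalidate(addr)  # INV to each other sharer;
            # each invalidated cache ACKs, and the directory collects the ACKs
        self.sharers[addr] = {core}     # the writer is now the only holder
        return "M"                      # final ACK: writer goes to Modified

class Cache:
    def __init__(self, core):
        self.core, self.state = core, {}
    def attach(self, directory):
        self.dir = directory
    def read(self, addr):
        self.state[addr] = self.dir.get_s(addr, self.core)
    def write(self, addr):
        self.state[addr] = self.dir.get_m(addr, self.core)
    def invalidate(self, addr):
        self.state[addr] = "I"

c0, c1 = Cache(0), Cache(1)
d = Directory({0: c0, 1: c1})
c0.attach(d); c1.attach(d)
c0.read("X"); c1.read("X")           # directory sees X(S): [C_0, C_1]
c0.write("X")                        # INV to C_1, ACK back, C_0 -> Modified
print(c0.state["X"], c1.state["X"])  # M I
print(d.sharers["X"])                # {0}
```

Only CPU 1 receives a message for the write, which is the whole point of the directory: communication happens per-sharer, not as a broadcast.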
TODO https://www.intel.com/content/www/us/en/developer/articles/technical/xeon-processor-scalable-family-technical-overview.html
TODO