What is KV-cache in transformer models?

KV-cache stores the key and value vectors already computed for previous tokens so that they do not have to be recalculated at every step of autoregressive inference. Without a cache, generating each new token means re-projecting every earlier token in the sequence into keys and values. With a cache, the model computes keys and values only for the newest token, appends them to the cache, and attends over the cached entries, dramatically reducing compute cost for long sequences. KVarN shows how quantising this cache can boost capacity 5x whilst maintaining accuracy.
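The mechanism can be sketched in plain Python for a single attention head. This is a minimal illustration under stated assumptions, not KVarN's or any library's implementation. The class `KVCache` and the functions `decode_step_cached` / `decode_step_uncached` are hypothetical names. The weights and token vectors are random stand-ins for a trained model's projections and hidden states, and there is no batching, multi-head split or quantisation. The sketch checks that reusing cached keys and values gives the same outputs as recomputing them from scratch at every step.

```python
import math
import random


def matvec(W, x):
    # W is a list of rows; returns the matrix-vector product W @ x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, keys, values):
    # Scaled dot-product attention of one query over all given keys/values.
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    weights = softmax(scores)
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]


class KVCache:
    """Hypothetical cache holding the key and value vector of every token seen so far."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)


def decode_step_cached(x, Wq, Wk, Wv, cache):
    # Only the newest token is projected; earlier keys/values come from the cache.
    cache.append(matvec(Wk, x), matvec(Wv, x))
    q = matvec(Wq, x)
    return attend(q, cache.keys, cache.values)


def decode_step_uncached(xs, Wq, Wk, Wv):
    # Recompute keys and values for every token in the prefix at every step.
    keys = [matvec(Wk, x) for x in xs]
    values = [matvec(Wv, x) for x in xs]
    q = matvec(Wq, xs[-1])
    return attend(q, keys, values)


def rand_matrix(d):
    return [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]


# Usage: decode 6 stand-in token vectors both ways and collect the outputs.
random.seed(0)
d = 4
Wq, Wk, Wv = rand_matrix(d), rand_matrix(d), rand_matrix(d)
tokens = [[random.gauss(0, 1) for _ in range(d)] for _ in range(6)]

cache = KVCache()
cached_out, uncached_out = [], []
for t in range(len(tokens)):
    # At step t the query attends causally to positions 0..t in both paths.
    cached_out.append(decode_step_cached(tokens[t], Wq, Wk, Wv, cache))
    uncached_out.append(decode_step_uncached(tokens[: t + 1], Wq, Wk, Wv))
```

Both paths produce the same outputs, but over T steps the uncached version performs T(T+1)/2 key projections (and as many value projections), while the cached version performs only T of each. The trade-off is memory: the cache keeps one key and one value vector per token, so it grows linearly with sequence length, which is the pressure that quantising the cache addresses.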