How does quantisation compress neural networks?

Quantisation reduces a neural network's size by representing weights and activations with fewer bits per value. Instead of 32-bit floating-point numbers, models use 16-bit, 8-bit, or even 1-bit integers. This dramatically cuts memory usage and speeds up inference. Quantisation is everywhere because it makes large models deployable on consumer hardware: BitNet shows 1-bit quantisation maintaining performance whilst slashing compute requirements. It's the difference between needing a data-centre GPU and running locally on your laptop.
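As a minimal sketch of the idea, here is symmetric per-tensor int8 quantisation in Python with NumPy. The function names are illustrative, not from any particular library; real frameworks add per-channel scales, calibration, and quantisation-aware training on top of this.

```python
import numpy as np

def quantise_int8(weights: np.ndarray):
    """Map float weights onto the int8 range [-127, 127] with one shared scale."""
    scale = np.max(np.abs(weights)) / 127.0  # assumes weights are not all zero
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantise(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights; rounding error is at most scale / 2."""
    return q.astype(np.float32) * scale

# A float32 tensor shrinks 4x when stored as int8 (plus one scale factor).
w = np.random.randn(1024).astype(np.float32)
q, s = quantise_int8(w)
print(w.nbytes, q.nbytes)  # 4096 vs 1024 bytes
```

Going from 8 bits down to the 1-bit regime that BitNet targets replaces the integer grid with ternary or binary values, but the storage arithmetic is the same: fewer bits per weight, smaller model.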