Authors: Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, Xia Hu
Published on: February 05, 2024
Impact Score: 8.07
Arxiv code: arXiv:2402.02750
Summary
- What is new: KIVI, a tuning-free 2-bit KV cache quantization algorithm that significantly reduces peak memory usage and increases inference throughput for large language models.
- Why this is important: The key-value (KV) cache in large language models consumes a growing share of memory during inference, which slows generation and raises serving costs.
- What the research proposes: An asymmetric quantization scheme in which the key cache is quantized per-channel and the value cache per-token, drastically reducing memory usage without sacrificing model quality (a minimal sketch of the idea follows this list).
- Results: Enables Llama-2, Falcon, and Mistral models to use 2.6 times less peak memory and run with batch sizes up to 4 times larger, achieving 2.35 to 3.47 times higher throughput.
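To make the per-channel/per-token split concrete, below is a minimal PyTorch sketch of asymmetric 2-bit quantization applied to a toy KV cache. It illustrates the idea only and is not the authors' implementation: the `quantize_2bit` and `dequantize` helpers, the tensor shapes, and the choice of one scale and zero-point per whole channel or token are assumptions (the actual method quantizes in smaller groups and relies on the hardware-friendly implementation noted under Technical Details).

```python
import torch

def quantize_2bit(x: torch.Tensor, dim: int):
    """Asymmetric 2-bit quantization along `dim` (integer codes 0..3)."""
    xmin = x.amin(dim=dim, keepdim=True)
    xmax = x.amax(dim=dim, keepdim=True)
    scale = (xmax - xmin).clamp(min=1e-8) / 3.0            # 2 bits -> 4 quantization levels
    codes = torch.clamp(torch.round((x - xmin) / scale), 0, 3).to(torch.uint8)
    return codes, scale, xmin                               # scale/zero-point kept per group

def dequantize(codes, scale, xmin):
    return codes.to(scale.dtype) * scale + xmin

# Toy single-head KV cache: (num_tokens, head_dim)
K = torch.randn(128, 64)
V = torch.randn(128, 64)

# Key cache: per-channel -> one scale/zero-point per channel, min/max taken over tokens (dim=0)
qK, sK, zK = quantize_2bit(K, dim=0)
# Value cache: per-token -> one scale/zero-point per token, min/max taken over channels (dim=1)
qV, sV, zV = quantize_2bit(V, dim=1)

print("mean |K - dequant(K)|:", (K - dequantize(qK, sK, zK)).abs().mean().item())
print("mean |V - dequant(V)|:", (V - dequantize(qV, sV, zV)).abs().mean().item())
```

In a real deployment the 2-bit codes would also be bit-packed, four values per byte, rather than stored one per uint8 as in this toy example; that packing is where the bulk of the memory saving comes from.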
Technical Details
Technological frameworks used: A hardware-friendly implementation of the KIVI algorithm (a rough KV-cache memory estimate follows this list)
Models used: Llama-2, Falcon, Mistral
Data used: Analysis of the element distribution in the KV cache
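As a rough sanity check on why a 2-bit KV cache allows much larger batch sizes, the back-of-the-envelope calculation below counts only the KV cache for a Llama-2-7B-sized model (32 layers, 32 KV heads, head dimension 128). The sequence length, group size, and FP16 scale/zero-point overhead are assumed, illustrative values rather than numbers from the paper; the reported 2.6-times figure refers to total peak memory, which also includes weights and activations, so the KV-cache-only ratio here comes out higher.

```python
# Illustrative KV-cache sizing for a Llama-2-7B-like model (assumed config),
# comparing FP16 storage with 2-bit storage plus per-group quantization metadata.
layers, kv_heads, head_dim = 32, 32, 128   # Llama-2-7B uses full multi-head attention
seq_len = 4096                             # assumed sequence length
group_size = 32                            # elements per scale/zero-point group (assumed)

elems = 2 * layers * kv_heads * head_dim * seq_len         # K and V elements per sequence
fp16_bytes = elems * 2                                     # 16 bits per element
q2_bytes = elems * 2 // 8                                  # 2 bits per element
meta_bytes = (elems // group_size) * 2 * 2                 # FP16 scale + zero-point per group

print(f"FP16 KV cache : {fp16_bytes / 2**20:.0f} MiB per sequence")
print(f"2-bit KV cache: {(q2_bytes + meta_bytes) / 2**20:.0f} MiB per sequence")
```

Under these assumptions a single 4K-token sequence drops from roughly 2 GiB of KV cache to under 400 MiB, which is consistent with fitting several times more sequences, and therefore a larger batch, on the same GPU.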
Potential Impact
Cloud computing providers, AI service platforms, and companies relying on large language models for text generation or analysis could significantly reduce operational costs and improve service efficiency.
Want to implement this idea in a business?
We have generated a startup concept here: QuantaCache.