Authors: Raghav Addanki, Chenyang Li, Zhao Song, Chiwun Yang
Published on: November 24, 2023
Impact Score: 8.38
arXiv ID: 2311.14652
Summary
- What is new: A one-pass streaming algorithm that approximates attention over super-long token sequences in sublinear space for Large Language Models.
- Why this is important: Deploying LLMs in streaming applications with contexts longer than 128K tokens is constrained by the high space complexity of storing and attending over the full context.
- What the research proposes: An algorithm that processes tokens in a streaming fashion using three sketch matrices, significantly reducing memory usage.
- Results: As the token length grows, the algorithm preserves its error guarantee while memory usage remains nearly constant, demonstrating substantial memory efficiency.
Technical Details
Techniques used: polynomial-method approximation of the attention output, computed by a one-pass streaming algorithm.
Models used: Single-layer self-attention with Query, Key, and Value matrices.
Data setting: super-long token sequences, with context lengths beyond 128K.
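The paper's actual algorithm maintains three sketch matrices with formal guarantees; as a loose, hypothetical illustration of the underlying polynomial-method idea (not the authors' construction), the degree-2 Taylor approximation exp(t) ≈ 1 + t + t²/2 turns the softmax kernel into an inner product of feature vectors, so attention can be accumulated in a single pass with memory that does not grow with the number of tokens:

```python
import numpy as np

def poly_features(x):
    # Degree-2 Taylor feature map: <phi(q), phi(k)> = 1 + <q,k> + <q,k>^2 / 2,
    # which approximates exp(<q,k>) when the inner products are small.
    return np.concatenate(([1.0], x, np.outer(x, x).ravel() / np.sqrt(2.0)))

class StreamingAttentionSketch:
    """One-pass accumulator whose memory depends on d, not on token count n."""
    def __init__(self, d):
        m = 1 + d + d * d          # feature dimension of the degree-2 map
        self.M = np.zeros((m, d))  # running sum of phi(k_i) v_i^T
        self.z = np.zeros(m)       # running sum of phi(k_i)

    def update(self, k, v):
        # Consume one (key, value) token pair from the stream.
        f = poly_features(k)
        self.M += np.outer(f, v)
        self.z += f

    def attend(self, q):
        # Approximate softmax attention output for query q.
        f = poly_features(q)
        return (f @ self.M) / (f @ self.z)
```

The accumulator stores O(d²) numbers regardless of how many tokens have streamed by, which is the sense in which sketch-based streaming sidesteps the O(n) cost of keeping the full key/value context.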
Potential Impact
Streaming services, cloud computing platforms, and companies deploying large language models could greatly benefit or face disruption.
Want to implement this idea in a business?
We have generated a startup concept here: StreamlineAI.