Authors: Minchen Yu, Ao Wang, Dong Chen, Haoxuan Yu, Xiaonan Luo, Zhuohao Li, Wei Wang, Ruichuan Chen, Dapeng Nie, Haoran Yang
Published on: June 06, 2023
Impact Score: 8.22
arXiv ID: 2306.03622
Summary
- What is new: The introduction of FaaSwap, a GPU-efficient serverless inference platform that keeps models in host memory and dynamically swaps them onto GPUs on demand, letting many functions share a small pool of GPUs while meeting latency SLOs.
- Why this is important: Current serverless platforms lack efficient GPU support, which limits their usefulness for low-latency machine-learning inference.
- What the research proposes: FaaSwap combines asynchronous API redirection, GPU runtime sharing, pipelined model execution, and efficient GPU memory management, coordinated by an interference-aware request scheduling algorithm (a sketch of the swap-and-pipeline idea follows this list).
- Results: FaaSwap can serve hundreds of functions on a single node with 4 V100 GPUs, achieving performance comparable to native execution. On a 6-node testbed, it meets latency SLOs for over 1k functions.
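To make the swap-and-pipeline idea concrete, here is a minimal sketch assuming PyTorch and CUDA streams. The helper name and the two-stream layout are illustrative assumptions, not FaaSwap's actual implementation; the point is that the host-to-GPU copy of layer i+1 overlaps with the computation of layer i, hiding most of the swap latency.

```python
# Minimal sketch of swap-in with pipelined model execution, assuming PyTorch
# and CUDA streams; this is an illustration, not FaaSwap's actual code.
import torch
import torch.nn as nn

def pipelined_swap_in_and_run(layers_cpu: list[nn.Module], x: torch.Tensor) -> torch.Tensor:
    """Run a model whose weights are cached in (ideally pinned) host memory,
    overlapping the host-to-GPU copy of layer i+1 with the compute of layer i."""
    copy_stream = torch.cuda.Stream()
    compute_stream = torch.cuda.Stream()
    x = x.cuda()
    # Make sure the compute stream sees the input transfer issued above.
    compute_stream.wait_stream(torch.cuda.current_stream())

    # Prefetch the first layer's weights on the copy stream.
    with torch.cuda.stream(copy_stream):
        layers_cpu[0].to("cuda", non_blocking=True)

    for i, layer in enumerate(layers_cpu):
        # This layer may only run once its weights have arrived.
        compute_stream.wait_stream(copy_stream)
        with torch.cuda.stream(compute_stream):
            x = layer(x)
        # While layer i computes, start transferring layer i+1.
        if i + 1 < len(layers_cpu):
            with torch.cuda.stream(copy_stream):
                layers_cpu[i + 1].to("cuda", non_blocking=True)

    torch.cuda.synchronize()  # drain both streams before returning the result
    return x
```

For the non-blocking copies to actually overlap with compute, the host-resident weights must live in pinned memory; swapping a model out again is the symmetric `layer.to("cpu")` move once GPU memory is needed elsewhere.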
Technical Details
Technological frameworks used: FaaSwap
Algorithms used: Interference-aware request scheduling (see the sketch below)
Data used: Real-world use cases from a leading commercial serverless platform
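To illustrate what interference-aware scheduling can mean in this setting, the sketch below uses a deliberately simple additive latency model: a request's predicted latency on a GPU is its queued work, plus a swap-in cost if the model is not yet resident, plus a fixed slowdown per co-located model. Every name and cost constant here is a hypothetical stand-in; the paper's actual estimator is more involved.

```python
# Hypothetical sketch of interference-aware scheduling; the constants and the
# additive latency model are illustrative assumptions, not the paper's estimator.
from dataclasses import dataclass, field

@dataclass
class Gpu:
    gpu_id: int
    resident_models: set = field(default_factory=set)  # models currently swapped in
    queued_ms: float = 0.0                             # outstanding work on this GPU

SWAP_IN_MS = 40.0      # assumed host-to-GPU transfer cost for one model
INTERFERENCE_MS = 5.0  # assumed slowdown per co-located resident model

def predict_latency_ms(gpu: Gpu, model_id: str, exec_ms: float) -> float:
    """Predicted completion time: queueing + (possible) swap-in + execution
    + a simple interference penalty that grows with co-located models."""
    swap = 0.0 if model_id in gpu.resident_models else SWAP_IN_MS
    return gpu.queued_ms + swap + exec_ms + INTERFERENCE_MS * len(gpu.resident_models)

def schedule(gpus: list[Gpu], model_id: str, exec_ms: float, slo_ms: float):
    """Place the request on the GPU with the lowest predicted latency,
    or return None if no GPU can meet the SLO (caller may queue or scale out)."""
    best = min(gpus, key=lambda g: predict_latency_ms(g, model_id, exec_ms))
    if predict_latency_ms(best, model_id, exec_ms) > slo_ms:
        return None
    best.queued_ms += exec_ms
    best.resident_models.add(model_id)
    return best.gpu_id

# Example: two idle GPUs; the request lands on GPU 0 and becomes resident there.
gpus = [Gpu(0), Gpu(1)]
print(schedule(gpus, "resnet50", exec_ms=12.0, slo_ms=100.0))  # -> 0
```

A scheduler of this shape naturally prefers GPUs where the model is already resident (no swap cost) and that are lightly loaded, and it can reject or queue requests whose SLO cannot be met.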
Potential Impact
Serverless computing providers and companies that rely on machine-learning inference could benefit significantly; the approach may also disrupt traditional cloud computing models.
Want to implement this idea in a business?
We have generated a startup concept here: InferFlow.