Authors: Bo Chen, Xingyi Cheng, Pan Li, Yangli-ao Geng, Jing Gong, Shen Li, Zhilei Bei, Xu Tan, Boyan Wang, Xin Zeng, Chiming Liu, Aohan Zeng, Yuxiao Dong, Jie Tang, Le Song
Published on: January 11, 2024
Impact Score: 8.3
arXiv ID: arXiv:2401.06199
Summary
- What is new: A unified protein language model, xTrimoPGLM, that handles protein understanding and generation tasks simultaneously at a scale of 100 billion parameters and 1 trillion training tokens.
- Why this is important: Existing protein language models are typically specialized for either understanding (encoder-style) or generation (decoder-style), and struggle to perform both kinds of task in a single model.
- What the research proposes: xTrimoPGLM, which combines autoencoding and autoregressive pre-training objectives in one backbone so that a single model serves both task families (see the training-objective sketch after this list).
- Results: xTrimoPGLM surpasses baselines on understanding benchmarks and improves 3D structure prediction, while also generating high-quality de novo protein sequences and supporting programmable generation after supervised fine-tuning (SFT).
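To make the unified objective concrete, below is a minimal PyTorch sketch of GLM-style blank infilling, the mechanism that lets one objective cover both autoencoding and autoregression: the visible context (Part A) ends in a mask token, and the blanked span (Part B) is predicted left-to-right, with loss taken only on the span. The function name, the single-span simplification, and the toy logits are illustrative assumptions; the actual model samples multiple spans and uses 2D positional encodings plus a custom attention mask.

```python
import torch
import torch.nn.functional as F

def glm_blank_infilling_example(tokens, mask_id, sop_id, span_frac=0.25):
    """Build one simplified GLM-style blank-infilling example.

    Part A = prefix + [MASK]       (bidirectional context in the real model)
    Part B = [SOP] + blanked span  (predicted autoregressively)
    Loss is taken only on Part B, so short spans behave like masked-LM
    training (understanding) and long spans like causal-LM training
    (generation).
    """
    n = tokens.size(0)
    span_len = max(1, int(n * span_frac))
    prefix, span = tokens[: n - span_len], tokens[n - span_len :]
    part_a = torch.cat([prefix, torch.tensor([mask_id])])
    part_b_inputs = torch.cat([torch.tensor([sop_id]), span[:-1]])
    inputs = torch.cat([part_a, part_b_inputs])
    # -100 tells cross_entropy to ignore the Part A positions.
    targets = torch.cat([torch.full((part_a.size(0),), -100), span])
    return inputs, targets

# Toy usage: 20 amino-acid tokens plus [MASK]=20 and [SOP]=21.
seq = torch.randint(0, 20, (64,))
inputs, targets = glm_blank_infilling_example(seq, mask_id=20, sop_id=21)
logits = torch.randn(inputs.size(0), 22)   # stand-in for model(inputs)
loss = F.cross_entropy(logits, targets, ignore_index=-100)
```

Varying the span length is what trades off the two behaviors: masking many short spans approximates BERT-style understanding, while blanking one long tail span approximates GPT-style generation.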
Technical Details
Technological frameworks used: The GLM (General Language Model) backbone, which unifies masked (autoencoding) and autoregressive (blank-infilling) pre-training in a single transformer.
Models used: xTrimoPGLM, a unified protein language model with autoencoding and autoregressive capabilities.
Data used: Approximately 1 trillion tokens of protein sequence data.
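For de novo sequence generation, the trained model should be usable through standard causal-LM tooling. The sketch below uses the Hugging Face transformers API under the assumption that a hosted checkpoint exists; the checkpoint id is a placeholder, and the exact loading arguments depend on the published weights (custom model code would require trust_remote_code=True).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoint id -- substitute whatever xTrimoPGLM weights
# you actually have access to.
CKPT = "biomap-research/xtrimopglm-generation"

tokenizer = AutoTokenizer.from_pretrained(CKPT, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    CKPT,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="auto",
)

# De novo generation: start from a minimal prompt (or a motif to
# condition on) and sample amino-acid tokens.
inputs = tokenizer("M", return_tensors="pt").to(model.device)
out = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,
    top_p=0.95,
    temperature=1.0,
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

Programmable generation after SFT would follow the same interface, with property tags or conditioning prompts prepended to the sequence.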
Potential Impact
Biotechnology and pharmaceutical companies, the computational biology market, protein engineering, and drug discovery industries.
Want to implement this idea in a business?
We have generated a startup concept here: BioInnovateAI.