9th March 2024
BitNet Transformer: Scaling 1-bit Transformers for Large Language Models
BitNet Transformer is an architecture that scales 1-bit Transformers for large language models. It achieves competitive performance while substantially reducing memory footprint and energy consumption compared with state-of-the-art 8-bit quantization methods and FP16 Transformer baselines.
Key Features:
- BitLinear: A drop-in replacement for PyTorch's nn.Linear layer, enabling 1-bit weights to be trained from scratch (see the sketch after this list).
- Scalable and Stable: Designed to train stably and efficiently as model size grows, making it suitable for large language models.
- Competitive Performance: Achieves competitive results in terms of perplexity and downstream task accuracy compared to baselines.
- Significant Energy Savings: Provides substantial energy cost reductions, especially as the model size scales up.
- Scaling Law: Exhibits a scaling law akin to full-precision Transformers, suggesting its potential for effective scaling to even larger language models.
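BitLinear keeps the public interface of nn.Linear while constraining the weights to 1 bit. The snippet below is a minimal sketch of how such a layer might look, assuming sign-based weight binarization, a mean-absolute-value rescaling factor, and a straight-through estimator for gradients; the class name BitLinearSketch and all implementation details are illustrative assumptions, not the released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BitLinearSketch(nn.Module):
    """Illustrative 1-bit linear layer; not the released BitNet implementation.

    A full-precision shadow weight is kept for the optimizer, but the forward
    pass uses weights binarized to {-1, +1}. A straight-through estimator lets
    gradients flow through the non-differentiable sign() so the layer can be
    trained from scratch as a stand-in for nn.Linear.
    """

    def __init__(self, in_features: int, out_features: int, bias: bool = True):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        self.bias = nn.Parameter(torch.zeros(out_features)) if bias else None
        nn.init.kaiming_uniform_(self.weight, a=5 ** 0.5)  # same init as nn.Linear

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Center the weights, then binarize; beta rescales the output so its
        # magnitude roughly matches that of a full-precision linear layer.
        w_centered = self.weight - self.weight.mean()
        beta = w_centered.abs().mean()
        w_bin = torch.sign(w_centered)
        # Straight-through estimator: forward uses w_bin, backward sees w_centered.
        w_ste = w_centered + (w_bin - w_centered).detach()
        out = F.linear(x, w_ste) * beta
        if self.bias is not None:
            out = out + self.bias
        return out


# Drop-in usage: swap an nn.Linear for the sketch inside an existing model.
layer = BitLinearSketch(512, 1024)
y = layer(torch.randn(8, 512))
print(y.shape)  # torch.Size([8, 1024])
```

Keeping a full-precision shadow weight for the optimizer while binarizing only on the forward pass is what makes training from scratch possible, as opposed to post-training quantization of an FP16 model.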
Availability: