A Gentle Introduction to 8-bit Matrix Multiplication for transformers at scale using transformers, accelerate and bitsandbytes (Aug 17, 2022)
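The post covers the LLM.int8() integration between transformers and bitsandbytes. As a rough, non-authoritative sketch of the kind of usage it describes (the model id below is an illustrative placeholder; a current transformers, accelerate, and bitsandbytes install is assumed):

```python
# Minimal sketch: loading a model with 8-bit weights via bitsandbytes.
# The model id (facebook/opt-350m) is only an illustrative choice.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "facebook/opt-350m"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# LLM.int8(): weights are stored in int8, matmuls mix int8 and fp16 paths.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```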
Post: Check out quantized weights from ISTA-DAS Lab directly on their organisation page: https://huggingface.co/ISTA-DASLab, with official weights for AQLM (2-bit quantization) & QMoE (1-bit MoE quantization).

Read more about these techniques below:
- AQLM paper: Extreme Compression of Large Language Models via Additive Quantization (2401.06118)
- QMoE paper: QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models (2310.16795)

Some useful links below:
- AQLM repo: https://github.com/Vahe1994/AQLM
- How to use AQLM with transformers: https://huggingface.co/docs/transformers/quantization#aqlm
- How to use AQLM with PEFT: https://huggingface.co/docs/peft/developer_guides/quantization#aqlm-quantizaion (see the sketch below)

Great work from @BlackSamorez and team!
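Since the PEFT doc linked above covers fine-tuning on AQLM bases, here is a minimal, non-authoritative sketch of attaching LoRA adapters to an AQLM-quantized checkpoint. The model id and target_modules are illustrative assumptions; `aqlm`, `peft`, `accelerate`, and a recent transformers release are assumed installed.

```python
# Minimal sketch: LoRA (PEFT) on top of a frozen AQLM-quantized base model.
# The base_id and target_modules below are placeholders, not verified values.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_id = "ISTA-DASLab/<aqlm-quantized-model>"  # placeholder, pick a repo from the org page
base = AutoModelForCausalLM.from_pretrained(
    base_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Only the small LoRA matrices are trained; the quantized AQLM weights stay frozen.
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # adjust to the base architecture
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()
```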
Post: Try out Mixtral 2-bit on a free-tier Google Colab notebook right now! https://colab.research.google.com/drive/1-xZmBRXT5Fm3Ghn4Mwa2KRypORXb855X?usp=sharing

The AQLM method has recently been introduced on the transformers main branch.
- The 2-bit model can be found here: BlackSamorez/Mixtral-8x7b-AQLM-2Bit-1x16-hf-test-dispatch
- You can read more about the method here: https://huggingface.co/docs/transformers/main/en/quantization#aqlm

Great work @BlackSamorez and team!
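A minimal sketch of what the Colab notebook does, assuming `aqlm` is installed and transformers includes AQLM support (main branch at the time of the post); the prompt and generation settings are illustrative:

```python
# Minimal sketch: load the 2-bit AQLM Mixtral checkpoint and generate text.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "BlackSamorez/Mixtral-8x7b-AQLM-2Bit-1x16-hf-test-dispatch"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",  # dispatches layers across the free-tier GPU and CPU
)

inputs = tokenizer("The best thing about 2-bit quantization is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```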
Sharded checkpoints

Useful sharded checkpoints for users to run inference / fine-tuning on a Google Colab without having to deal with CPU OOM issues, as sketched below.
- ybelkada/falcon-7b-sharded-bf16 (Text Generation)
- ybelkada/blip2-opt-2.7b-fp16-sharded (Visual Question Answering)
- ybelkada/flan-t5-xl-sharded-bf16 (Text2Text Generation)
- ybelkada/mpt-7b-bf16-sharded (Text Generation)
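A minimal sketch of loading one of these sharded checkpoints on a memory-constrained Colab; `accelerate` is assumed installed, and `trust_remote_code` may or may not be needed depending on the checkpoint's config:

```python
# Minimal sketch: load a sharded checkpoint without materializing the whole
# model in CPU RAM. With device_map="auto" (via accelerate) the shards are
# loaded one by one and dispatched to GPU/CPU, avoiding CPU OOM on free-tier Colab.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ybelkada/falcon-7b-sharded-bf16"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",        # shards are streamed and dispatched incrementally
    low_cpu_mem_usage=True,   # avoid allocating the full model on CPU first
    trust_remote_code=True,   # may be required depending on the checkpoint's config
)

inputs = tokenizer("Falcon is a", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```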