How to split tensors to x shards?
#1
by Ede-CH - opened
Could you provide the script for splitting the original tensors into 8 shards?
If you only want to run inference, you can pass mp_size=8 directly as a parameter to deepspeed.init_inference().
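As a rough illustration of that suggestion, the sketch below passes mp_size to deepspeed.init_inference() after loading an unsharded checkpoint. It is not taken from this repository: the helper names, the use of AutoModelForCausalLM, and the choice of replace_with_kernel_inject=True are assumptions made for the example, and recent DeepSpeed releases may prefer a different spelling for the tensor-parallel setting.

```python
def inference_kwargs(mp_size: int = 8) -> dict:
    """Keyword arguments for deepspeed.init_inference (hypothetical helper).

    mp_size is the tensor-parallel degree, normally one shard per GPU.
    """
    return {"mp_size": mp_size, "replace_with_kernel_inject": True}


def shard_at_load_time(model_name: str, mp_size: int = 8):
    """Load a full fp16 checkpoint, then let DeepSpeed split it across GPUs.

    Assumption: model_name points at an unsharded checkpoint that
    AutoModelForCausalLM can load, not at the pre-sharded files in this repo.
    """
    # Imported lazily so the file can be read on a machine without GPUs.
    import torch
    import deepspeed
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
    # Every rank loads the whole model first; the split happens inside this call.
    return deepspeed.init_inference(model, dtype=torch.float16, **inference_kwargs(mp_size))
```

A script like this would typically be started with the DeepSpeed launcher so that one process runs per GPU. Because every process loads the full weights before the split, start-up can be slow for a model of this size.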
Thanks for your reply! As I understand it, this parameter splits the model weights into eight parts using tensor parallelism (TP) only after the full weights have been loaded. Because the weights have not been sharded for TP in advance, loading can take a long time. In the weight files you provide, each file stores only one slice of each matrix, so it can be loaded directly. Could you please share the script used to pre-shard the weights?
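To make the idea of pre-sharding concrete, here is a minimal NumPy sketch of tensor-parallel slicing for a two-layer MLP. It is not the script used to produce the files in this repository. The function names, the toy shapes, the [out, in] weight layout, and the use of ReLU in place of BLOOM's activation are all assumptions chosen to keep the example small.

```python
import numpy as np


def shard_column_parallel(weight: np.ndarray, num_shards: int) -> list:
    """Split a [out, in] weight along the output dimension (axis 0)."""
    return np.split(weight, num_shards, axis=0)


def shard_row_parallel(weight: np.ndarray, num_shards: int) -> list:
    """Split a [out, in] weight along the input dimension (axis 1)."""
    return np.split(weight, num_shards, axis=1)


rng = np.random.default_rng(0)
hidden, ffn, num_shards = 16, 64, 8          # toy sizes, not BLOOM's
w_up = rng.standard_normal((ffn, hidden))    # first MLP layer, [out, in]
w_down = rng.standard_normal((hidden, ffn))  # second MLP layer, [out, in]
x = rng.standard_normal((4, hidden))         # a small batch of activations

# Pre-shard once; in practice each slice would be written to its own file.
up_shards = shard_column_parallel(w_up, num_shards)
down_shards = shard_row_parallel(w_down, num_shards)

# Each rank uses only its own slices and produces a partial output.
partials = [np.maximum(x @ u.T, 0.0) @ d.T for u, d in zip(up_shards, down_shards)]

# Summing the partial outputs plays the role of the all-reduce across ranks.
sharded_out = sum(partials)
full_out = np.maximum(x @ w_up.T, 0.0) @ w_down.T
assert np.allclose(sharded_out, full_out)
```

Since each rank needs only its own slice of every matrix, saving the slices ahead of time would let each process read just its own file instead of the whole checkpoint.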