Big model inference benchmarks

Running inference with Accelerate on big models.

Setup

These benchmarks use the transformers library:

pip install transformers

To reproduce the results or test a new setup, run:

python big_model_inference.py model_name

This script supports gpt-j-6b, gpt-neox, opt (30B version) and T0pp out of the box, but you can specify any valid checkpoint for model_name.

To force a torch_dtype different from the one in the config, pass --torch_dtype xxx.

If you get an error related to disk offload, add the option --disk-offload.
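
For orientation, the kind of loading call such a benchmark exercises can be sketched as follows. This is not the benchmark script itself: the helper names `build_load_kwargs` and `load_model` and the `offload` folder name are assumptions made for illustration. `device_map`, `torch_dtype` and `offload_folder` are existing arguments of `from_pretrained` in transformers.

```python
def build_load_kwargs(torch_dtype=None, disk_offload=False, offload_folder="offload"):
    """Collect keyword arguments for from_pretrained (hypothetical helper)."""
    # device_map="auto" lets Accelerate spread the layers over the available devices.
    kwargs = {"device_map": "auto"}
    if torch_dtype is not None:
        # Kept as a name such as "float16" here; resolved to a torch dtype in load_model.
        kwargs["torch_dtype"] = torch_dtype
    if disk_offload:
        # Folder (assumed name) that receives the weights offloaded to disk.
        kwargs["offload_folder"] = offload_folder
    return kwargs


def load_model(model_name, **options):
    """Load a causal LM; needs torch, transformers and accelerate installed."""
    # Imported lazily so build_load_kwargs can be used without these packages.
    import torch
    from transformers import AutoModelForCausalLM

    kwargs = build_load_kwargs(**options)
    if "torch_dtype" in kwargs:
        kwargs["torch_dtype"] = getattr(torch, kwargs["torch_dtype"])
    # Seq2seq checkpoints such as T0pp would need AutoModelForSeq2SeqLM instead.
    return AutoModelForCausalLM.from_pretrained(model_name, **kwargs)
```

Calling `load_model("some/checkpoint", torch_dtype="float16", disk_offload=True)` would then download the checkpoint and dispatch it across devices; the checkpoint name here is a placeholder.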

Results

On a setup with two Titan RTX GPUs (24GB of memory each) and 32GB of CPU RAM, we get the following benchmarks (T0pp does not run in float16, so only its float32 result is listed).

| Model | Model load time | Generation time | dtype | GPU 0 use | GPU 1 use | CPU use | Disk offload |
|:-----:|:---------------:|:---------------:|:-----:|:---------:|:---------:|:-------:|:------------:|
| GPT-J-6B | 8.7s | 0.05s per token | float16 | 11.7GB | 0GB | 0GB | no |
| GPT-J-6B | 12.4s | 0.06s per token | float32 | 21.9GB | 1.5GB | 0GB | no |
| GPT-Neo-X-20B | 30.9s | 0.08s per token | float16 | 21.5GB | 18GB | 0GB | no |
| GPT-Neo-X-20B | 78.2s | 10.72s per token | float32 | 20.3GB | 22.7GB | 24.4GB | yes |
| T0pp (11B) | 29.4s | 0.05s per token | float32 | 21.1GB | 21.3GB | 0GB | no |
| OPT-30B | 34.5s | 2.37s per token | float16 | 20.7GB | 22.3GB | 14.1GB | no |
| OPT-30B | 112.3s | 33.9s per token | float32 | 20.2GB | 21.2GB | 23.5GB | yes |

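Notes on the results:

- Using two GPUs instead of one does not slow down generation.
- Using CPU offload slows generation down (see OPT-30B in float16, at 2.37s per token).
- Using disk offload slows generation down a lot (10.72s and 33.9s per token in the table); prefetching still needs to be implemented.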
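
The README does not say how the per-token generation time is measured. One plausible way to obtain such a figure is to time a whole generation call and divide by the number of new tokens. The sketch below assumes exactly that; `seconds_per_token` and `fake_generate` are hypothetical names, and `fake_generate` merely stands in for a real `model.generate(...)` call.

```python
import time


def seconds_per_token(generate_fn, num_new_tokens):
    """Return the average wall-clock seconds spent per generated token."""
    if num_new_tokens <= 0:
        raise ValueError("num_new_tokens must be positive")
    start = time.perf_counter()
    generate_fn(num_new_tokens)  # run the whole generation once
    elapsed = time.perf_counter() - start
    return elapsed / num_new_tokens


def fake_generate(num_new_tokens):
    """Stand-in for a model call: sleeps about 1 ms per requested token."""
    time.sleep(0.001 * num_new_tokens)


if __name__ == "__main__":
    print(f"{seconds_per_token(fake_generate, 20):.4f}s per token")
```

With a real model, the first call often includes warm-up costs, so averaging over several runs would give a steadier number.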

You will also note that Accelerate does not use any more GPU or CPU RAM than necessary:

- Peak GPU memory is exactly the size of the part of the model placed on a given GPU.
- Peak CPU memory is either the size of the biggest checkpoint shard or the part of the model offloaded to the CPU, whichever is bigger.
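
These figures can be sanity-checked with simple arithmetic: the weights take roughly the number of parameters times the bytes per parameter. The sketch below assumes nominal, rounded parameter counts (6B and 20B) and 1 GB = 1e9 bytes, neither of which is stated in the table, and the 10.0GB shard size is purely hypothetical.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weights_size_gb(num_params, dtype):
    """Approximate size of the weights in GB (assuming 1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def peak_cpu_memory_gb(largest_shard_gb, cpu_offloaded_gb):
    """Peak CPU RAM under the rule stated above: the larger of the two values."""
    return max(largest_shard_gb, cpu_offloaded_gb)


# Nominal parameter counts (assumed, rounded).
print(weights_size_gb(6e9, "float16"))   # 12.0
print(weights_size_gb(20e9, "float16"))  # 40.0
# Hypothetical 10.0GB largest shard with 14.1GB offloaded to the CPU.
print(peak_cpu_memory_gb(10.0, 14.1))    # 14.1
```

For comparison, the table reports 11.7GB for GPT-J-6B in float16 and 21.5GB + 18GB = 39.5GB across both GPUs for GPT-Neo-X-20B in float16, which is close to these rough estimates.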
