turboderp 009424a6d4 Consolidate and tidy up wheel/release workflow		5 days ago
.github	Consolidate and tidy up wheel/release workflow	5 days ago

conversion	Allow quantizing models with max_seq_len < 2048	1 week ago

doc	typo	1 month ago

examples	Skip first forward pass when rewinding after banned string	6 days ago

exllamav2	Bump to v0.0.21	6 days ago

tests	Add Granite formatting to HumanEval test	1 week ago

util	Fix converting files with docker command	2 months ago

.gitignore	better gitignore	3 months ago

LICENSE	Add license	8 months ago

MANIFEST.in	Setuptools script	8 months ago

README.md	Fix installation step & add multi-GPU explanation	1 month ago

convert.py	Allow quantizing models with max_seq_len < 2048	1 week ago

model_diff.py	Update model_diff.py to use new attn params	4 months ago

requirements.txt	Add wheel and setuptools to requirements to fix potential issues with torch 2.2.4	1 month ago

setup.py	Add C++ function for partial string matching in generator	6 days ago

test_inference.py	Better sampling settings for test gen	1 week ago

README.md

ExLlamaV2

ExLlamaV2

ExLlamaV2 is an inference library for running local LLMs on modern consumer GPUs.

Overview of differences compared to V1

Faster, better kernels
Cleaner and more versatile codebase
Support for a new quant format (see below)

Performance

Some quick tests to compare performance with V1. There may be more performance optimizations in the future, and
speeds will vary across GPUs, with slow CPUs still being a potential bottleneck:

Model	Mode	Size	grpsz	act	3090Ti	4090
Llama	GPTQ	7B	128	no	177 t/s	198 t/s
Llama	GPTQ	13B	128	no	109 t/s	111 t/s
Llama	GPTQ	33B	128	yes	44 t/s	48 t/s
OpenLlama	GPTQ	3B	128	yes	252 t/s	283 t/s
CodeLlama	EXL2 4.0 bpw	34B	-	-	44 t/s	50 t/s
Llama2	EXL2 3.0 bpw	7B	-	-	211 t/s	245 t/s
Llama2	EXL2 4.0 bpw	7B	-	-	179 t/s	207 t/s
Llama2	EXL2 5.0 bpw	7B	-	-	159 t/s	170 t/s
Llama2	EXL2 2.5 bpw	70B	-	-	33 t/s	37 t/s
TinyLlama	EXL2 3.0 bpw	1.1B	-	-	623 t/s	730 t/s
TinyLlama	EXL2 4.0 bpw	1.1B	-	-	560 t/s	643 t/s

How to

To install from the repo you'll need the CUDA Toolkit and either gcc on Linux or (Build Tools for) Visual Studio
on Windows). Also make sure you have an appropriate version of PyTorch,
then run:

git clone https://github.com/turboderp/exllamav2
cd exllamav2
# Optionally, create and activate a new conda environment
pip install -r requirements.txt
pip install .

python test_inference.py -m <path_to_model> -p "Once upon a time,"
# Append the '--gpu_split auto' flag for multi-GPU inference

A simple console chatbot is included. Run it with:

python examples/chat.py -m <path_to_model> -mode llama
# Append the '--gpu_split auto' flag for multi-GPU inference

The -mode argument chooses the prompt format to use. llama is for the Llama(2)-chat finetunes, while codellama
probably works better for CodeLlama-instruct. raw will produce a simple chatlog-style chat that works with base
models and various other finetunes. Run with -modes for a list of all available prompt formats. You can also provide
a custom system prompt with -sp.

Integration and APIs

TabbyAPI is a FastAPI-based server that provides an OpenAI-style web API
compatible with SillyTavern and other frontends.
ExUI is a simple, standalone single-user web UI that serves an ExLlamaV2 instance
directly with chat and notebook modes.
text-generation-webui supports ExLlamaV2 through the exllamav2
and exllamav2_HF loaders.
lollms-webui supports ExLlamaV2 through the exllamav2 binding.

Installation

Method 1: Install from source

To install the current dev version, clone the repo and run the setup script:

git clone https://github.com/turboderp/exllamav2
cd exllamav2
pip install -r requirements.txt
pip install .

By default this will also compile and install the Torch C++ extension (exllamav2_ext) that the library relies on.
You can skip this step by setting the EXLLAMA_NOCOMPILE environment variable:

EXLLAMA_NOCOMPILE= pip install .

This will install the "JIT version" of the package, i.e. it will install the Python components without building the
C++ extension in the process. Instead, the extension will be built the first time the library is used, then cached in
~/.cache/torch_extensions for subsequent use.

Method 2: Install from release (with prebuilt extension)

Releases are available here, with prebuilt wheels that contain the
extension binaries. Make sure to grab the right version, matching your platform, Python version (cp) and CUDA version.
Either download an appropriate wheel or install directly from the appropriate URL:

pip install https://github.com/turboderp/exllamav2/releases/download/v0.0.12/exllamav2-0.0.12+cu121-cp311-cp311-linux_x86_64.whl

The py3-none-any.whl version is the JIT version which will build the extension on first launch. The .tar.gz file
can also be installed this way, and it will build the extension while installing.

Method 3: Install from PyPI

A PyPI package is available as well. It can be installed with:

pip install exllamav2

The version available through PyPI is the JIT version (see above). Still working on a solution for distributing
prebuilt wheels via PyPI.

EXL2 quantization

ExLlamaV2 supports the same 4-bit GPTQ models as V1, but also a new "EXL2" format. EXL2 is based on the same
optimization method as GPTQ and supports 2, 3, 4, 5, 6 and 8-bit quantization. The format allows for mixing quantization
levels within a model to achieve any average bitrate between 2 and 8 bits per weight.

Moreover, it's possible to apply multiple quantization levels to each linear layer, producing something akin to sparse
quantization wherein more important weights (columns) are quantized with more bits. The same remapping trick that lets
ExLlama work efficiently with act-order models allows this mixing of formats to happen with little to no impact on
performance.

Parameter selection is done automatically by quantizing each matrix multiple times, measuring the quantization
error (with respect to the chosen calibration data) for each of a number of possible settings, per layer. Finally, a
combination is chosen that minimizes the maximum quantization error over the entire model while meeting a target
average bitrate.

In my tests, this scheme allows Llama2 70B to run on a single 24 GB GPU with a 2048-token context, producing coherent
and mostly stable output with 2.55 bits per weight. 13B models run at 2.65 bits within 8 GB of VRAM, although currently
none of them uses GQA which effectively limits the context size to 2048. In either case it's unlikely that the model
will fit alongside a desktop environment. For now.

Conversion

A script is provided to quantize models. Converting large models can be somewhat slow, so be warned. The conversion
script and its options are explained in detail here

Community

A test community is provided at https://discord.gg/NSFwVuCjRq
Quanting service free of charge is provided at #bot test. The computation is generiously provided by the Bloke powered by Lambda labs.

HuggingFace repos

I've uploaded a few EXL2-quantized models to Hugging Face to play around with, here.
LoneStriker provides a large number of EXL2 models on Hugging Face.
bartowski has some more EXL2 models on HF.

No Description

Python Text Cuda C++ C

11859846+turboderp@users.noreply.github.com turboderp@users.noreply.github.com 113042016+awtrisk@users.noreply.github.com 3887729+jllllll@users.noreply.github.com akkoyunsinan2@gmx.de alpindale@gmail.com net4orion@163.com 32474602+Lyrcaxis@users.noreply.github.com zgce@163.com 49133878+deltaguo@users.noreply.github.com aloui.seifeddine@gmail.com shuriken209master@googlemail.com floer.learner.01@gmail.com 85707358+Kerushii@users.noreply.github.com 44341163+AAbushady@users.noreply.github.com ben@unifiedlearning.ai ivan.sanchez@zyte.com min.xu.public@gmail.com public4orion@163.com 49538361+silphendio@users.noreply.github.com seanl@literati.org 134447697+ardfork@users.noreply.github.com bdashore3@proton.me kenan@sly.mn eng.eramax@gmail.com

How to access data resources in code