InfMoE is an inference framework for MoE-based models, built on a TensorRT custom plugin named `MoELayerPlugin` (including a Python binding) that can run inference of MoE layers with any sub-layer on NVIDIA GPUs with minimal memory consumption. InfMoE is open-sourced under the MIT License.
Dependencies: CUDA, cuDNN and TensorRT (expert weights are loaded from `npz` files).

To use TensorRT in Python, you need to first install `nvidia-tensorrt` (see https://docs.nvidia.com/deeplearning/tensorrt/install-guide/index.html#installing-pip). You can simply run `python3 -m pip install -r requirements.txt`.
Note: if you install `nvidia-tensorrt` from PyPI (rather than from a downloaded TensorRT package), you MUST ensure that the TensorRT version `MoELayerPlugin` links to matches the version the pip package uses (see `site-packages/tensorrt/`). Otherwise the plugin will not work correctly.
Then build this plugin:

```shell
cd python

# if you have cuDNN & TensorRT installed in search path, or
python3 setup.py build_ext

# if you need to specify CUDA / cuDNN install location
# (CUDA can only be automatically searched by meson)
python3 setup.py build_ext --tensorrt-prefix=/path/to/tensorrt --cudnn-prefix=/path/to/cudnn

python3 setup.py install
```
You can also use `bdist_wheel` or other commands provided by `setuptools`. You can pass `--debug` to `build_ext` to enable verbose logging and keep the symbols for debugging purposes.
```shell
cd plugin

# if you have cuDNN & TensorRT installed in search path
make builddir && make compile

# if you need to specify CUDA / cuDNN install location
# (CUDA can only be automatically searched by meson)
meson setup builddir -DWITH_TENSORRT=/path/to/tensorrt -DWITH_CUDNN=/path/to/cudnn
ninja -C builddir  # or just run `make`
```
If everything goes well, you can find `libtrtmoelayer.so` in `builddir`. Similarly, you can pass `-DDEBUG=true` to `meson setup` for debugging.
When initializing `MoELayerPlugin` in TensorRT (either C++ or Python), the following attributes must be specified:

- `expert_count`: INT32, number of experts (sub-layers)
- `embedding_size`: INT32, the input & output size of the expert network
- `hidden_size`: INT32, the intermediate size of the feed-forward network (might not be used by the sub-layer)
- `max_concurrency`: INT32, maximum number of experts kept concurrently in GPU memory (defaults to 2); setting it too large will lead to OOM
- `expert_centroids`: FLOAT32 array, weights for dispatching tokens to experts; must be of shape `(d_model, expert_count)`, where `d_model` is the last dimension of the input tensor (a.k.a. embedding size)
- `expert_weight_file`: null-terminated CHAR array, path to the expert weight file, read by the implementation of the sub-layer
- `expert_sublayer_type`: null-terminated CHAR array, type of sub-layer used; currently only `T5_FF` can be used
- `moe_variant`: null-terminated CHAR array, variant type of the MoE layer, used to decide different behaviours (can be `cpm_2`, `base_layer` or `default`)
- `layernorm_weight`: FLOAT32 array, weight of the layer norm applied to the input before calculating expert affiliation / score; must be provided when `moe_variant` is `cpm_2`
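For intuition, the dispatch rule implied by `expert_centroids` can be sketched in NumPy: each token's hidden state is scored against the centroid matrix and routed to the highest-scoring expert. This is a hypothetical host-side illustration, not the plugin's actual CUDA kernel:

```python
import numpy as np

def dispatch_tokens(hs, centroids):
    """Route each token to the expert with the highest affinity score.

    hs:        (num_tokens, d_model) input activations
    centroids: (d_model, expert_count) dispatch weights
    returns:   (num_tokens,) chosen expert index per token
    """
    scores = hs @ centroids  # (num_tokens, expert_count)
    return np.argmax(scores, axis=1)

# toy example: 4 tokens, d_model = 2, 3 experts
hs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]], dtype=np.float32)
centroids = np.array([[1.0, 0.0, -1.0],
                      [0.0, 1.0,  0.0]], dtype=np.float32)
print(dispatch_tokens(hs, centroids))
```

Note that when `moe_variant` is `cpm_2`, the plugin applies the layer norm given by `layernorm_weight` to the input before computing these scores.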
Currently InfMoE can only handle MoE layers with FP32 parameters, input & output. To run inference with a full network, you should slice it before and after any MoE layer:

- Export the non-MoE parts to `onnx` / UFF format and use TensorRT to parse them into a network (Python / C++), or construct the network manually with the TensorRT API (Python / C++).
- Create the MoE layers with `MoELayerPlugin` in Python or C++ (see examples).
- Concatenate the MoE / non-MoE layers to obtain the full network (or replace any specific 'placeholder' layer with a MoE layer), which can later be built into a TensorRT CUDA engine and used to run inference, or serialized & dumped to a file.
We provide several Python examples in `python/examples` showing how to do the aforementioned work. You can run them after installing this plugin. You are encouraged to read the TensorRT documentation to understand its workflow prior to using this plugin.
InfMoE requires that the tensors it operates on contain no `NaN` values. It will also check the shape and data type of all parameters and input & output tensors. If any misconfiguration is found, it prints an error message to stderr and aborts the whole process.
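When debugging such aborts, a quick host-side pre-check of your weights before handing them to the plugin can be sketched as follows (a hypothetical helper, not part of InfMoE):

```python
import numpy as np

def check_finite(name, arr):
    """Raise early if a tensor contains NaN values, mirroring InfMoE's validation."""
    if np.isnan(arr).any():
        raise ValueError(f"tensor {name!r} contains NaN values")
    return arr

# a well-formed centroid matrix passes the check unchanged
centroids = check_finite("expert_centroids", np.zeros((1024, 8), dtype=np.float32))
```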
See the CPM-2 paper for scheduling details; this part is to be ported to the public source code soon.
We have provided some sub-layers in `plugin/sublayers`. To implement your own sub-layer, you need to:

- subclass the `MoESubLayer` class;
- register it in `MoELayerPlugin.h` (in `sublayer_type`) and `MoELayerPlugin.cc` (in `MoELayerPlugin::createSublayer()`);
- add your implementation file (`.cpp` only) to `meson.build`.
T5 feed-forward layer (`T5_FF`)

This project includes a sub-layer implementation of the feed-forward layer in the T5 network. It is defined as:
```
hs := hs + dense_relu_dense(layer_norm(hs))
layer_norm(hs) := wl * hs / sqrt(mean(pow(hs, 2)) + eps)
dense_relu_dense(hs) := (gelu(hs @ wi_0^T) * (hs @ wi_1^T)) @ wo^T
```
where `wi_0`, `wi_1` and `wo` are linear layers with no bias, first projecting the input tensor to 4 times its size (in the last dimension) and then back.
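The formulas above can be sketched as a NumPy reference implementation (a hypothetical illustration of the math, not the plugin's CUDA kernel; the tanh GELU approximation is an assumption):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU (an assumption; the plugin may use another form)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def t5_ff(hs, wl, wi_0, wi_1, wo, eps=1e-6):
    """hs: (tokens, d_model); wl: (d_model,);
    wi_0, wi_1: (4*d_model, d_model); wo: (d_model, 4*d_model)."""
    # layer_norm(hs) := wl * hs / sqrt(mean(pow(hs, 2)) + eps)
    # (RMS-style norm: no mean subtraction, no bias)
    normed = wl * hs / np.sqrt(np.mean(hs ** 2, axis=-1, keepdims=True) + eps)
    # dense_relu_dense(hs) := (gelu(hs @ wi_0^T) * (hs @ wi_1^T)) @ wo^T
    inner = gelu(normed @ wi_0.T) * (normed @ wi_1.T)
    # residual connection: hs := hs + dense_relu_dense(layer_norm(hs))
    return hs + inner @ wo.T
```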
The given `expert_weight_file` must be a `npz` file containing the following variables (`n` varies from `0` to `expert_count - 1`): `n/layer_norm_weight`, `n/wi_0_weight`, `n/wi_1_weight`, `n/wo_weight`.
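A weight file in this layout can be produced with NumPy; the helper and shapes below are illustrative assumptions matching the `T5_FF` definitions above:

```python
import numpy as np

def save_expert_weights(path, expert_count, d_model, d_ff):
    """Pack per-expert weights into the 'n/<name>' npz layout described above."""
    arrays = {}
    for n in range(expert_count):
        arrays[f"{n}/layer_norm_weight"] = np.ones(d_model, dtype=np.float32)
        arrays[f"{n}/wi_0_weight"] = np.zeros((d_ff, d_model), dtype=np.float32)
        arrays[f"{n}/wi_1_weight"] = np.zeros((d_ff, d_model), dtype=np.float32)
        arrays[f"{n}/wo_weight"] = np.zeros((d_model, d_ff), dtype=np.float32)
    # slash-containing keys are legal npz archive member names
    np.savez(path, **arrays)

save_expert_weights("experts.npz", expert_count=2, d_model=4, d_ff=16)
loaded = np.load("experts.npz")
print(sorted(loaded.files))
```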
Identity layer (`Identity`)

This layer DOES NOTHING (and thus uses none of the provided plugin attributes); it just copies the input directly to the output. It is intended for debugging purposes only.