Repository layout:

    dataset/example/
    doc/
    scripts/
    src/
    README.md
    create_data.py
    eval.py
    export.py
    ma-pre-start.sh
    mindspore_hub_conf.py
    train.py
Model source: MindSpore:r1.1>Model_zoo>official>nlp>transformer
The hybrid parallel strategy is built on MindSpore's semi-automatic parallelism.
A test dataset ships with the project, so training can be launched without any code changes.
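As a rough illustration of the semi-automatic parallelism mentioned above, MindSpore's parallel mode is enabled through the global context before the network is built. The snippet below is a minimal sketch of that setup using standard MindSpore r1.1 APIs, not this project's exact code:

```python
# Minimal sketch: enable semi-automatic parallelism in MindSpore r1.1.
from mindspore import context
from mindspore.context import ParallelMode
from mindspore.communication.management import init, get_group_size

context.set_context(mode=context.GRAPH_MODE, device_target="Ascend")
init()  # initialize HCCL communication across Ascend devices
context.set_auto_parallel_context(
    parallel_mode=ParallelMode.SEMI_AUTO_PARALLEL,
    device_num=get_group_size(),   # total number of devices in the job
    gradients_mean=True)           # average gradients across devices
```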
MindSpore >= 1.1.1
Huawei Ascend 910
For launch scripts, refer to the usage instructions in MindSpore:r1.1>Model_zoo>official>nlp>transformer.
To train with pure data parallelism, set the following in train.py:
args.distribute = True (False for single-node training)
args.Hybrid_Parallel = False
and in ./src/config.py:
model_parallel = False
batchsize = 128 (128*device_num)
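For reference, the pure data-parallel setting above maps onto MindSpore's DATA_PARALLEL mode. The sketch below is illustrative; the variable names, and the reading of 128*device_num as the effective global batch, are our assumptions, not the project's config fields:

```python
# Minimal sketch of the pure data-parallel setting described above.
from mindspore import context
from mindspore.context import ParallelMode

device_num = 8                       # e.g. one node with 8 Ascend cards
context.set_auto_parallel_context(
    parallel_mode=ParallelMode.DATA_PARALLEL,
    device_num=device_num,
    gradients_mean=True)             # average gradients across devices

per_device_batch = 128               # batchsize in ./src/config.py
global_batch = per_device_batch * device_num  # matches the 128*device_num note
```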
To train with hybrid parallelism, set the following in train.py:
args.distribute = True
args.Hybrid_Parallel = True
and in ./src/config.py, where dp*mp must equal device_num:
model_parallel = True
dp = 2 (op-level data-parallel dimension)
mp = 2 (op-level model-parallel dimension)
batchsize = 1024 (1024*device_num)
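In semi-automatic parallel mode, dp and mp are consumed by op-level sharding strategies. The snippet below is a hedged sketch of MindSpore's generic shard API applied to a MatMul, not necessarily how this project annotates its operators:

```python
# Sketch of op-level sharding with dp/mp in semi-auto parallel mode.
# dp splits the input's batch dimension; mp splits the weight's output
# dimension; dp * mp must equal the total device count.
import mindspore.ops as ops

dp, mp = 2, 2
assert dp * mp == 4  # dp * mp == device_num (4 here, purely illustrative)

matmul = ops.MatMul()
# x: (batch, hidden) -> batch split dp ways
# w: (hidden, ffn)   -> ffn split mp ways
matmul.shard(((dp, 1), (1, mp)))
```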
Devices used | Operator sharding dimensions
---|---
1 node, 8 devices | dp=2, mp=4
2 nodes, 16 devices | dp=4, mp=4
4 nodes, 32 devices | dp=4, mp=8
8 nodes, 64 devices | dp=8, mp=8
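The table follows a single pattern, which a small hypothetical helper (the names are ours, not the project's) can capture while enforcing the dp*mp = device_num constraint:

```python
# Hypothetical helper encoding the recommended (dp, mp) splits above.
RECOMMENDED_SPLITS = {8: (2, 4), 16: (4, 4), 32: (4, 8), 64: (8, 8)}

def get_split(device_num):
    dp, mp = RECOMMENDED_SPLITS[device_num]
    assert dp * mp == device_num, "dp * mp must equal device_num"
    return dp, mp
```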
python train.py --distribute True --Hybrid_Parallel True
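Note that `--distribute True` passes the literal string "True" on the command line; MindSpore model_zoo scripts commonly convert such flags with ast.literal_eval. The sketch below shows that pattern; the project's train.py may differ in detail:

```python
# Sketch of parsing string-valued boolean flags like "--distribute True".
import argparse
import ast

parser = argparse.ArgumentParser(description="Transformer training")
parser.add_argument("--distribute", type=ast.literal_eval, default=False,
                    help="Run distributed training (True/False).")
parser.add_argument("--Hybrid_Parallel", type=ast.literal_eval, default=False,
                    help="Enable hybrid (data + model) parallelism.")
args = parser.parse_args()
```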
The goals of this project are to make model training more efficient (for models under 10B parameters), to support training of larger-scale models (>10B, 50B, 100B), and to build a representative model case for distributed hybrid parallelism.
License: Apache-2.0