PCL-陶恒韬 taoht

taoht closed issue PCL-Platform.Inte.../AISynergy#18

[bug]: 2.6B NPU two-node collaborative training "src/core/lib/iomgr/ev_epollex_linux.cc","file_line":321,"referenced_errors":

1 week ago

taoht commented on issue PCL-Platform.Inte.../AISynergy#19

[Large-model application scenario validation] Performance comparison of 2.6B large-model collaborative training tasks under non-IID data

## Standalone: evaluation of local independent training results

### 1. PanGu Alpha 2.6B fine-tuned on the CMRC2017 dataset, 3 epochs

| PanGu 2.6B model | Metric | cmrc2017 | PD |
| -------- | -------- | -------- | -------- |
| cmrc2017 finetune | Acc | 66.62 % | 57.57 % |

1 week ago

taoht opened issue PCL-Platform.Inte.../AISynergy#19

[Large-model application scenario validation] Performance comparison of 2.6B large-model collaborative training tasks under non-IID data

1 week ago

taoht commented on issue PCL-Platform.Inte.../AISynergy#18

[bug]: 2.6B NPU two-node collaborative training "src/core/lib/iomgr/ev_epollex_linux.cc","file_line":321,"referenced_errors":

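### [Bug fix and comparison test] --> Resolved --> native bug in the underlying gRPC layer

Force grpc==1.46.0 when building AIsyncore:

1. Rebuild and install Aisyncore (grpcio==1.46.0)
2. Start collaborative training on the 2 NPU nodes
   * As the image below shows, training runs normally

![image](/attachments/560584ec-c26d-42a3-b008-8906c8bc5399)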
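Since the fix above depends on one specific grpcio version, it can help to check the environment before launching training. The following is only a minimal sketch: `parse_version`, `is_pinned`, and `installed_grpcio` are hypothetical helper names and are not part of AIsyncore or flwr. The only library call used is the standard `importlib.metadata.version`.

```python
from importlib import metadata

# Version that the comment above reports as working.
PINNED = "1.46.0"


def parse_version(text):
    """Turn '1.46.0' into (1, 46, 0), keeping only leading digits of each part."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def is_pinned(installed, pinned=PINNED):
    """True when the installed version matches the pinned one."""
    return parse_version(installed) == parse_version(pinned)


def installed_grpcio():
    """Return the installed grpcio version string, or None if it is absent."""
    try:
        return metadata.version("grpcio")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_grpcio()
    if found is None:
        print("grpcio is not installed")
    elif is_pinned(found):
        print("grpcio", found, "matches the pinned version")
    else:
        print("grpcio", found, "differs from the pinned version", PINNED)
```

Running the script on each node before starting the two-node job shows quickly whether a node still carries an older grpcio.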

2 weeks ago

taoht commented on issue PCL-Platform.Inte.../AISynergy#18

[bug]: 2.6B NPU two-node collaborative training "src/core/lib/iomgr/ev_epollex_linux.cc","file_line":321,"referenced_errors":

The latest versions of AIsyncore and flwr both require grpcio<=1.43.0, so there is no way to test whether grpc>=1.46.0 fixes or avoids this bug.

2 weeks ago

taoht opened issue PCL-Platform.Inte.../AISynergy#18

[bug]: 2.6B NPU two-node collaborative training "src/core/lib/iomgr/ev_epollex_linux.cc","file_line":321,"referenced_errors":

2 weeks ago

taoht closed issue PCL-Platform.Inte.../AISynergy#16

Bug: UnboundLocalError: local variable 'compression' referenced before assignment

3 weeks ago

taoht commented on issue PCL-Platform.Inte.../AISynergy#16

Bug: UnboundLocalError: local variable 'compression' referenced before assignment

The master branch does not have this problem. Users should stay on master and not use the other development branches.

3 weeks ago

taoht opened issue PCL-Platform.Inte.../AISynergy#16

Bug: UnboundLocalError: local variable 'compression' referenced before assignment

3 weeks ago

taoht closed issue PCL-Platform.Inte.../mPanGu-Alpha-53#1

GPU inference in a Docker container raises AttributeError: The 'VocabEmbedding' object has no attribute 'compile_cache'

4 weeks ago

taoht commented on issue PCL-Platform.Inte.../mPanGu-Alpha-53#1

GPU inference in a Docker container raises AttributeError: The 'VocabEmbedding' object has no attribute 'compile_cache'

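Follow, like, and give it the full triple ![image](/attachments/7a958389-7190-4bb9-b700-b87264288717) If you have any questions, feel free to ask at any time. Once I see them and have a suitable environment available, I will investigate and reply promptly. Thank you for your understanding.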

4 weeks ago

taoht pushed to master at PCL-Platform.Inte.../mPanGu-Alpha-53

4 weeks ago

taoht commented on issue PCL-Platform.Inte.../mPanGu-Alpha-53#1

GPU inference in a Docker container raises AttributeError: The 'VocabEmbedding' object has no attribute 'compile_cache'

**In testing, one card is enough for inference (single-card inference is recommended) and the efficiency is about the same. Either change the configuration so that `args_opt.distribute == "false"`, or launch with the following command:**

```
mpirun --allow-run-as-root \
  -x PATH \
  -x LD_LIBRARY_PATH \
  -x PYTHONPATH \
  -x NCCL_DEBUG \
  -x GLOG_v \
  -n 1 \
  --hostfile hostfile_1gpus \
  --output-filename log_output \
  --merge-stderr-to-stdout \
  python -s /path/to/predict.py \
  --mode 2.6B \
  --run_type predict \
  --distribute false \
  --language_idx $LANGUAGE_IDX \
  --op_level_model_parallel_num 1 \
  --load_ckpt_path /path/to/ckpt_path/ \
  --load_ckpt_name /ckpt_name \
  --param_init_type "fp16"
```

For distributed inference on multiple cards, add the `load_ckpt_path` setting (the run strategy file) at line 110 of predict.py:

```
config = PanguAlphaConfig(
    load_ckpt_path="/path/to/ckpt_strategy_exp4.ckpt"
)
```
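The comparison `args_opt.distribute == "false"` suggests that the flag is passed as a string and not as a boolean. The sketch below illustrates that pattern with `argparse`. It is not the actual option handling in predict.py: `build_parser`, `use_distributed`, and the default value `"true"` are assumptions made for illustration only.

```python
import argparse


def build_parser():
    """Hypothetical stand-in for the predict.py options used in the command above."""
    parser = argparse.ArgumentParser(description="illustrative option parsing")
    # Assumption: the flag is a string restricted to "true" or "false".
    parser.add_argument("--distribute", type=str, default="true",
                        choices=["true", "false"])
    parser.add_argument("--op_level_model_parallel_num", type=int, default=1)
    return parser


def use_distributed(args_opt):
    """Compare against the string; bool("false") would be True in Python."""
    return args_opt.distribute == "true"


if __name__ == "__main__":
    args_opt = build_parser().parse_args(["--distribute", "false"])
    print("distributed inference:", use_distributed(args_opt))
```

With a string flag like this, passing `--distribute false` turns distributed mode off only because the code compares strings explicitly. A plain truthiness check on the value would treat `"false"` as true.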

4 weeks ago

taoht pushed to master at PCL-Platform.Inte.../mPanGu-Alpha-53

4 weeks ago

taoht commented on issue PCL-Platform.Inte.../mPanGu-Alpha-53#1

GPU inference in a Docker container raises AttributeError: The 'VocabEmbedding' object has no attribute 'compile_cache'

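Try modifying predict.py as shown in the image first. If that does not work, please paste the complete log file here so that I can help troubleshoot. Thanks. ![image](/attachments/70b2a2ed-9b64-4239-b976-79d625bf68bf)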

4 weeks ago

taoht uploaded dataset wiki_mindrecord_test_gpu.zip

1 month ago

taoht pushed to npu at PCL-Platform.Inte.../mPanGu-Alpha-53

1 month ago