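【Runtime environment】
PanGu 2.6B
Server: Cloud Brain 1 (206.136), sharded aggregation over 8 ports
Client1: Cloud Brain 2, NPU 8 cards/16 cards, sharded aggregation over 8 ports
Client2: Cloud Brain 2, NPU 8 cards/16 cards, sharded aggregation over 8 ports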
【Error message】 Reproduces consistently
The tcp_posix.cc and ev_epollex_linux.cc files named in the error are both gRPC source files.
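【Comparison tests】
1. Both nodes use the same dataset (to rule out dataset differences)
2. Compare AIsynergy against flwr (to rule out bugs introduced by later development)
3. Compare pretraining from initialization against fine-tuning (to rule out effects of the training mode)
All three comparison tests above fully reproduce the same error with two nodes: at the second parameter aggregation, one of the client nodes reports the error above
4. Compare single-node training on 16 cards against 8 cards: a single node again reports the error
Difference: the error occurs at the 4th parameter aggregation (rather than the second), when one of the client nodes reports the error above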
【Error analysis】
https://github.com/grpc/grpc/blob/master/src/core/lib/iomgr/tcp_posix.cc
https://github.com/facebookresearch/CompilerGym/issues/572
https://github.com/grpc/grpc/issues/22508
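The file src/core/lib/iomgr/ev_epollex_linux.cc was removed in versions released after March 2022 (1.46.0 and later).
The latest versions of both AIsyncore and flwr require grpcio<=1.43.0, so there is no direct way to test whether grpc>=1.46.0 fixes or avoids this bug.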
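The version conflict can be sketched as follows. This is a minimal illustration, assuming plain "X.Y.Z" version strings; the helper names (parse_version, allowed_by_pin, epollex_removed) are hypothetical and not part of AIsyncore, flwr, or gRPC. It only shows that no release can satisfy both the grpcio<=1.43.0 pin and the 1.46.0 threshold at which ev_epollex_linux.cc was removed.

```python
# Sketch only: a simplified comparison for plain "X.Y.Z" versions,
# not a full PEP 440 implementation. All helper names are hypothetical.

def parse_version(text):
    """Turn '1.46.0' into (1, 46, 0) so versions compare as tuples."""
    return tuple(int(part) for part in text.split("."))

# Values taken from the analysis above.
PINNED_UPPER_BOUND = parse_version("1.43.0")  # grpcio<=1.43.0
EPOLLEX_REMOVED_IN = parse_version("1.46.0")  # ev_epollex_linux.cc removed from here on

def allowed_by_pin(version):
    """True if the version satisfies the grpcio<=1.43.0 requirement."""
    return parse_version(version) <= PINNED_UPPER_BOUND

def epollex_removed(version):
    """True if the version no longer ships ev_epollex_linux.cc."""
    return parse_version(version) >= EPOLLEX_REMOVED_IN

for candidate in ("1.43.0", "1.46.0"):
    print(candidate, allowed_by_pin(candidate), epollex_removed(candidate))
```

Because the two conditions never hold for the same version, the pin has to be lifted before the newer gRPC release can be tried.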
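【Bug fix and comparison test】-->Resolved-->Native bug in the underlying gRPC layer
Force grpc==1.46.0 when building AIsyncore:
1. Recompile and install AIsyncore (grpcio==1.46.0)
2. Start collaborative training on 2 NPU nodes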
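Before starting the training run, it may help to confirm that the rebuilt environment really picked up the forced gRPC version. The following is a minimal sketch, assuming the package is installed under the distribution name grpcio; the function names are hypothetical and not part of AIsyncore.

```python
from importlib import metadata

def installed_grpcio_version():
    """Return the installed grpcio version string, or None if it is absent."""
    try:
        return metadata.version("grpcio")
    except metadata.PackageNotFoundError:
        return None

def matches_forced_version(version, expected="1.46.0"):
    """True only when the version string equals the forced release exactly."""
    return version is not None and version == expected

if __name__ == "__main__":
    found = installed_grpcio_version()
    if matches_forced_version(found):
        print("grpcio", found, "matches the forced release")
    else:
        print("unexpected grpcio version:", found)
```

Running this on each node before step 2 is one way to rule out a stale grpcio<=1.43.0 install as the cause if the error reappears.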