#18 [bug]: 2.6B NPU two-node collaborative training fails with "src/core/lib/iomgr/ev_epollex_linux.cc","file_line":321,"referenced_errors":

Closed
created 1 year ago by taoht · 2 comments
taoht commented 1 year ago
### [Environment]

**PanGu 2.6B**
**Server**: 云脑1 (Cloud Brain 1): 206.136, 8-port sharded aggregation
**Client1**: 云脑2 (Cloud Brain 2): NPU, 8 cards/16 cards, 8-port sharded aggregation
**Client2**: 云脑2 (Cloud Brain 2): NPU, 8 cards/16 cards, 8-port sharded aggregation

### [Error message] Reproduces consistently

* **During the 2nd parameter aggregation, the server node receives model parameters from only one node; the other node's upload is reported as failed:**

```python
fit_round: strategy sampled 2 clients (out of 2)
fit_round received 2 results and 0 failures
evaluate_round: strategy sampled 2 clients (out of 2)
evaluate_round received 2 results and 0 failures
fit_round: strategy sampled 2 clients (out of 2)
fit_round received 2 results and 0 failures
evaluate_round: strategy sampled 2 clients (out of 2)
evaluate_round received 2 results and 0 failures
fit_round: strategy sampled 2 clients (out of 2)
fit_round received 1 results and 1 failures
```

* **The failing client node reports the following error:**

```python
E0615 23:39:39.792005736 1690 tcp_posix.cc:466] backup_poller:pollset_work:
{"created":"@1655307579.791889456",
 "description":"pollset_work",
 "file":"src/core/lib/iomgr/ev_epollex_linux.cc",
 "file_line":321,"referenced_errors":
 [{"created":"@1655307579.791888196",
   "description":"Bad file descriptor",
   "errno":9,
   "file":"src/core/lib/iomgr/ev_epollex_linux.cc",
   "file_line":950,
   "os_error":"Bad file descriptor",
   "syscall":"epoll_wait"}]
}
```

Both tcp_posix.cc and ev_epollex_linux.cc are gRPC source files.

### [Comparison tests]

**1. Both nodes use the same dataset (rules out dataset differences)**
**2. Compare AIsynergy with flwr (rules out a bug introduced by later development)**
**3. Compare pretraining from initialization with fine-tuning (rules out effects of the training mode)**

All three comparison tests above fully reproduce the same two-node error: during the second parameter aggregation, one of the client nodes reports the error shown above.

**4. Single node, 16-card training compared with 8-card training: the single node reports the same error**

Difference: the error appears during the 4th parameter aggregation (rather than the second), when one of the client nodes reports the error shown above.

![image](/attachments/42e53a12-455f-4f8f-a73b-1e6e98ca0ca1)

### [Error analysis]

https://github.com/grpc/grpc/blob/master/src/core/lib/iomgr/tcp_posix.cc
https://github.com/facebookresearch/CompilerGym/issues/572
https://github.com/grpc/grpc/issues/22508

```c
// tcp_posix.cc:466] backup_poller:pollset_work: -> line 464, in run_poller()
GRPC_LOG_IF_ERROR(
    "backup_poller:pollset_work",
    grpc_pollset_work(BACKUP_POLLER_POLLSET(p), nullptr, deadline));
```

The file src/core/lib/iomgr/ev_epollex_linux.cc was removed in releases after March 2022 (1.46.0 and later).
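
The client-side error payload in this report is plain JSON, so the nested cause can be pulled out programmatically instead of read by eye. The following is only a minimal sketch: `PAYLOAD` is copied from the log above (the part after `backup_poller:pollset_work:`), and `root_causes` is a hypothetical helper name, not part of gRPC, flwr, or AIsyncore.

```python
import json

# Payload copied from the client log in this report.
PAYLOAD = (
    '{"created":"@1655307579.791889456", "description":"pollset_work", '
    '"file":"src/core/lib/iomgr/ev_epollex_linux.cc", "file_line":321,'
    '"referenced_errors": [{"created":"@1655307579.791888196", '
    '"description":"Bad file descriptor", "errno":9, '
    '"file":"src/core/lib/iomgr/ev_epollex_linux.cc", "file_line":950, '
    '"os_error":"Bad file descriptor", "syscall":"epoll_wait"}] }'
)


def root_causes(payload):
    """Return (syscall, errno, os_error, file, line) for each nested error.

    Hypothetical helper for illustration; it only assumes the field names
    visible in the log above.
    """
    outer = json.loads(payload)
    return [
        (
            err.get("syscall"),
            err.get("errno"),
            err.get("os_error"),
            err.get("file"),
            err.get("file_line"),
        )
        for err in outer.get("referenced_errors", [])
    ]


for cause in root_causes(PAYLOAD):
    print(cause)
```

For this payload the single nested error is `epoll_wait` failing with errno 9 (`EBADF`, "Bad file descriptor") at line 950 of ev_epollex_linux.cc, which is the detail that points at the gRPC polling engine rather than at the training code.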
taoht commented 1 year ago
Owner
The latest versions of both AIsyncore and flwr require grpcio<=1.43.0, so it is not possible to test whether grpc>=1.46.0 fixes or avoids this bug.
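
The conflict described in this comment can be made explicit with a small check. This is a sketch under stated assumptions: `parse_version`, `allowed_by_pin`, and `is_target` are hypothetical helpers, the parser handles only plain `X.Y.Z` versions, and the candidate list is illustrative rather than a full list of grpcio releases.

```python
def parse_version(text):
    """Turn '1.43.0' into (1, 43, 0). Plain numeric versions only."""
    return tuple(int(part) for part in text.split("."))


PIN_UPPER_BOUND = parse_version("1.43.0")  # from the grpcio<=1.43.0 requirement
FIRST_TARGET = parse_version("1.46.0")     # first release without ev_epollex_linux.cc


def allowed_by_pin(version):
    """True if the version satisfies grpcio<=1.43.0."""
    return parse_version(version) <= PIN_UPPER_BOUND


def is_target(version):
    """True if the version is 1.46.0 or later."""
    return parse_version(version) >= FIRST_TARGET


# Illustrative candidates; not an exhaustive release list.
candidates = ["1.43.0", "1.44.0", "1.45.0", "1.46.0"]
testable = [v for v in candidates if allowed_by_pin(v) and is_target(v)]
print(testable)
```

No version can satisfy both conditions, so `testable` is empty: testing 1.46.0 requires relaxing the pin in the package itself.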
taoht commented 1 year ago
Owner
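### [Bug fix and comparison test] --> Resolved --> native bug in the underlying gRPC layer

Force grpc==1.46.0 when building AIsyncore:

1. Rebuild and reinstall AIsyncore (grpcio==1.46.0)
2. Start collaborative training on the 2 NPU nodes

* As the screenshot below shows, training now proceeds normally

![image](/attachments/560584ec-c26d-42a3-b008-8906c8bc5399)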
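
Besides the screenshot, the fix can be checked from the server log by scanning for rounds that report failures. The sketch below assumes the log line format shown in the original report (`fit_round received N results and M failures`). `failed_rounds` is a hypothetical helper, and `SAMPLE_LOG` is the failing excerpt from the report, used here only to show what a regression would look like.

```python
import re

# Matches the "received" summary lines shown in the original report.
ROUND_RE = re.compile(
    r"(fit_round|evaluate_round) received (\d+) results and (\d+) failures"
)


def failed_rounds(log_text):
    """Return (kind, index, results, failures) for every round with failures.

    The index counts summary lines of each kind in the given text, which may
    not match the round numbering used elsewhere in the report.
    """
    counters = {}
    failures = []
    for match in ROUND_RE.finditer(log_text):
        kind = match.group(1)
        results, failed = int(match.group(2)), int(match.group(3))
        counters[kind] = counters.get(kind, 0) + 1
        if failed > 0:
            failures.append((kind, counters[kind], results, failed))
    return failures


SAMPLE_LOG = """\
fit_round: strategy sampled 2 clients (out of 2)
fit_round received 2 results and 0 failures
evaluate_round: strategy sampled 2 clients (out of 2)
evaluate_round received 2 results and 0 failures
fit_round: strategy sampled 2 clients (out of 2)
fit_round received 2 results and 0 failures
evaluate_round: strategy sampled 2 clients (out of 2)
evaluate_round received 2 results and 0 failures
fit_round: strategy sampled 2 clients (out of 2)
fit_round received 1 results and 1 failures
"""

print(failed_rounds(SAMPLE_LOG))
```

A log from a healthy run with grpcio==1.46.0 should yield an empty list.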
taoht closed this issue 1 year ago