#18 [bug]: 2.6B NPU two-node collaborative training fails with "src/core/lib/iomgr/ev_epollex_linux.cc","file_line":321,"referenced_errors":

Closed
created 1 month ago by taoht · 2 comments
taoht commented 1 month ago

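[Environment]

PanGu 2.6B
Server: Cloud Brain 1 (云脑1): 206.136, sharded aggregation over 8 ports
Client1: Cloud Brain 2 (云脑2): NPU, 8 or 16 cards, sharded aggregation over 8 ports
Client2: Cloud Brain 2 (云脑2): NPU, 8 or 16 cards, sharded aggregation over 8 ports

[Error message] Reproduces reliably

  • On the 2nd parameter aggregation, the server node receives model parameters from only one node; the other node's upload is reported as failed: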
fit_round: strategy sampled 2 clients (out of 2)
fit_round received 2 results and 0 failures
evaluate_round: strategy sampled 2 clients (out of 2)
evaluate_round received 2 results and 0 failures
fit_round: strategy sampled 2 clients (out of 2)
fit_round received 2 results and 0 failures
evaluate_round: strategy sampled 2 clients (out of 2)
evaluate_round received 2 results and 0 failures
fit_round: strategy sampled 2 clients (out of 2)
fit_round received 1 results and 1 failures

  • The failing client node reports the following error:
E0615 23:39:39.792005736 1690 tcp_posix.cc:466] backup_poller:pollset_work: 
{"created":"@1655307579.791889456",
"description":"pollset_work",
"file":"src/core/lib/iomgr/ev_epollex_linux.cc",
"file_line":321,"referenced_errors":
	[{"created":"@1655307579.791888196",
      "description":"Bad file descriptor",
      "errno":9,
      "file":"src/core/lib/iomgr/ev_epollex_linux.cc",
      "file_line":950,
      "os_error":"Bad file descriptor",
      "syscall":"epoll_wait"}]
} 

Both tcp_posix.cc and ev_epollex_linux.cc are gRPC source files.
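
The nested error object in the client log can be unpacked mechanically to find the entry that names the failing system call. The following is a minimal sketch, not part of the original report: the helper names `flatten_errors` and `root_causes` are hypothetical, and the payload is the text that follows `pollset_work:` in the log above.

```python
import json

# Error payload copied from the client log (the text after "pollset_work: ").
RAW_ERROR = """
{"created":"@1655307579.791889456",
 "description":"pollset_work",
 "file":"src/core/lib/iomgr/ev_epollex_linux.cc",
 "file_line":321,"referenced_errors":
   [{"created":"@1655307579.791888196",
     "description":"Bad file descriptor",
     "errno":9,
     "file":"src/core/lib/iomgr/ev_epollex_linux.cc",
     "file_line":950,
     "os_error":"Bad file descriptor",
     "syscall":"epoll_wait"}]
}
"""


def flatten_errors(error):
    """Yield the error and every entry nested under "referenced_errors", depth-first."""
    yield error
    for child in error.get("referenced_errors", []):
        yield from flatten_errors(child)


def root_causes(raw):
    """Return (syscall, errno, os_error, file, line) for each entry that names a syscall."""
    top = json.loads(raw)
    return [
        (e["syscall"], e["errno"], e["os_error"], e["file"], e["file_line"])
        for e in flatten_errors(top)
        if "syscall" in e
    ]


if __name__ == "__main__":
    for cause in root_causes(RAW_ERROR):
        print(cause)
```

For this payload the only entry with a `syscall` field is the `epoll_wait` call at line 950 of ev_epollex_linux.cc, failing with errno 9 (EBADF on Linux).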

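[Comparison tests]

1. Both nodes use the same dataset (to rule out dataset differences)
2. AIsynergy compared with flwr (to rule out a bug introduced by later development)
3. Pre-training from initialization compared with fine-tuning (to rule out an effect of the training mode)

All 3 comparison tests above fully reproduce the same two-node error: on the second parameter aggregation, one of the client nodes reports the error shown above.
4. Training with 16 cards on a single node compared with 8 cards: again a single node reports the error
Difference: the client node reports the error on the 4th parameter aggregation rather than the 2nd.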
image
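
To compare across runs at which round the failure first appears, the server log can be scanned for the first `received ... failures` line with a non-zero failure count. This is a rough sketch under assumptions: the function name `first_failed_round` is hypothetical, and the log format is taken only from the excerpt quoted in this report (real logs may carry timestamps or level prefixes, which the pattern tolerates).

```python
import re

# Server log lines as quoted in the report.
SERVER_LOG = """\
fit_round: strategy sampled 2 clients (out of 2)
fit_round received 2 results and 0 failures
evaluate_round: strategy sampled 2 clients (out of 2)
evaluate_round received 2 results and 0 failures
fit_round: strategy sampled 2 clients (out of 2)
fit_round received 2 results and 0 failures
evaluate_round: strategy sampled 2 clients (out of 2)
evaluate_round received 2 results and 0 failures
fit_round: strategy sampled 2 clients (out of 2)
fit_round received 1 results and 1 failures
"""

# Matches e.g. "fit_round received 1 results and 1 failures" anywhere in a line.
RECEIVED = re.compile(r"(fit|evaluate)_round received (\d+) results and (\d+) failures")


def first_failed_round(log_text, phase="fit"):
    """Return (index, results, failures) for the first `phase` round with failures.

    The index is 1-based and counts only "received" lines of the given phase.
    Returns None when no round of that phase reports a failure.
    """
    index = 0
    for line in log_text.splitlines():
        match = RECEIVED.search(line)
        if match is None or match.group(1) != phase:
            continue
        index += 1
        results, failures = int(match.group(2)), int(match.group(3))
        if failures > 0:
            return index, results, failures
    return None


if __name__ == "__main__":
    print(first_failed_round(SERVER_LOG))
```

On the quoted excerpt this reports the third `fit_round` line as the first one with a failure; how that index maps to the "2nd aggregation" wording in the report depends on how rounds are counted.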

[Error analysis]

https://github.com/grpc/grpc/blob/master/src/core/lib/iomgr/tcp_posix.cc

https://github.com/facebookresearch/CompilerGym/issues/572
https://github.com/grpc/grpc/issues/22508

tcp_posix.cc:466] backup_poller:pollset_work: 
-> line:464: run_poller()
  GRPC_LOG_IF_ERROR(
      "backup_poller:pollset_work",
      grpc_pollset_work(BACKUP_POLLER_POLLSET(p), nullptr, deadline));

The file src/core/lib/iomgr/ev_epollex_linux.cc has been removed from gRPC releases after March 2022 (1.46.0 and later).

taoht commented 1 month ago
Owner

The latest versions of AIsyncore/flwr both require grpcio<=1.43.0, so there is no way to test whether grpc>=1.46.0 fixes or avoids this bug.

taoht commented 1 month ago
Owner

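[Bug fix and comparison test] --> Resolved --> native bug in the underlying gRPC

Force grpc==1.46.0 when building AIsyncore:
1. Rebuild and reinstall AIsyncore (grpcio==1.46.0)
2. Start collaborative training on the 2 NPU nodes

  • As the screenshot below shows, training now proceeds normally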
    image
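
Before starting a node after the rebuild, it may help to confirm which grpcio version the environment actually resolved to. The following is a minimal sketch rather than part of the original fix: the helper names are hypothetical, and the 1.46.0 threshold is taken from this report, not from gRPC release notes.

```python
from importlib import metadata

# Version the report rebuilt against; treated here as the minimum acceptable one.
MIN_GRPCIO = (1, 46, 0)


def parse_version(text):
    """Turn '1.46.0' into (1, 46, 0); a non-numeric suffix such as 'rc1' is ignored."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def grpcio_is_new_enough(installed, minimum=MIN_GRPCIO):
    """Compare a version string against the minimum as integer tuples."""
    return parse_version(installed) >= minimum


def installed_grpcio_version():
    """Return the installed grpcio version string, or None if it is not installed."""
    try:
        return metadata.version("grpcio")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_grpcio_version()
    if version is None:
        print("grpcio is not installed")
    elif grpcio_is_new_enough(version):
        print(f"grpcio {version} is at or above 1.46.0")
    else:
        print(f"grpcio {version} is older than 1.46.0; the epoll_wait error may still occur")
```

Comparing integer tuples avoids the string-comparison pitfall where "1.9.0" would sort after "1.46.0".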
taoht closed this issue 1 month ago