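Problem description
Training ran normally for the first 5 epochs. After that, the model training toolkit (ma-training-toolkit) hit a network connection error while trying to send training event information to the ModelArts platform.
Environment (GPU/NPU)
NPU
Cluster (Qizhi/Zhisuan)
Zhisuan
Task type (debug/training/inference)
Training
Task name
relic202310021118257
Log description or screenshot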
time="2023-10-03T03:38:13+08:00" level=warning msg="report event TrainingExit failed: send training-event info to algorancher failed, err: Post "https://modelarts.cn-central-231.myhuaweicloud.com/v2/4ffab007f8bc439b965555aa4fe5d9bb/training-jobs/a8534e7a-49e8-47ff-899c-6c1a9ba0d192/tasks/worker-0/reports/training-event": dial tcp: lookup modelarts.cn-central-231.myhuaweicloud.com on 10.247.3.10:53: no such host" file="event.go:51" Command=bootstrap/run Component=ma-training-toolkit Platform=ModelArts-Service
time="2023-10-03T03:38:13+08:00" level=error msg="bootstrap is exiting with exit code -1" file="bootstrap.go:243" Command=bootstrap/run Component=ma-training-toolkit Platform=ModelArts-Service
time="2023-10-03T03:38:13+08:00" level=info msg="retCode -1 has been written to the retCode file /home/ma-user/modelarts/retCode" file="bootstrap.go:221" Command=bootstrap/run Component=ma-training-toolkit Platform=ModelArts-Service
time="2023-10-03T03:38:13+08:00" level=info msg="[sidecar] training is completed" Component=ShellScripts Platform=ModelArts-Service
time="2023-10-03T03:38:13+08:00" level=info msg="[sidecar] the reason for the failure of the training job is under analysis" Component=ShellScripts Platform=ModelArts-Service
time="2023-10-03T03:38:13+08:00" level=warning msg="the log-preview-size parameter exceeds the limit and will be set to the default value 5242880" file="cli.go:195" Command=analyze Component=ma-training-toolkit Platform=ModelArts-Service
time="2023-10-03T03:38:13+08:00" level=error msg="the value -1 of the variable training-return-code is invalid, error: the range of Linux return code is between 0~255" file="cli.go:60" Command=analyze Component=ma-training-toolkit Platform=ModelArts-Service
time="2023-10-03T03:38:13+08:00" level=info msg="[sidecar] stop toolkit_obs_upload_by_channels_pid = 53 by signal SIGTERM" Component=ShellScripts Platform=ModelArts-Service
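The `training-return-code` error in the log above says that -1 falls outside the 0~255 range of Linux return codes. As a minimal sketch of that range check (the helper names `is_valid_exit_status` and `as_exit_status` are hypothetical and are not part of ma-training-toolkit), the following Python shows why -1 is rejected and which value Linux would normally report for a process that exits with -1, assuming the usual truncation of the status to its low 8 bits:

```python
def is_valid_exit_status(return_code: int) -> bool:
    """Check that a return code lies in the 0~255 range quoted in the log."""
    return 0 <= return_code <= 255


def as_exit_status(return_code: int) -> int:
    """Keep only the low 8 bits, which is what the parent process sees on Linux."""
    return return_code & 0xFF


if __name__ == "__main__":
    code = -1  # value taken from the log line above
    print(is_valid_exit_status(code))  # False: -1 is outside 0~255
    print(as_exit_status(code))        # 255
```

This only explains the message itself. It says nothing about why bootstrap exited with -1 in the first place.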
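Expected solution or suggestion
I suspect that a network connection error occurred while the per-epoch loss logs printed by my startup file were being transmitted to the platform. I would appreciate a solution.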
The log error time="2023-10-03T03:38:13+08:00" level=error msg="the value -1 of the variable training-return-code is invalid, error: the range of Linux return code is between 0~255" file="cli.go:60" Command=analyze Component=ma-training-toolkit Platform=ModelArts-Service
is similar to issue #1134, which you can use as a reference. That error was caused by running out of memory.
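But the first few epochs trained normally, and this error only appeared after about 15 hours of running.
It was a memory problem, and it has now been resolved.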
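For anyone hitting the same symptom (normal training for several epochs, then a failure many hours in), one way to check for gradual memory growth is to log the peak memory of the training process once per epoch. The following is only a sketch: the function names are hypothetical, it uses Python's standard `resource` module, it assumes Linux (where `ru_maxrss` is reported in KiB), and it measures host memory of the Python process only, not NPU device memory.

```python
import resource


def peak_rss_mib() -> float:
    """Peak resident set size of this process in MiB (assumes Linux, KiB units)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0


def log_epoch_memory(epoch: int) -> str:
    """Print and return one log line with the peak memory seen so far."""
    line = f"epoch {epoch}: peak RSS {peak_rss_mib():.1f} MiB"
    print(line, flush=True)
    return line


if __name__ == "__main__":
    # Stand-in for a training loop; call log_epoch_memory at the end of each epoch.
    for epoch in range(3):
        log_epoch_memory(epoch)
```

If the reported value keeps rising from epoch to epoch, something in the loop is probably accumulating data (for example, a list of per-step results that is never cleared).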