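Problem description
After my test code ran without problems in a debug task, I created a training task, but it starts reporting errors as soon as the dataset has finished extracting, and I cannot see where the error actually occurs. I have modified the code more than ten times with no effect.
Environment (GPU/NPU)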
NPU
Cluster (OpenI/Intelligent Computing Network)
Intelligent Computing Network
Task type (debug/training/inference)
Training
Task name
relic202310061681552
Log description or screenshot of the problem
time="2023-10-06T17:52:45+08:00" level=warning msg="report event TrainingExit failed: send training-event info to algorancher failed, err: Post "https://modelarts.cn-central-231.myhuaweicloud.com/v2/4ffab007f8bc439b965555aa4fe5d9bb/training-jobs/729784d4-4c7a-4e8c-9ed9-f78e4e6242b6/tasks/worker-0/reports/training-event": dial tcp: lookup modelarts.cn-central-231.myhuaweicloud.com on 10.247.3.10:53: no such host" file="event.go:51" Command=bootstrap/run Component=ma-training-toolkit Platform=ModelArts-Service
time="2023-10-06T17:52:45+08:00" level=error msg="bootstrap is exiting with exit code -1" file="bootstrap.go:243" Command=bootstrap/run Component=ma-training-toolkit Platform=ModelArts-Service
time="2023-10-06T17:52:45+08:00" level=info msg="retCode -1 has been written to the retCode file /home/ma-user/modelarts/retCode" file="bootstrap.go:221" Command=bootstrap/run Component=ma-training-toolkit Platform=ModelArts-Service
time="2023-10-06T17:52:46+08:00" level=info msg="[sidecar] training is completed" Component=ShellScripts Platform=ModelArts-Service
time="2023-10-06T17:52:46+08:00" level=info msg="[sidecar] the reason for the failure of the training job is under analysis" Component=ShellScripts Platform=ModelArts-Service
time="2023-10-06T17:52:46+08:00" level=warning msg="the log-preview-size parameter exceeds the limit and will be set to the default value 5242880" file="cli.go:195" Command=analyze Component=ma-training-toolkit Platform=ModelArts-Service
time="2023-10-06T17:52:46+08:00" level=error msg="the value -1 of the variable training-return-code is invalid, error: the range of Linux return code is between 0~255" file="cli.go:60" Command=analyze Component=ma-training-toolkit Platform=ModelArts-Service
time="2023-10-06T17:52:46+08:00" level=info msg="[sidecar] stop toolkit_obs_upload_by_channels_pid = 51 by signal SIGTERM" Component=ShellScripts Platform=ModelArts-Service
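Expected solution or suggestion
Please provide the error details or a fix, and explain how to avoid this kind of training error on the platform.
This issue has been reported to the developers at the intelligent computing sub-center; we are waiting for them to locate the cause.
It was a memory issue, and the intelligent computing sub-center has now resolved it.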
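For anyone hitting a similar failure where the platform log shows no traceback, here is a minimal sketch of a wrapper around the training entry point that prints the full traceback and the peak memory of the process before exiting. It is not the platform's or the sub-center's fix. The names `run_with_diagnostics`, `peak_memory_mib`, and `run_training` are hypothetical placeholders, and it assumes a Linux container, where `ru_maxrss` is reported in KiB. It only helps with Python-level exceptions; if the kernel kills the process for running out of memory, no Python handler will run.

```python
import resource
import sys
import traceback


def peak_memory_mib() -> float:
    """Peak resident set size of this process in MiB.

    Assumes Linux, where ru_maxrss is reported in KiB.
    """
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0


def run_with_diagnostics(train_fn) -> int:
    """Run train_fn, print any traceback to stderr, and return an exit code in 0..255."""
    try:
        train_fn()
        return 0
    except MemoryError:
        # Python-level allocation failure; a kernel OOM kill never reaches this handler.
        traceback.print_exc()
        return 2
    except Exception:
        traceback.print_exc()
        return 1
    finally:
        # Always record peak memory so it shows up in the task log.
        print(f"peak memory: {peak_memory_mib():.1f} MiB", file=sys.stderr, flush=True)


def run_training() -> None:
    """Hypothetical stand-in for the real training entry point."""
    raise RuntimeError("example failure")


if __name__ == "__main__":
    sys.exit(run_with_diagnostics(run_training))
```

Returning a small positive code rather than -1 also keeps the exit status inside the 0~255 range that the log above complains about.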