AI + Visual Feature Encoding Competition
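Problem description
When I use MindSpore in a training task on the OpenI (Qizhi) platform, the job is interrupted once training reaches the second epoch. What causes this, and how can it be resolved? The returned log is shown below.
Environment (GPU/NPU)
OpenI (Qizhi) platform training task, using mindspore1.10.1-train
Cluster (OpenI/Intelligent Computing)
Intelligent Computing Network
Task type (debug/training/inference)
Training task
Task name: pgdtj202309241644588
Log description or screenshot of the problem
The training task using MindSpore on the OpenI (Qizhi) platform is interrupted at the second epoch. The returned log is as follows: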
[2] Curr pnsr: 18.2717 Best pnsr: 18.271691938395374
time="2023-09-24T21:16:14+08:00" level=info msg="clean up child process succeed, pid=33, wstatus=9, exit_status=-1" file="cleaner_unix.go:75" Command=bootstrap/run Component=ma-training-toolkit Platform=ModelArts-Service
time="2023-09-24T21:16:14+08:00" level=info msg="clean up child process succeed, pid=58, wstatus=9, exit_status=-1" file="cleaner_unix.go:75" Command=bootstrap/run Component=ma-training-toolkit Platform=ModelArts-Service
time="2023-09-24T21:16:14+08:00" level=info msg="clean up child process succeed, pid=125, wstatus=9, exit_status=-1" file="cleaner_unix.go:75" Command=bootstrap/run Component=ma-training-toolkit Platform=ModelArts-Service
time="2023-09-24T21:16:14+08:00" level=warning msg="report event TrainingExit failed: send training-event info to algorancher failed, err: Post "https://modelarts.cn-central-231.myhuaweicloud.com/v2/4ffab007f8bc439b965555aa4fe5d9bb/training-jobs/7acff9c4-f026-4c4f-b9dc-77a03b606dca/tasks/worker-0/reports/training-event": dial tcp: lookup modelarts.cn-central-231.myhuaweicloud.com on 10.247.3.10:53: no such host" file="event.go:51" Command=bootstrap/run Component=ma-training-toolkit Platform=ModelArts-Service
time="2023-09-24T21:16:14+08:00" level=error msg="bootstrap is exiting with exit code -1" file="bootstrap.go:243" Command=bootstrap/run Component=ma-training-toolkit Platform=ModelArts-Service
time="2023-09-24T21:16:14+08:00" level=info msg="retCode -1 has been written to the retCode file /home/ma-user/modelarts/retCode" file="bootstrap.go:221" Command=bootstrap/run Component=ma-training-toolkit Platform=ModelArts-Service
time="2023-09-24T21:16:15+08:00" level=info msg="[sidecar] training is completed" Component=ShellScripts Platform=ModelArts-Service
time="2023-09-24T21:16:15+08:00" level=info msg="[sidecar] the reason for the failure of the training job is under analysis" Component=ShellScripts Platform=ModelArts-Service
time="2023-09-24T21:16:15+08:00" level=warning msg="the log-preview-size parameter exceeds the limit and will be set to the default value 5242880" file="cli.go:195" Command=analyze Component=ma-training-toolkit Platform=ModelArts-Service
time="2023-09-24T21:16:15+08:00" level=error msg="the value -1 of the variable training-return-code is invalid, error: the range of Linux return code is between 0~255" file="cli.go:60" Command=analyze Component=ma-training-toolkit Platform=ModelArts-Service
time="2023-09-24T21:16:15+08:00" level=info msg="[sidecar] stop toolkit_obs_upload_by_channels_pid = 52 by signal SIGTERM" Component=ShellScripts Platform=ModelArts-Service
time="2023-09-24T21:16:15+08:00" level=info msg="the periodic upload task exiting..." file="upload.go:216" Command=obs/upload_by_channels Component=ma-training-toolkit Platform=ModelArts-Service Task=srt_log_collection
time="2023-09-24T21:16:15+08:00" level=info msg="the periodic upload task exiting..." file="upload.go:216" Command=obs/upload_by_channels Component=ma-training-toolkit Platform=ModelArts-Service Task=log_url
failed
Expected solution or suggestion
I would appreciate help resolving this problem.
Duplicate issue; see #1134.
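A note on reading the log above: the cleaner lines report `wstatus=9` for each child process. Assuming this field is the raw POSIX wait status (the log itself does not confirm this), a value of 9 would mean the process was terminated by signal 9 (SIGKILL) rather than exiting on its own. The sketch below shows how such a value can be decoded on Linux with Python's standard library; `describe_wait_status` is a hypothetical helper name, not part of MindSpore or the ModelArts toolkit.

```python
import os
import signal


def describe_wait_status(wstatus: int) -> str:
    """Decode a raw POSIX wait status, as returned by os.wait()/os.waitpid().

    Assumption: the `wstatus` field in the log uses this same encoding.
    """
    if os.WIFSIGNALED(wstatus):
        # The process was terminated by a signal; extract the signal number.
        sig = os.WTERMSIG(wstatus)
        return f"killed by signal {sig} ({signal.Signals(sig).name})"
    if os.WIFEXITED(wstatus):
        # The process exited on its own; extract its exit code.
        return f"exited normally with code {os.WEXITSTATUS(wstatus)}"
    return f"unrecognized wait status {wstatus}"


# Value taken from the `wstatus=9` field in the log above.
print(describe_wait_status(9))
```

If that reading is right, the training process was killed from outside (by the platform or the kernel), so the Python traceback would not show the cause. The log alone does not say why the kill happened; the referenced issue #1134 is the place to check.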