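Problem description
AI + visual feature coding
I am using MindSpore in a training task on the OpenI (Qizhi) platform. The job is interrupted every time training reaches the second epoch. What is causing this, and how should I handle it?
Environment (GPU/NPU)
NPU
Cluster (OpenI/Intelligent Computing)
Intelligent Computing
Task type (debug/training/inference)
Training
Task name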
pgdtj202309241125101
Log description or screenshot
Successfully Upload /cache/output to s3:///grampus/job/pgdtj2023092411t114436769/output/models-0/
time="2023-09-24T15:55:24+08:00" level=info msg="clean up child process succeed, pid=33, wstatus=9, exit_status=-1" file="cleaner_unix.go:75" Command=bootstrap/run Component=ma-training-toolkit Platform=ModelArts-Service
time="2023-09-24T15:55:24+08:00" level=info msg="clean up child process succeed, pid=58, wstatus=9, exit_status=-1" file="cleaner_unix.go:75" Command=bootstrap/run Component=ma-training-toolkit Platform=ModelArts-Service
time="2023-09-24T15:55:24+08:00" level=warning msg="report event TrainingExit failed: send training-event info to algorancher failed, err: Post "https://modelarts.cn-central-231.myhuaweicloud.com/v2/4ffab007f8bc439b965555aa4fe5d9bb/training-jobs/148c384d-cee8-4d8b-9d46-c3b5a50f99a2/tasks/worker-0/reports/training-event": dial tcp: lookup modelarts.cn-central-231.myhuaweicloud.com on 10.247.3.10:53: no such host" file="event.go:51" Command=bootstrap/run Component=ma-training-toolkit Platform=ModelArts-Service
time="2023-09-24T15:55:24+08:00" level=error msg="bootstrap is exiting with exit code -1" file="bootstrap.go:243" Command=bootstrap/run Component=ma-training-toolkit Platform=ModelArts-Service
time="2023-09-24T15:55:24+08:00" level=info msg="retCode -1 has been written to the retCode file /home/ma-user/modelarts/retCode" file="bootstrap.go:221" Command=bootstrap/run Component=ma-training-toolkit Platform=ModelArts-Service
time="2023-09-24T15:55:25+08:00" level=info msg="[sidecar] training is completed" Component=ShellScripts Platform=ModelArts-Service
time="2023-09-24T15:55:25+08:00" level=info msg="[sidecar] the reason for the failure of the training job is under analysis" Component=ShellScripts Platform=ModelArts-Service
time="2023-09-24T15:55:25+08:00" level=warning msg="the log-preview-size parameter exceeds the limit and will be set to the default value 5242880" file="cli.go:195" Command=analyze Component=ma-training-toolkit Platform=ModelArts-Service
time="2023-09-24T15:55:25+08:00" level=error msg="the value -1 of the variable training-return-code is invalid, error: the range of Linux return code is between 0~255" file="cli.go:60" Command=analyze Component=ma-training-toolkit Platform=ModelArts-Service
time="2023-09-24T15:55:25+08:00" level=info msg="[sidecar] stop toolkit_obs_upload_by_channels_pid = 51 by signal SIGTERM" Component=ShellScripts Platform=ModelArts-Service
time="2023-09-24T15:55:25+08:00" level=info msg="the periodic upload task exiting..." file="upload.go:216" Command=obs/upload_by_channels Component=ma-training-toolkit Platform=ModelArts-Service Task=srt_log_collection
time="2023-09-24T15:55:25+08:00" level=info msg="the periodic upload task exiting..." file="upload.go:216" Command=obs/upload_by_channels Component=ma-training-toolkit Platform=ModelArts-Service Task=log_url
failed
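A note on reading the log above: the two "clean up child process" lines report `wstatus=9`. The following is a minimal sketch for decoding that value, assuming `wstatus` is a raw POSIX wait status (this is an assumption; the log itself does not say so). The helper names `describe_wstatus` and `parse_children` are hypothetical, and the sample lines are trimmed copies of the log lines.

```python
import os
import re
import signal

# Two lines copied from the log above, trimmed to the relevant fields.
LOG_LINES = [
    'level=info msg="clean up child process succeed, pid=33, wstatus=9, exit_status=-1"',
    'level=info msg="clean up child process succeed, pid=58, wstatus=9, exit_status=-1"',
]


def describe_wstatus(wstatus: int) -> str:
    """Interpret a value as a raw POSIX wait status (assumed, not confirmed)."""
    if os.WIFSIGNALED(wstatus):
        sig = os.WTERMSIG(wstatus)
        return f"killed by signal {sig} ({signal.Signals(sig).name})"
    if os.WIFEXITED(wstatus):
        return f"exited with code {os.WEXITSTATUS(wstatus)}"
    return "unrecognized status"


def parse_children(lines):
    """Map each child pid found in the log lines to a decoded status string."""
    pattern = re.compile(r"pid=(\d+), wstatus=(\d+)")
    result = {}
    for line in lines:
        match = pattern.search(line)
        if match:
            pid, wstatus = int(match.group(1)), int(match.group(2))
            result[pid] = describe_wstatus(wstatus)
    return result


if __name__ == "__main__":
    for pid, meaning in parse_children(LOG_LINES).items():
        print(pid, meaning)
```

Under that assumption, both child processes were terminated by SIGKILL rather than exiting on their own. One common source of SIGKILL is the kernel's out-of-memory killer, but the log alone does not confirm the cause, so checking host memory usage during the second epoch would be a reasonable next step.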
Expected solution or suggestion
Please help me resolve this.
Help requested