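#1131 Using MindSpore in an OpenI platform training task, the job is interrupted when it reaches the second epoch. What is the cause, and how should it be handled? The returned log is below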

Closed
opened 7 months ago by Ghost · 1 comment
Ghost commented 7 months ago
AI + Visual Feature Coding Competition

### Problem description

Using MindSpore in an OpenI platform training task, the job is interrupted when it reaches the second epoch. What is the cause, and how should it be handled? The returned log is below.

### Environment (GPU/NPU)

OpenI platform training task, using mindspore1.10.1-train

### Cluster (OpenI/Zhisuan)

Zhisuan (intelligent computing network)

### Task type (debug/training/inference)

Training task

### Task name

pgdtj202309241644588

### Log description or screenshot

Using MindSpore in an OpenI platform training task, the job is interrupted when it reaches the second epoch. What is the cause, and how should it be handled? The returned log is below.

[2] Curr pnsr: 18.2717 Best pnsr: 18.271691938395374
time="2023-09-24T21:16:14+08:00" level=info msg="clean up child process succeed, pid=33, wstatus=9, exit_status=-1" file="cleaner_unix.go:75" Command=bootstrap/run Component=ma-training-toolkit Platform=ModelArts-Service
time="2023-09-24T21:16:14+08:00" level=info msg="clean up child process succeed, pid=58, wstatus=9, exit_status=-1" file="cleaner_unix.go:75" Command=bootstrap/run Component=ma-training-toolkit Platform=ModelArts-Service
time="2023-09-24T21:16:14+08:00" level=info msg="clean up child process succeed, pid=125, wstatus=9, exit_status=-1" file="cleaner_unix.go:75" Command=bootstrap/run Component=ma-training-toolkit Platform=ModelArts-Service
time="2023-09-24T21:16:14+08:00" level=warning msg="report event TrainingExit failed: send training-event info to algorancher failed, err: Post \"https://modelarts.cn-central-231.myhuaweicloud.com/v2/4ffab007f8bc439b965555aa4fe5d9bb/training-jobs/7acff9c4-f026-4c4f-b9dc-77a03b606dca/tasks/worker-0/reports/training-event\": dial tcp: lookup modelarts.cn-central-231.myhuaweicloud.com on 10.247.3.10:53: no such host" file="event.go:51" Command=bootstrap/run Component=ma-training-toolkit Platform=ModelArts-Service
time="2023-09-24T21:16:14+08:00" level=error msg="bootstrap is exiting with exit code -1" file="bootstrap.go:243" Command=bootstrap/run Component=ma-training-toolkit Platform=ModelArts-Service
time="2023-09-24T21:16:14+08:00" level=info msg="retCode -1 has been written to the retCode file /home/ma-user/modelarts/retCode" file="bootstrap.go:221" Command=bootstrap/run Component=ma-training-toolkit Platform=ModelArts-Service
time="2023-09-24T21:16:15+08:00" level=info msg="[sidecar] training is completed" Component=ShellScripts Platform=ModelArts-Service
time="2023-09-24T21:16:15+08:00" level=info msg="[sidecar] the reason for the failure of the training job is under analysis" Component=ShellScripts Platform=ModelArts-Service
time="2023-09-24T21:16:15+08:00" level=warning msg="the log-preview-size parameter exceeds the limit and will be set to the default value 5242880" file="cli.go:195" Command=analyze Component=ma-training-toolkit Platform=ModelArts-Service
time="2023-09-24T21:16:15+08:00" level=error msg="the value -1 of the variable training-return-code is invalid, error: the range of Linux return code is between 0~255" file="cli.go:60" Command=analyze Component=ma-training-toolkit Platform=ModelArts-Service
time="2023-09-24T21:16:15+08:00" level=info msg="[sidecar] stop toolkit_obs_upload_by_channels_pid = 52 by signal SIGTERM" Component=ShellScripts Platform=ModelArts-Service
time="2023-09-24T21:16:15+08:00" level=info msg="the periodic upload task exiting..." file="upload.go:216" Command=obs/upload_by_channels Component=ma-training-toolkit Platform=ModelArts-Service Task=srt_log_collection
time="2023-09-24T21:16:15+08:00" level=info msg="the periodic upload task exiting..." file="upload.go:216" Command=obs/upload_by_channels Component=ma-training-toolkit Platform=ModelArts-Service Task=log_url
failed

### Expected solution or suggestion

I would appreciate help resolving this problem.
liuzx commented 7 months ago
Collaborator
Duplicate issue; see https://openi.pcl.ac.cn/zeizei/OpenI_Learning/issues/1134
liuzx closed this issue 7 months ago
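
A note for anyone reading the log in this issue: the `clean up child process` lines report `wstatus=9`. Assuming that value is a raw POSIX wait status (the thread does not confirm this), it can be decoded with Python's standard `os` helpers, as in the minimal sketch below. The function name `describe_wait_status` is hypothetical and is not part of MindSpore or ModelArts.

```python
import os
import signal


def describe_wait_status(wstatus: int) -> str:
    """Decode a raw POSIX wait status.

    Assumption: the `wstatus` field in the log is the raw status
    that waitpid() would return for the child process.
    """
    if os.WIFSIGNALED(wstatus):
        # The child was terminated by a signal; extract its number.
        sig = os.WTERMSIG(wstatus)
        try:
            name = signal.Signals(sig).name
        except ValueError:
            name = f"signal {sig}"
        return f"killed by {name}"
    if os.WIFEXITED(wstatus):
        # The child exited normally; extract its exit code.
        return f"exited with code {os.WEXITSTATUS(wstatus)}"
    return "stopped or continued"


# The value reported for pid=33, pid=58 and pid=125 in the log.
print(describe_wait_status(9))
```

Under that assumption, the training processes did not exit on their own but were terminated by signal 9 (SIGKILL). A SIGKILL often comes from the kernel's out-of-memory killer or from the platform stopping the job, but the log alone does not show which; the linked issue 1134 is the place to check for the actual diagnosis.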