#1132 With MindSpore in an OpenI (启智) platform training task, training is interrupted at the second epoch. What is the cause, and how should it be handled?

Closed
opened 7 months ago by Ghost · 1 comment
Ghost commented 7 months ago
<!-- To help us identify and resolve your problem more effectively, please provide as much of the following information as possible -->

### Problem description

AI + visual feature encoding. With MindSpore in an OpenI (启智) platform training task, training is interrupted when it reaches the second epoch. What is the cause, and how should it be handled?

### Environment (GPU/NPU)

NPU

### Cluster (OpenI 启智 / Zhisuan 智算)

Zhisuan (智算)

### Task type (debug/training/inference)

Training

### Task name

pgdtj202309241125101

### Log description or screenshot

Successfully Upload /cache/output to s3:///grampus/job/pgdtj2023092411t114436769/output/models-0/

time="2023-09-24T15:55:24+08:00" level=info msg="clean up child process succeed, pid=33, wstatus=9, exit_status=-1" file="cleaner_unix.go:75" Command=bootstrap/run Component=ma-training-toolkit Platform=ModelArts-Service

time="2023-09-24T15:55:24+08:00" level=info msg="clean up child process succeed, pid=58, wstatus=9, exit_status=-1" file="cleaner_unix.go:75" Command=bootstrap/run Component=ma-training-toolkit Platform=ModelArts-Service

time="2023-09-24T15:55:24+08:00" level=warning msg="report event TrainingExit failed: send training-event info to algorancher failed, err: Post \"https://modelarts.cn-central-231.myhuaweicloud.com/v2/4ffab007f8bc439b965555aa4fe5d9bb/training-jobs/148c384d-cee8-4d8b-9d46-c3b5a50f99a2/tasks/worker-0/reports/training-event\": dial tcp: lookup modelarts.cn-central-231.myhuaweicloud.com on 10.247.3.10:53: no such host" file="event.go:51" Command=bootstrap/run Component=ma-training-toolkit Platform=ModelArts-Service

time="2023-09-24T15:55:24+08:00" level=error msg="bootstrap is exiting with exit code -1" file="bootstrap.go:243" Command=bootstrap/run Component=ma-training-toolkit Platform=ModelArts-Service

time="2023-09-24T15:55:24+08:00" level=info msg="retCode -1 has been written to the retCode file /home/ma-user/modelarts/retCode" file="bootstrap.go:221" Command=bootstrap/run Component=ma-training-toolkit Platform=ModelArts-Service

time="2023-09-24T15:55:25+08:00" level=info msg="[sidecar] training is completed" Component=ShellScripts Platform=ModelArts-Service

time="2023-09-24T15:55:25+08:00" level=info msg="[sidecar] the reason for the failure of the training job is under analysis" Component=ShellScripts Platform=ModelArts-Service

time="2023-09-24T15:55:25+08:00" level=warning msg="the log-preview-size parameter exceeds the limit and will be set to the default value 5242880" file="cli.go:195" Command=analyze Component=ma-training-toolkit Platform=ModelArts-Service

time="2023-09-24T15:55:25+08:00" level=error msg="the value -1 of the variable training-return-code is invalid, error: the range of Linux return code is between 0~255" file="cli.go:60" Command=analyze Component=ma-training-toolkit Platform=ModelArts-Service

time="2023-09-24T15:55:25+08:00" level=info msg="[sidecar] stop toolkit_obs_upload_by_channels_pid = 51 by signal SIGTERM" Component=ShellScripts Platform=ModelArts-Service

time="2023-09-24T15:55:25+08:00" level=info msg="the periodic upload task exiting..." file="upload.go:216" Command=obs/upload_by_channels Component=ma-training-toolkit Platform=ModelArts-Service Task=srt_log_collection

time="2023-09-24T15:55:25+08:00" level=info msg="the periodic upload task exiting..." file="upload.go:216" Command=obs/upload_by_channels Component=ma-training-toolkit Platform=ModelArts-Service Task=log_url

failed

### Expected solution or suggestion

Please help resolve this.
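The log reports `wstatus=9` for both child processes. As a hedged aid to reading that value, the sketch below decodes a number as a raw POSIX wait status using Python's standard `os` and `signal` modules. It is only an assumption that the toolkit logs a raw wait status in that field; the log itself does not confirm this, and the helper name `describe_wait_status` is hypothetical.

```python
import os
import signal


def describe_wait_status(wstatus: int) -> str:
    """Describe a raw POSIX wait status (assumed meaning of `wstatus` in the log)."""
    if os.WIFSIGNALED(wstatus):
        signum = os.WTERMSIG(wstatus)
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number with no symbolic name on this platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    if os.WIFEXITED(wstatus):
        return f"exited with code {os.WEXITSTATUS(wstatus)}"
    return f"unrecognized wait status {wstatus}"


if __name__ == "__main__":
    # Value taken from the log lines for pid=33 and pid=58.
    print(describe_wait_status(9))
```

If that reading holds, the children were terminated by a signal rather than exiting on their own, which would be worth mentioning when following up on the issue. The log alone does not say what sent the signal.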
liuzx commented 7 months ago
Collaborator
Duplicate issue; see https://openi.pcl.ac.cn/zeizei/OpenI_Learning/issues/1134
liuzx closed this issue 7 months ago