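#1143 Training runs normally for the first few epochs, then the training toolkit (ma-training-toolkit) hits a network connection error while sending training event info to the ModelArts platform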

Closed
created 7 months ago by relic · 3 comments
relic commented 7 months ago
<!-- To help us identify and resolve your issue more effectively, please fill in as much of the following information as possible -->

### Problem description

Training ran normally for the first 5 epochs. After that, the training toolkit (ma-training-toolkit) hit a network connection error while trying to send training event info to the ModelArts platform.

### Environment (GPU/NPU)

NPU

### Cluster (启智/智算)

智算

### Task type (debug/training/inference)

Training

### Task name

relic202310021118257

### Log excerpt or screenshot

time="2023-10-03T03:38:13+08:00" level=warning msg="report event TrainingExit failed: send training-event info to algorancher failed, err: Post \"https://modelarts.cn-central-231.myhuaweicloud.com/v2/4ffab007f8bc439b965555aa4fe5d9bb/training-jobs/a8534e7a-49e8-47ff-899c-6c1a9ba0d192/tasks/worker-0/reports/training-event\": dial tcp: lookup modelarts.cn-central-231.myhuaweicloud.com on 10.247.3.10:53: no such host" file="event.go:51" Command=bootstrap/run Component=ma-training-toolkit Platform=ModelArts-Service
time="2023-10-03T03:38:13+08:00" level=error msg="bootstrap is exiting with exit code -1" file="bootstrap.go:243" Command=bootstrap/run Component=ma-training-toolkit Platform=ModelArts-Service
time="2023-10-03T03:38:13+08:00" level=info msg="retCode -1 has been written to the retCode file /home/ma-user/modelarts/retCode" file="bootstrap.go:221" Command=bootstrap/run Component=ma-training-toolkit Platform=ModelArts-Service
time="2023-10-03T03:38:13+08:00" level=info msg="[sidecar] training is completed" Component=ShellScripts Platform=ModelArts-Service
time="2023-10-03T03:38:13+08:00" level=info msg="[sidecar] the reason for the failure of the training job is under analysis" Component=ShellScripts Platform=ModelArts-Service
time="2023-10-03T03:38:13+08:00" level=warning msg="the log-preview-size parameter exceeds the limit and will be set to the default value 5242880" file="cli.go:195" Command=analyze Component=ma-training-toolkit Platform=ModelArts-Service
time="2023-10-03T03:38:13+08:00" level=error msg="the value -1 of the variable training-return-code is invalid, error: the range of Linux return code is between 0~255" file="cli.go:60" Command=analyze Component=ma-training-toolkit Platform=ModelArts-Service
time="2023-10-03T03:38:13+08:00" level=info msg="[sidecar] stop toolkit_obs_upload_by_channels_pid = 53 by signal SIGTERM" Component=ShellScripts Platform=ModelArts-Service

### Expected solution or suggestion

I think the cause may be a network connection error on the platform side while it was transmitting the per-epoch loss logs that my launch script prints after each training epoch. I would appreciate a solution.
liuzx commented 7 months ago
Collaborator
The log reports this error: time="2023-10-03T03:38:13+08:00" level=error msg="the value -1 of the variable training-return-code is invalid, error: the range of Linux return code is between 0~255" file="cli.go:60" Command=analyze Component=ma-training-toolkit Platform=ModelArts-Service. This is similar to issue https://openi.pcl.ac.cn/zeizei/OpenI_Learning/issues/1134, which you can use as a reference. The error is caused by the job running out of memory.
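For readers puzzled by the "range of Linux return code is between 0~255" message: a process exit status on Linux keeps only the low 8 bits of the value passed to exit, so -1 is not a value the analyzer can receive from a normally exited process. The sketch below only illustrates that range rule. The helper names are hypothetical, and nothing in this thread states how ma-training-toolkit itself arrives at -1.

```python
import subprocess
import sys


def is_valid_exit_code(code: int) -> bool:
    """True if the value fits the 0-255 range of a Linux exit status."""
    return 0 <= code <= 255


def normalize_exit_code(code: int) -> int:
    """Keep only the low 8 bits, as the kernel does for an exit status."""
    return code & 0xFF


def child_exit_status(code: int) -> int:
    """Run a child Python process that calls sys.exit(code) and return
    the status the parent observes (hypothetical helper, POSIX only)."""
    result = subprocess.run(
        [sys.executable, "-c", f"import sys; sys.exit({code})"]
    )
    return result.returncode
```

On a Linux host, `child_exit_status(-1)` should report the same value as `normalize_exit_code(-1)`, which is why a recorded -1 usually means the wrapper failed to collect a real exit status rather than that the training script returned -1.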
relic commented 7 months ago
Poster
But the first several epochs trained normally. The error only appeared after about 15 hours of running.
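A failure that shows up only after many hours is consistent with memory that grows a little every epoch. One way to check is to log the process's peak memory after each epoch and watch the trend. The following is a minimal sketch under stated assumptions: `train_one_epoch` stands in for the real training step, all function names are hypothetical, and it measures host process memory only (not NPU device memory), using the standard `resource` module on Linux.

```python
import resource


def peak_rss_mib() -> float:
    """Peak resident set size of this process in MiB.
    On Linux, ru_maxrss is reported in KiB."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0


def growth_per_epoch(samples) -> float:
    """Average change between consecutive samples; 0.0 with fewer than two."""
    if len(samples) < 2:
        return 0.0
    return (samples[-1] - samples[0]) / (len(samples) - 1)


def train_with_memory_log(num_epochs, train_one_epoch, log=print):
    """Call train_one_epoch(epoch) for each epoch and log peak memory after it.
    train_one_epoch is a placeholder for the real per-epoch training code."""
    samples = []
    for epoch in range(num_epochs):
        train_one_epoch(epoch)
        samples.append(peak_rss_mib())
        log(
            f"epoch {epoch}: peak RSS {samples[-1]:.1f} MiB, "
            f"avg growth {growth_per_epoch(samples):.2f} MiB/epoch"
        )
    return samples
```

If the logged peak keeps climbing at a steady rate, common suspects are per-epoch losses or tensors accumulated in a Python list without being converted to plain numbers, or a data pipeline that caches samples indefinitely.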
liuzx commented 6 months ago
Collaborator
This was a memory issue, and it has now been resolved.
liuzx closed this issue 6 months ago