#1147 No specific error message: the training job starts failing right after the dataset is unpacked, while the same code runs fine in a debug task

Closed
created 7 months ago by relic · 2 comments
relic commented 7 months ago
<!-- To help us identify and resolve your problem more effectively, please fill in as much of the following information as possible -->

### Problem description
After confirming that the code ran without problems in a debug task, I created a training task, but it starts failing as soon as the dataset has been unpacked, and the logs do not show where the error occurs. I have modified the code more than ten times without any effect.

### Environment (GPU/NPU)
NPU

### Cluster (OpenI (启智) / Zhisuan (智算))
Zhisuan (智算)

### Task type (debug/training/inference)
Training

### Task name
relic202310061681552

### Log description or screenshot
time="2023-10-06T17:52:45+08:00" level=warning msg="report event TrainingExit failed: send training-event info to algorancher failed, err: Post \"https://modelarts.cn-central-231.myhuaweicloud.com/v2/4ffab007f8bc439b965555aa4fe5d9bb/training-jobs/729784d4-4c7a-4e8c-9ed9-f78e4e6242b6/tasks/worker-0/reports/training-event\": dial tcp: lookup modelarts.cn-central-231.myhuaweicloud.com on 10.247.3.10:53: no such host" file="event.go:51" Command=bootstrap/run Component=ma-training-toolkit Platform=ModelArts-Service

time="2023-10-06T17:52:45+08:00" level=error msg="bootstrap is exiting with exit code -1" file="bootstrap.go:243" Command=bootstrap/run Component=ma-training-toolkit Platform=ModelArts-Service

time="2023-10-06T17:52:45+08:00" level=info msg="retCode -1 has been written to the retCode file /home/ma-user/modelarts/retCode" file="bootstrap.go:221" Command=bootstrap/run Component=ma-training-toolkit Platform=ModelArts-Service

time="2023-10-06T17:52:46+08:00" level=info msg="[sidecar] training is completed" Component=ShellScripts Platform=ModelArts-Service

time="2023-10-06T17:52:46+08:00" level=info msg="[sidecar] the reason for the failure of the training job is under analysis" Component=ShellScripts Platform=ModelArts-Service

time="2023-10-06T17:52:46+08:00" level=warning msg="the log-preview-size parameter exceeds the limit and will be set to the default value 5242880" file="cli.go:195" Command=analyze Component=ma-training-toolkit Platform=ModelArts-Service

time="2023-10-06T17:52:46+08:00" level=error msg="the value -1 of the variable training-return-code is invalid, error: the range of Linux return code is between 0~255" file="cli.go:60" Command=analyze Component=ma-training-toolkit Platform=ModelArts-Service

time="2023-10-06T17:52:46+08:00" level=info msg="[sidecar] stop toolkit_obs_upload_by_channels_pid = 51 by signal SIGTERM" Component=ShellScripts Platform=ModelArts-Service

### Expected solution or suggestion
Please provide the actual error information or a fix, and advice on how to avoid this kind of training error on the platform.
liuzx commented 7 months ago
Collaborator
This issue has been reported to the developers at the Zhisuan (智算) sub-center; we are waiting for them to locate the cause.
liuzx commented 6 months ago
Collaborator
It was a memory issue; the Zhisuan (智算) sub-center has now resolved it.
liuzx closed this issue 6 months ago
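
Editorial note: the thread attributes the failure to a memory problem, and the reporter could not see where the job died. One way to make such exits more visible is a thin launcher that runs the training script as a child process and prints why it stopped. The following is a minimal sketch only, not part of the platform or of the reporter's code: `describe_exit`, `run_and_report`, and the `train.py` name are hypothetical, and it assumes the training entry point can be started as an ordinary command on a POSIX system. A process killed for exceeding memory is typically terminated with SIGKILL, which leaves no Python traceback, so the sketch flags that case as a hint rather than a diagnosis.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess return code into a readable explanation."""
    if returncode == 0:
        return "exited normally"
    if returncode < 0:
        # On POSIX, subprocess reports death-by-signal as the negated signal number.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = "signal %d" % -returncode
        hint = ""
        if -returncode == signal.SIGKILL:
            # Assumption: SIGKILL is often, but not always, the kernel OOM killer.
            hint = " (possibly out of memory)"
        return "killed by %s%s" % (name, hint)
    return "exited with code %d" % returncode


def run_and_report(cmd):
    """Run cmd with inherited stdout/stderr and print why it stopped."""
    proc = subprocess.run(cmd)
    print("[wrapper] child %s" % describe_exit(proc.returncode), flush=True)
    return proc.returncode


if __name__ == "__main__":
    # Hypothetical usage: python wrapper.py train.py --epochs 10
    code = run_and_report([sys.executable] + sys.argv[1:])
    # Keep the wrapper's own exit status inside the 0-255 range.
    sys.exit(code if 0 <= code <= 255 else 1)
```

The final line clamps the wrapper's own status because, as the log above notes, Linux return codes are limited to 0~255; a negative value passed to `sys.exit` would not be reported as such by the operating system.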