#1118 Training job interrupted for unknown reason; job status shows FAILED

Closed
created 8 months ago by euler0xfff · 2 comments
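<!-- To help us identify and resolve your problem more efficiently, please provide as much of the following information as possible -->

### Problem description

The training job was interrupted for an unknown reason, and its status shows FAILED. The errors in the log do not reveal why the job stopped.

Each run stopped unexpectedly after only a few steps. In the first two runs, the log showed an error and training stopped, but the job status still showed as running, so I stopped the jobs manually to avoid being charged further compute credits. In the most recent run, after the log showed an error, training stopped and the job status changed to FAILED.

### Environment (GPU/NPU)

NPU

### Cluster (启智/智算)

智算

### Task type (debug/training/inference)

Training

### Task names

euler202308301064293
euler202308300649986
euler202308291806625

### Log description or screenshots

After this training job stopped and its status changed to FAILED, I reopened the log half an hour later. The log then showed some additional training output after the error messages.

#### Error messages:

```
time="2023-08-30T18:42:17+08:00" level=info msg="auth file has been updated" file="authentication.go:105" Command=obs/upload_by_channels Component=ma-training-toolkit Platform=ModelArts-Service Task=log_url
time="2023-08-30T18:42:17+08:00" level=info msg="auth file has been updated" file="authentication.go:105" Command=obs/upload_by_channels Component=ma-training-toolkit Platform=ModelArts-Service Task=srt_log_collection
time="2023-08-30T18:42:18+08:00" level=info msg="auth info has been updated" file="authentication.go:113" Command=obs/upload_by_channels Component=ma-training-toolkit Platform=ModelArts-Service Task=srt_log_collection
time="2023-08-30T18:42:18+08:00" level=info msg="auth info has been updated" file="authentication.go:113" Command=obs/upload_by_channels Component=ma-training-toolkit Platform=ModelArts-Service Task=log_url
```

![image](/attachments/9a2aa17c-988e-42bc-aff3-7c114cb6611d)

### Expected solution or suggestion

I would appreciate a reply as soon as possible so that I can keep using this platform to finish my entry for the "昇腾AI创新大赛2023" (Ascend AI Innovation Competition 2023) held by Huawei.

Please reply, I am begging you!!!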
euler0xfff commented 8 months ago
Poster
I'm not sure what happened, but the log screenshot was uploaded twice.
liuzx commented 8 months ago
Collaborator
The job ran out of memory. ![image](/attachments/eb6f9261-bf35-4137-8948-4857d59abba9)
liuzx closed this issue 6 months ago