#1189 Training task stuck in WAITING

Closed
created 4 months ago by pwkQiZhi · 2 comments
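### Problem description
The training task I created stays in WAITING and never reaches RUNNING.

### Environment (GPU/NPU)
GPU

### Cluster (启智/智算)
智算

### Task type (debug/training/inference)
Training

### Task name
train02

### Log output or screenshot
[FailedScheduling] 2023/12/12 16:12:05 all nodes are unavailable: 19 node(s) resource fit failed, 8 node(s) selector fit queue failed.

### Expected solution or suggestion
I don't know what is causing this. Is the task queued indefinitely because there are not enough GPU resources?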
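The scheduling message above groups the rejected nodes by reason. As a minimal sketch, assuming the message always follows the `N node(s) <reason>` pattern seen in this one log line, the counts can be pulled out as below. The function name `summarize_scheduling_failure` is hypothetical and is not part of the platform. Reading "resource fit failed" as "not enough free resources on the node" and "selector fit queue failed" as "node not selectable for this queue" is an interpretation of the wording, not something confirmed in this thread.

```python
import re


def summarize_scheduling_failure(message: str) -> dict[str, int]:
    """Return {reason: node_count} parsed from a FailedScheduling message.

    Assumes each reason appears as '<count> node(s) <reason>' and that
    reasons are separated by commas and terminated by a period.
    """
    pattern = re.compile(r"(\d+) node\(s\) ([^,.]+)")
    return {reason.strip(): int(count) for count, reason in pattern.findall(message)}


# The message text copied from the log line in this issue.
message = (
    "all nodes are unavailable: 19 node(s) resource fit failed, "
    "8 node(s) selector fit queue failed."
)

summary = summarize_scheduling_failure(message)
for reason, count in summary.items():
    print(f"{count:>3} node(s): {reason}")
```

Under that reading, 19 of the 27 nodes considered were rejected on the resource check, which would be consistent with the task waiting for GPU capacity.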
pwkQiZhi commented 4 months ago
Poster
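The 智算 cluster doesn't feel as usable as the 启智 cluster. I created a training task on the 智算 cluster more than two hours ago and its status is still WAITING. I don't understand why the 启智 cluster had to be taken down. Sigh!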
liuzx commented 4 months ago
Collaborator
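Resources are being adjusted, and from now on the 智算 cluster will be the primary one. The a100 and v100 have now been brought back online.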
liuzx closed this issue 4 months ago