#1190 Training task failed after running for 14 hours 54 minutes

Closed
created 4 months ago by pwkQiZhi · 1 comment
<!-- To help us identify and resolve your issue more effectively, please provide as much of the following information as possible -->

### Problem description

The training task failed after running for 14 hours 54 minutes.

### Environment (GPU/NPU)

GPU

### Cluster (Qizhi 启智 / Zhisuan 智算)

Zhisuan (智算)

### Task type (debug/training/inference)

Training

### Task name

train02

### Log or screenshot

    Traceback (most recent call last):
    File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 499, in _save
    zip_file.write_record(name, storage.data_ptr(), num_bytes)
    OSError: [Errno 28] No space left on device
    File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 379, in save
    _save(obj, opened_zipfile, pickle_module, pickle_protocol)
    return
    File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 380, in save
    terminate called after throwing an instance of 'c10::Error'
    During handling of the above exception, another exception occurred:
    File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 259, in __exit__
    torch.save(net.state_dict(), model_save_path + '/net_%d.pkl' % e)
    Traceback (most recent call last):
    frame #4: caffe2::serialize::PyTorchStreamWriter::writeEndOfFile() + 0x173 (0x7fd8f8e2eae3 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
    self.file_like.write_end_of_file()
    frame #0: c10::ThrowEnforceNotMet(char const*, int, char const*, std::string const&, void const*) + 0x47 (0x7fda19268ae7 in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10.so)
    frame #3: caffe2::serialize::PyTorchStreamWriter::writeRecord(std::string const&, void const*, unsigned long, bool) + 0xb5 (0x7fd8f8e2e7f5 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
    frame #1: <unknown function> + 0x27955b0 (0x7fd8f8e2b5b0 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
    frame #8: <unknown function> + 0x2a48ee (0x7fd9ee15e8ee in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
    RuntimeError: [enforce fail at inline_container.cc:300] . unexpected pos 7539136 vs 7539024
    frame #5: caffe2::serialize::PyTorchStreamWriter::~PyTorchStreamWriter() + 0x125 (0x7fd8f8e2ed55 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
    frame #7: <unknown function> + 0x2a35e8 (0x7fd9ee15d5e8 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
    frame #6: <unknown function> + 0xb45b73 (0x7fd9ee9ffb73 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
    frame #18: __libc_start_main + 0xe7 (0x7fda326ddbf7 in /lib/x86_64-linux-gnu/libc.so.6)
    Aborted (core dumped)
    frame #2: <unknown function> + 0x2790b8c (0x7fd8f8e26b8c in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
    what(): [enforce fail at inline_container.cc:300] . unexpected pos 7539136 vs 7539024
    <omitting python frames>
    failed

### Expected solution or suggestion

I don't know what the root cause of this problem is. Please help, thanks! I previously trained this same code on the Qizhi (启智) cluster's A100 and never saw this error, whereas all kinds of errors occur on the Zhisuan (智算) cluster. I think it would be better to switch back to the Qizhi cluster.
liuzx commented 4 months ago
Collaborator
Due to a resource adjustment, the GPU resources of the Qizhi (启智) cluster have been moved under the Zhisuan (智算) cluster.
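
The first exception in the traceback is `OSError: [Errno 28] No space left on device`, raised while `torch.save` was writing a checkpoint. The later `c10::Error` (`unexpected pos 7539136 vs 7539024`) looks like a consequence of that truncated write. One plausible cause, which this issue does not confirm, is that the per-epoch `net_%d.pkl` files gradually filled the output volume. The following is only a minimal sketch of guarding a checkpoint write, assuming the save can be wrapped in a callback. `has_free_space`, `save_atomically`, `write_fn`, and `required_bytes` are hypothetical names for illustration, not part of PyTorch or of the platform.

```python
import errno
import os
import shutil
import tempfile


def has_free_space(directory, required_bytes):
    """Return True if the filesystem holding `directory` has enough free bytes."""
    return shutil.disk_usage(directory).free >= required_bytes


def save_atomically(write_fn, final_path, required_bytes=0):
    """Write a checkpoint to a temporary file, then rename it into place.

    `write_fn(path)` is a hypothetical callback that performs the actual save,
    for example: lambda p: torch.save(net.state_dict(), p)
    """
    directory = os.path.dirname(os.path.abspath(final_path))
    os.makedirs(directory, exist_ok=True)

    # Fail early with a clear message instead of partway through the write.
    if not has_free_space(directory, required_bytes):
        raise OSError(errno.ENOSPC, "Not enough free space for checkpoint", final_path)

    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    os.close(fd)
    try:
        write_fn(tmp_path)
        # os.replace is atomic when source and target are on the same filesystem.
        os.replace(tmp_path, final_path)
    except BaseException:
        # Remove the partial file so a failed save does not keep consuming space.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return final_path
```

In the training loop from the traceback, the call might look like `save_atomically(lambda p: torch.save(net.state_dict(), p), model_save_path + '/net_%d.pkl' % e, required_bytes=...)`, where the size estimate is left to the user. Keeping only the most recent few checkpoints would also reduce disk usage over a long run.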
liuzx closed this issue 6 days ago