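Training job failed after running for 14 hours and 54 minutes

Problem description

The training job failed after 14 hours and 54 minutes of training.

Environment (GPU/NPU)

GPU

Cluster (Qizhi 启智 / Zhisuan 智算)

Zhisuan (智算)

Task type (debug/training/inference)

Training

Task name

train02

Logs or screenshots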

Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 499, in _save
zip_file.write_record(name, storage.data_ptr(), num_bytes)
OSError: [Errno 28] No space left on device
File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 379, in save
_save(obj, opened_zipfile, pickle_module, pickle_protocol)

return

File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 380, in save
terminate called after throwing an instance of 'c10::Error'
During handling of the above exception, another exception occurred:
File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 259, in __exit__
torch.save(net.state_dict(), model_save_path + '/net_%d.pkl' % e)
Traceback (most recent call last):
frame #4: caffe2::serialize::PyTorchStreamWriter::writeEndOfFile() + 0x173 (0x7fd8f8e2eae3 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
self.file_like.write_end_of_file()
frame #0: c10::ThrowEnforceNotMet(char const*, int, char const*, std::string const&, void const*) + 0x47 (0x7fda19268ae7 in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #3: caffe2::serialize::PyTorchStreamWriter::writeRecord(std::string const&, void const*, unsigned long, bool) + 0xb5 (0x7fd8f8e2e7f5 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #1: <unknown function> + 0x27955b0 (0x7fd8f8e2b5b0 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0x2a48ee (0x7fd9ee15e8ee in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
RuntimeError: [enforce fail at inline_container.cc:300] . unexpected pos 7539136 vs 7539024
frame #5: caffe2::serialize::PyTorchStreamWriter::~PyTorchStreamWriter() + 0x125 (0x7fd8f8e2ed55 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0x2a35e8 (0x7fd9ee15d5e8 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0xb45b73 (0x7fd9ee9ffb73 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #18: __libc_start_main + 0xe7 (0x7fda326ddbf7 in /lib/x86_64-linux-gnu/libc.so.6)
Aborted (core dumped)
frame #2: <unknown function> + 0x2790b8c (0x7fd8f8e26b8c in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
what(): [enforce fail at inline_container.cc:300] . unexpected pos 7539136 vs 7539024
<omitting python frames>

failed
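
The first exception in the log above is `OSError: [Errno 28] No space left on device`, raised while `torch.save` was writing a checkpoint; the later `unexpected pos` error appears while the partially written file is being closed. The log does not show which filesystem or quota filled up, so the following is only a minimal sketch of one way to guard against it, assuming the checkpoints follow the `net_<epoch>.pkl` naming seen in the traceback. The helper names (`has_free_space`, `prune_old_checkpoints`, `save_atomically`) are hypothetical and not part of the original training code.

```python
import os
import shutil
import tempfile


def has_free_space(directory, required_bytes):
    """Return True if the filesystem holding `directory` has at least `required_bytes` free."""
    return shutil.disk_usage(directory).free >= required_bytes


def prune_old_checkpoints(directory, keep_last, prefix="net_", suffix=".pkl"):
    """Delete all but the `keep_last` highest-numbered checkpoints; return the removed names."""
    found = []
    for name in os.listdir(directory):
        if name.startswith(prefix) and name.endswith(suffix):
            middle = name[len(prefix):-len(suffix)]
            if middle.isdigit():
                found.append((int(middle), name))
    found.sort()
    to_remove = found[:-keep_last] if keep_last > 0 else found
    removed = []
    for _, name in to_remove:
        os.remove(os.path.join(directory, name))
        removed.append(name)
    return removed


def save_atomically(write_fn, final_path):
    """Write to a temporary file next to `final_path`, then rename it into place.

    If `write_fn` fails (for example with Errno 28), the temporary file is
    removed and no truncated checkpoint is left under the final name.
    """
    directory = os.path.dirname(final_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    os.close(fd)
    try:
        write_fn(tmp_path)
        os.replace(tmp_path, final_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

In the training loop, the existing call could then be wrapped as `save_atomically(lambda p: torch.save(net.state_dict(), p), model_save_path + '/net_%d.pkl' % e)`, with `prune_old_checkpoints(model_save_path, keep_last=3)` run beforehand; the value 3 is an arbitrary illustration.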

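Expected solution or suggestion

I don't know what the root cause of this problem is. Any help would be appreciated, thanks! I previously trained this same code on the A100s of the Qizhi (启智) cluster without hitting this error, whereas the Zhisuan (智算) cluster produces all sorts of errors. I think it would be better to switch back to the Qizhi cluster.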


#1190 Training job failed after running for 14 hours and 54 minutes
