Problem description
The training task failed after running for 14 hours and 54 minutes.
Environment (GPU/NPU)
GPU
Cluster (启智 Qizhi / 智算 Zhisuan)
智算 (Zhisuan)
Task type (debug/training/inference)
Training
Task name
train02
Log output or problem screenshot
Traceback (most recent call last):
    torch.save(net.state_dict(), model_save_path + '/net_%d.pkl' % e)
  File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 379, in save
    _save(obj, opened_zipfile, pickle_module, pickle_protocol)
  File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 499, in _save
    zip_file.write_record(name, storage.data_ptr(), num_bytes)
OSError: [Errno 28] No space left on device

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 380, in save
  File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 259, in __exit__
    self.file_like.write_end_of_file()
RuntimeError: [enforce fail at inline_container.cc:300] . unexpected pos 7539136 vs 7539024

terminate called after throwing an instance of 'c10::Error'
  what(): [enforce fail at inline_container.cc:300] . unexpected pos 7539136 vs 7539024
frame #0: c10::ThrowEnforceNotMet(char const*, int, char const*, std::string const&, void const*) + 0x47 (0x7fda19268ae7 in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x27955b0 (0x7fd8f8e2b5b0 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x2790b8c (0x7fd8f8e26b8c in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #3: caffe2::serialize::PyTorchStreamWriter::writeRecord(std::string const&, void const*, unsigned long, bool) + 0xb5 (0x7fd8f8e2e7f5 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #4: caffe2::serialize::PyTorchStreamWriter::writeEndOfFile() + 0x173 (0x7fd8f8e2eae3 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #5: caffe2::serialize::PyTorchStreamWriter::~PyTorchStreamWriter() + 0x125 (0x7fd8f8e2ed55 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0xb45b73 (0x7fd9ee9ffb73 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x2a35e8 (0x7fd9ee15d5e8 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0x2a48ee (0x7fd9ee15e8ee in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #18: __libc_start_main + 0xe7 (0x7fda326ddbf7 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)
failed
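
The log itself names the immediate cause: torch.save hit OSError: [Errno 28] No space left on device while writing the checkpoint archive, and the PyTorchStreamWriter destructor then aborted the process while trying to finalize the truncated file. One defensive pattern is to check free space first and write the checkpoint through a temporary file, so a failed save cannot leave a corrupt net_%d.pkl behind. The sketch below reuses net, model_save_path, and the epoch counter e from the traceback; the helper name save_checkpoint_safely and the 2 GiB threshold are assumptions, not part of the original code.

import os
import shutil
import tempfile

import torch

def save_checkpoint_safely(state_dict, save_dir, epoch, min_free_bytes=2 * 1024**3):
    """Save a checkpoint only if enough disk space remains, writing via a
    temporary file so a failed save cannot leave a corrupt net_%d.pkl behind.
    (Illustrative helper; the 2 GiB threshold is an assumption.)"""
    free = shutil.disk_usage(save_dir).free
    if free < min_free_bytes:
        # Fail loudly before writing, rather than aborting mid-archive.
        raise RuntimeError('only %d bytes free in %s; refusing to save epoch %d'
                           % (free, save_dir, epoch))
    final_path = os.path.join(save_dir, 'net_%d.pkl' % epoch)
    # Create the temp file in the same directory so os.replace stays atomic.
    fd, tmp_path = tempfile.mkstemp(dir=save_dir, suffix='.tmp')
    os.close(fd)
    try:
        torch.save(state_dict, tmp_path)
        os.replace(tmp_path, final_path)
    except OSError:
        # Remove the partial file instead of letting it eat more space.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

In the training loop this would replace the bare call from the traceback, e.g. save_checkpoint_safely(net.state_dict(), model_save_path, e).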
Expected solution or suggestion
I don't know where the root cause of this problem lies; any help would be appreciated, thanks. I previously trained this same code on A100 GPUs on the 启智 (Qizhi) cluster and never hit this error, but on the 智算 (Zhisuan) cluster all sorts of errors keep appearing. It feels like restoring the Qizhi cluster would be the better option.
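
One plausible explanation for the cluster difference is disk quota: if the Zhisuan workspace is smaller than the old Qizhi one, saving a new net_%d.pkl every epoch for nearly 15 hours can fill it. A minimal sketch of checkpoint rotation, keeping only the newest few files (the helper name prune_old_checkpoints and the keep_last=3 count are assumptions):

import glob
import os

def prune_old_checkpoints(save_dir, keep_last=3):
    """Delete all but the newest `keep_last` checkpoints matching net_*.pkl,
    assuming the net_%d.pkl naming from the traceback above."""
    paths = glob.glob(os.path.join(save_dir, 'net_*.pkl'))
    def epoch_of(path):
        stem = os.path.splitext(os.path.basename(path))[0]  # e.g. 'net_12'
        return int(stem.split('_')[1])
    # Sort oldest-first by embedded epoch number, then drop all but the tail.
    for old in sorted(paths, key=epoch_of)[:-keep_last]:
        os.remove(old)

Calling prune_old_checkpoints(model_save_path) right after each successful save would keep disk usage bounded regardless of how long the job runs.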
Due to a resource reallocation, the GPU resources of the 启智 (Qizhi) cluster have been consolidated under the 智算 (Zhisuan) cluster.