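Problem description
I keep only the latest checkpoint at any time, yet training has already been interrupted twice because the disk ran out of space.
Note that both failures happened in the very middle of training: assuming 180 saves in total, 89 saves/replacements had already succeeded, and the 90th failed for lack of disk space.
The problem had never occurred before these two incidents, even in runs where the checkpoint was much larger.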
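For reference, a minimal sketch of one common way to keep only the latest checkpoint: write to a temporary file, then replace the old file. The function name `save_latest` and the raw-bytes payload are hypothetical and not taken from the training script in this issue. With this pattern the old and new checkpoints briefly coexist, so the disk needs room for two copies at the moment of each save.

```python
import os
import tempfile


def save_latest(data: bytes, path: str) -> None:
    """Write data to a temp file next to path, then atomically replace path.

    Hypothetical helper: the old checkpoint stays intact until the new one
    has been written completely.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes reach the disk
        os.replace(tmp_path, path)  # atomic on the same filesystem
    except BaseException:
        # A failed write (e.g. disk full) must not leave a partial file behind.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If a save fails with this approach, the previous checkpoint is still usable, which at least allows training to be resumed from it.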
Environment (GPU/NPU)
GPU V100 × 8
Cluster (Qizhi/Zhisuan)
Zhisuan
Task type (debug/training/inference)
Training
Task name
edwar202305111486064
edwar202305010801819
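Log description or screenshot of the problem
Expected solution or suggestions
The system resources tab currently shows CPU load, GPU memory, and GPU load. I suggest adding disk space usage to it.
The insufficient disk space problem itself also needs to be investigated.
The disk space is not reserved for a single user; when several people use it at the same time, it may run out.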
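Since the disk is shared, free space can change between saves. Until disk usage is visible in the resources tab, a training script could check it before each save. The sketch below uses only the standard library; the helper names `free_bytes` and `has_room` and the 10% safety margin are illustrative assumptions, not part of the platform.

```python
import shutil


def free_bytes(path: str) -> int:
    """Return the free space, in bytes, on the filesystem containing path."""
    return shutil.disk_usage(path).free


def has_room(path: str, needed_bytes: int, margin: float = 0.1) -> bool:
    """True if free space covers needed_bytes plus a safety margin.

    The margin (assumed 10% here) allows for other users writing
    to the same disk while the checkpoint is being saved.
    """
    return free_bytes(path) >= needed_bytes * (1 + margin)
```

A script could call `has_room(checkpoint_dir, expected_size)` before saving and log `free_bytes(checkpoint_dir)` at each step, so that the log shows when and how quickly the free space dropped.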