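Problem description
With no other processes running, I installed the packages the project needs for training with "pip install scikit-learn einops tensorboardX Image torchvision tqdm". As soon as training starts, the GPU runs out of memory.
Staff can look at my clannad / Siam-NestedUNet_2 project. The project code is in the /tmp/code/Siam-NestedUNet/ directory. In the configuration file metadata.json, I have already lowered "batch_size" from 16 to 4, but the problem persists.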
Environment (GPU/NPU)
V100, A100
Cluster (OpenI Qizhi / Intelligent Computing)
None
Task type (debug/training/inference)
Debug
Task name
A100 task name: ta_dis_a100
V100 task name: ta_dis
Log description or screenshots
V100 run:
A100 run:
Both cards consistently hit the same problem.
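Expected solution or suggestions
How can I fix the GPU out-of-memory problem, why does it happen, and how should I reach staff for help when a problem like this comes up?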
I tried to reproduce the problem, and the task now runs normally.
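If the error comes back, one way to narrow it down is to check whether any batch size fits at all. The sketch below is only an illustration, not the project's code: `find_workable_batch_size` and `try_step` are hypothetical names, and `try_step` is assumed to run a single training step at the given batch size. It relies on the fact that PyTorch reports GPU memory exhaustion as a `RuntimeError` whose message contains "out of memory".

```python
from typing import Callable, Optional


def is_oom(err: RuntimeError) -> bool:
    # PyTorch's CUDA OOM error is a RuntimeError whose text
    # contains "out of memory"; match on that text.
    return "out of memory" in str(err).lower()


def find_workable_batch_size(
    try_step: Callable[[int], None], start: int = 16
) -> Optional[int]:
    """Halve the batch size until one training step succeeds.

    try_step is a hypothetical callable that runs one step at the
    given batch size and raises RuntimeError on failure.
    Returns the first batch size that works, or None if even
    batch size 1 runs out of memory.
    """
    batch_size = start
    while batch_size >= 1:
        try:
            try_step(batch_size)
            return batch_size
        except RuntimeError as err:
            if not is_oom(err):
                raise  # unrelated error: do not hide it
            batch_size //= 2
    return None
```

If this returns None, lowering "batch_size" in metadata.json is unlikely to help, and the cause is more likely that memory is already occupied or that the setting is not being read by the training script.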