#1226 Calling mindspore.ms_memory_recycle() during training again fails with Segmentation fault (core dumped)

Closed
created 3 months ago by NoColorZheng · 1 comment
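<!-- To help us identify and resolve your problem more effectively, please provide as much of the following information as possible -->

### Problem description

The network contains multiple modules, each of which inherits from nn.Cell. During training, memory usage keeps growing: with only 10 data samples, it reaches more than ten GB after just a few epochs, and the process is eventually killed by the system. To reclaim memory and avoid being killed, I call mindspore.ms_memory_recycle(), but running it results in Segmentation fault (core dumped).

(When debugging locally, memory is reclaimed after mindspore.ms_memory_recycle() runs and no error occurs. When training on the OpenI (启智) AI platform, the error appears two or three seconds after this call finishes.)

### Environment (GPU/NPU)

GPU/CPU

### Cluster (启智/智算)

智算

### Task type (debugging/training/inference)

Debugging

### Task name

nocol202401202054280

### Log description or screenshots

In the net's construct: ![image](/attachments/b6a2f0af-0a5e-46a5-8269-77d9908bd398)

Error: ![image](/attachments/3b04ebc6-d92d-4abb-9653-a0761c73f7ad)

### Expected solution or suggestion

mindspore.ms_memory_recycle() should do its job and reclaim memory without raising an error.

### Steps to reproduce the issue

1. git clone https://openi.pcl.ac.cn/NoColorZheng/okgr_last.git
2. cd okgr_last
3. python setup.py develop
4. cd okgr_last/pcdet/models/backbones_3d/Chamfer3D
5. python setup.py develop
6. cd okgr_last
7. python train.py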
liuzx commented 3 months ago
Collaborator
You could ask the official MindSpore team about this.
liuzx closed this issue 2 months ago
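
The thread does not establish why memory grows from epoch to epoch. One general cause worth ruling out in any training loop is Python code that keeps references alive across iterations, for example by appending per-step results to a long-lived list. The sketch below is a way to check for that using only the standard library; it is not taken from the repository and does not call MindSpore. `measure_epoch_growth`, `leaky_epoch`, and `clean_epoch` are hypothetical names, and the two epoch functions are stand-ins for a real training step. Note that `tracemalloc` only sees allocations made through Python's allocator, so memory held by a framework's native code or on the GPU would not show up here.

```python
import gc
import tracemalloc


def measure_epoch_growth(train_one_epoch, num_epochs):
    """Return the traced Python-heap size (bytes) still allocated after each epoch."""
    tracemalloc.start()
    sizes = []
    try:
        for epoch in range(num_epochs):
            train_one_epoch(epoch)
            gc.collect()  # drop unreachable objects before measuring
            current, _peak = tracemalloc.get_traced_memory()
            sizes.append(current)
    finally:
        tracemalloc.stop()
    return sizes


# Hypothetical stand-in for a loop that leaks: every epoch's data stays
# reachable through a module-level list.
_history = []


def leaky_epoch(epoch):
    _history.append([float(i) for i in range(10_000)])


# Hypothetical stand-in for a loop that does not leak: the data becomes
# unreachable as soon as the function returns.
def clean_epoch(epoch):
    _ = [float(i) for i in range(10_000)]


leaky = measure_epoch_growth(leaky_epoch, 5)
clean = measure_epoch_growth(clean_epoch, 5)
print("leaky growth (bytes):", leaky[-1] - leaky[0])
print("clean growth (bytes):", clean[-1] - clean[0])
```

If the per-epoch numbers for the real training function climb steadily, the growth is at least partly in Python-level objects and can be traced further with `tracemalloc.take_snapshot()`. If they stay flat while process memory still grows, the allocation is happening outside Python's allocator, which would support taking the question to the MindSpore maintainers as suggested above.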