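Problem description
A training task on the Zhisuan platform fails with a message that memory cannot be allocated.
Related environment (GPU/NPU)
NPU
Related cluster (Qizhi/Zhisuan)
Zhisuan
Task type (debug/training/inference)
Training
Task name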
lihao202309262030140
Log description or problem screenshot
Traceback (most recent call last):
  File "/cache/code/ant1014/Main.py", line 111, in <module>
    loss = train_step(data)
  File "/cache/code/ant1014/Main.py", line 100, in train_step
    loss_1, grads = grad_fn(data_1)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/common/api.py", line 594, in staging_specialize
    out = _MindsporeFunctionExecutor(func, hash_obj, input_signature, process_obj, jit_config)(*args)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/common/api.py", line 98, in wrapper
    results = fn(*arg, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/common/api.py", line 409, in __call__
    output = self._graph_executor(tuple(new_inputs), phase)
RuntimeError: Fail to alloc memory, size: 24576512, memory statistics:
Device HBM memory size: 32768M
MindSpore Used memory size: 30686M
MindSpore memory base address: 0x120800000000
Total Static Memory size: 30624M
Total Dynamic memory size: 45M
Dynamic memory size of this graph: 45M
mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_memory_manager.cc:52 MallocMemFromMemPool
INFO:root:List OBS time cost: 0.16 seconds.
download code successfully
unzip code successfully
INFO:root:Copy parallel total time cost: 0.25 seconds.
upload model successfully
download system code successfully
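The memory statistics in the traceback can be read as simple arithmetic. The sketch below parses the figures from the log and compares the failed request with the space that appears to be left. It rests on two assumptions that the log itself does not confirm: that "M" means MiB, and that "MindSpore Used memory size" is the pool MindSpore has reserved on the device, so that static plus dynamic memory is what has been consumed from it. The helper names `parse_stats` and `headroom_mib` are made up for this illustration.

```python
import re

# Memory statistics copied from the traceback in this issue.
LOG = """\
RuntimeError: Fail to alloc memory, size: 24576512, memory statistics:
Device HBM memory size: 32768M
MindSpore Used memory size: 30686M
Total Static Memory size: 30624M
Total Dynamic memory size: 45M
"""

MIB = 1024 * 1024  # assumption: the "M" suffix in the log means MiB


def parse_stats(log: str) -> dict:
    """Collect the failed request size (bytes) and every '<name> size: <n>M' line."""
    stats = {}
    request = re.search(r"Fail to alloc memory, size: (\d+)", log)
    if request:
        stats["request_bytes"] = int(request.group(1))
    for name, value in re.findall(r"^(.+?) size: (\d+)M$", log, flags=re.M):
        stats[name] = int(value)
    return stats


def headroom_mib(stats: dict) -> int:
    """Reserved pool minus static and dynamic usage (assumed interpretation)."""
    pool = stats["MindSpore Used memory"]
    used = stats["Total Static Memory"] + stats["Total Dynamic memory"]
    return pool - used


stats = parse_stats(LOG)
request_mib = stats["request_bytes"] / MIB
free_mib = headroom_mib(stats)
print(f"requested: {request_mib:.2f} MiB, apparent headroom: {free_mib} MiB")
print("request fits" if request_mib <= free_mib else "request does not fit")
```

Under these assumptions, static memory (30624M) takes up almost the entire pool, and the failed request is larger than what remains. That points at statically allocated tensors such as weights, optimizer state, and graph inputs sized by the batch, rather than at the small dynamic part.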
Expected solution or suggestion
I would appreciate guidance on how to resolve this problem.
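This looks like the device running out of memory: the log reports 30624M of static memory on a device with 32768M of HBM. Consider measures such as reducing batch_size, array sizes, and the number of threads.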
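One way to apply the batch_size suggestion is to halve the batch size until a single training step no longer raises the allocation error. The following is a minimal sketch of that search in plain Python, not MindSpore code: `largest_fitting_batch_size` and `fake_step` are hypothetical names, and `fake_step` only simulates the failure with an invented threshold of 16. Whether a MindSpore process can keep running after a real "Fail to alloc memory" error is not established here, so on the actual platform each trial may need to be submitted as a separate task.

```python
from typing import Callable


def largest_fitting_batch_size(
    run_step: Callable[[int], None], start: int, minimum: int = 1
) -> int:
    """Halve the batch size until run_step stops failing with an allocation error."""
    batch_size = start
    while batch_size >= minimum:
        try:
            run_step(batch_size)
            return batch_size
        except RuntimeError as err:
            # Only treat the allocation failure from the log as retryable.
            if "Fail to alloc memory" not in str(err):
                raise
            batch_size //= 2
    raise RuntimeError(f"no batch size >= {minimum} fits in device memory")


def fake_step(batch_size: int) -> None:
    """Hypothetical stand-in for one training step; the threshold is invented."""
    if batch_size > 16:
        raise RuntimeError("Fail to alloc memory, size: 24576512")


chosen = largest_fitting_batch_size(fake_step, start=64)
print(f"largest batch size that ran: {chosen}")
```

In a real script, `run_step` would build the dataset with the given batch size and call the existing `train_step` once. If the learning rate was tuned for the original batch size, it may need adjusting after the reduction.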