#1135 Training job reports memory allocation failure

Closed
created 7 months ago by lihaozhen · 1 comment
<!-- To help us identify and resolve your issue more effectively, please provide as much of the following information as possible -->

### Problem description

A training job on the Zhisuan (智算) platform reports that it cannot allocate memory.

### Environment (GPU/NPU)

NPU

### Cluster (Qizhi 启智 / Zhisuan 智算)

Zhisuan (智算)

### Task type (debug/training/inference)

Training

### Task name

lihao202309262030140

### Log description or screenshot

![image](/attachments/bd8547b1-8181-4191-8f13-160468fc634e)

    Traceback (most recent call last):
      File "/cache/code/ant1014/Main.py", line 111, in <module>
        loss = train_step(data)
      File "/cache/code/ant1014/Main.py", line 100, in train_step
        loss_1, grads = grad_fn(data_1)
      File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/common/api.py", line 594, in staging_specialize
        out = _MindsporeFunctionExecutor(func, hash_obj, input_signature, process_obj, jit_config)(*args)
      File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/common/api.py", line 98, in wrapper
        results = fn(*arg, **kwargs)
      File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/common/api.py", line 409, in __call__
        output = self._graph_executor(tuple(new_inputs), phase)
    RuntimeError: Fail to alloc memory, size: 24576512, memory statistics:
    Device HBM memory size: 32768M
    MindSpore Used memory size: 30686M
    MindSpore memory base address: 0x120800000000
    Total Static Memory size: 30624M
    Total Dynamic memory size: 45M
    Dynamic memory size of this graph: 45M

    ----------------------------------------------------
    - C++ Call Stack: (For framework developers)
    ----------------------------------------------------
    mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_memory_manager.cc:52 MallocMemFromMemPool

    INFO:root:List OBS time cost: 0.16 seconds.
    download code successfully
    unzip code successfully
    INFO:root:Copy parallel total time cost: 0.25 seconds.
    upload model successfully
    download system code successfully

### Expected solution or suggestion

I would appreciate guidance on how to resolve this problem.
liuzx commented 7 months ago
Collaborator
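This looks like it is caused by insufficient memory. You could try measures such as reducing batch_size, the size of your arrays, or the number of threads.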
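To judge which of these measures is likely to help, it can be useful to break down the memory statistics printed in the error. The following is a minimal sketch, not part of MindSpore: `parse_memory_stats` is a hypothetical helper, the text is copied from the RuntimeError in this issue, and it assumes the `M` suffix means MiB and that `size:` in the first line is in bytes.

```python
import re

# Memory statistics copied from the RuntimeError reported in this issue.
ERROR_TEXT = """
Fail to alloc memory, size: 24576512, memory statistics:
Device HBM memory size: 32768M
MindSpore Used memory size: 30686M
MindSpore memory base address: 0x120800000000
Total Static Memory size: 30624M
Total Dynamic memory size: 45M
Dynamic memory size of this graph: 45M
"""


def parse_memory_stats(text):
    """Hypothetical helper: return (failed request size, {label: value in M}).

    Only lines of the form '<label>: <N>M' are collected, so the base
    address line and the first line of the message are skipped.
    """
    request = int(re.search(r"size: (\d+),", text).group(1))
    stats = {
        label.strip(): int(value)
        for label, value in re.findall(r"^(.+?): (\d+)M$", text, flags=re.M)
    }
    return request, stats


request_bytes, stats = parse_memory_stats(ERROR_TEXT)
used = stats["MindSpore Used memory size"]
static = stats["Total Static Memory size"]
dynamic = stats["Total Dynamic memory size"]

# Assumption: the failed request size is in bytes, so convert to MiB.
request_mib = request_bytes / 2**20
static_share = static / used

print(f"failed request: {request_mib:.1f} MiB")
print(f"static share of used memory: {static_share:.1%}")
print(f"dynamic memory: {dynamic} M")
```

On these numbers the failed request is about 23.4 MiB, dynamic memory is only 45M, and the static pool accounts for roughly 99.8% of the memory MindSpore has in use. So alongside a smaller batch_size, it may be worth checking what is being held in static memory. This is a suggestion about where to look, not a confirmed diagnosis.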
liuzx closed this issue 6 months ago