#1138 Using MindSpore in a training task on the OpenI (启智) platform produces a strange memory-copy error, and the error changes when I change the training batch size

Closed
created 7 months ago by relic · 1 comment
relic commented 7 months ago
<!-- To help us identify and resolve your issue more effectively, please provide as much of the following information as possible -->

### Problem description

Using MindSpore in a training task on the OpenI (启智) platform, calling model.train with the number of epochs, an iterable dataset, and a callback produces a strange memory-copy error. When I change the batch size used for training, the error changes with it. The same problem has also appeared in debugging tasks, and after a long time spent debugging I have not been able to resolve it.

### Environment (GPU/NPU)

NPU

### Cluster (启智/智算)

智算

### Task type (debugging/training/inference)

Training

### Task name

relic202309272223433

### Log description or screenshot

Traceback (most recent call last):
  File "/cache/code/tset/JTtest_mindspore_api.py", line 185, in <module>
    model.train(epoch=epoches,train_dataset=train_dataset,callbacks=[loss_callback],dataset_sink_mode=False)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/train/model.py", line 1051, in train
    initial_epoch=initial_epoch)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/train/model.py", line 98, in wrapper
    func(self, *args, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/train/model.py", line 618, in _train
    self._train_process(epoch, train_dataset, list_callback, cb_params, initial_epoch, valid_infos)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/train/model.py", line 909, in _train_process
    outputs = self._train_network(*next_element)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/nn/cell.py", line 619, in __call__
    out = self.compile_and_run(*args)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/nn/cell.py", line 1005, in compile_and_run
    self.compile(*inputs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/nn/cell.py", line 977, in compile
    jit_config_dict=self._jit_config_dict)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/common/api.py", line 1131, in compile
    result = self._graph_executor.compile(obj, args_list, phase, self._use_vm_mode())
RuntimeError: memcpy_s error, errorno 34, source size 3211264000dest size802816000

----------------------------------------------------
- The Traceback of Net Construct Code:
----------------------------------------------------
# In file /home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/ops/_grad/grad_math_ops.py:532
dx = mul_func(fill_func(dtype(temp), shape_op(x), 2.0), temp)
^

----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/core/ir/pattern_matcher.h:782 CalcConstantTensors

time="2023-09-27T23:13:58+08:00" level=info msg="clean up child process succeed, pid=592, wstatus=0, exit_status=0" file="cleaner_unix.go:75" Command=bootstrap/run Component=ma-training-toolkit Platform=ModelArts-Service
INFO:root:List OBS time cost: 0.13 seconds.

The attachment also includes a screenshot from a debugging task: when I ran the same code with a smaller batch size, an error appeared telling me to contact a support engineer.

### Expected solution or suggestion

I would appreciate any suggestions or a solution so that training can run normally.
liuzx commented 7 months ago
Collaborator
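The log reports RuntimeError: memcpy_s error, errorno 34, source size 3211264000dest size802816000, so the error information is already given. This resource specification has 32 GB of memory; the job exceeded that memory limit, and the code needs to be optimized.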
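As a rough sanity check on the two sizes in the error message, the arithmetic can be sketched as follows. This assumes both numbers are byte counts (the log does not state the unit), and `describe_sizes` is a hypothetical helper written for this illustration, not a MindSpore API.

```python
# Sizes copied verbatim from the RuntimeError in the log.
# Assumption: both values are byte counts; the log does not state the unit.
SOURCE_SIZE = 3_211_264_000
DEST_SIZE = 802_816_000

GIB = 1024 ** 3  # bytes per gibibyte


def describe_sizes(source: int, dest: int) -> dict:
    """Hypothetical helper: compare the two buffer sizes from the error."""
    ratio, remainder = divmod(source, dest)
    return {
        "source_gib": source / GIB,
        "dest_gib": dest / GIB,
        "ratio": ratio,
        "exact_multiple": remainder == 0,
    }


info = describe_sizes(SOURCE_SIZE, DEST_SIZE)
print(f"source: {info['source_gib']:.2f} GiB")
print(f"dest:   {info['dest_gib']:.2f} GiB")
print(f"source / dest = {info['ratio']} (exact: {info['exact_multiple']})")
```

Under that assumption the source buffer is a little under 3 GiB and is exactly four times the destination size. A factor of four would be consistent with 4-byte (float32) elements, and a size that scales with the input shape would match the report that the error changes with the batch size, but the log itself confirms neither point.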
liuzx closed this issue 6 months ago