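Problem description
I am using MindSpore in a training task on the OpenI platform. Calling model.train with the number of epochs, an iterable dataset, and a callback fails with a strange memory-copy error. When I change the training batch size, the error changes with it. The same problem has also appeared in debug tasks, and after a long time debugging I have not been able to resolve it.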
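Environment (GPU/NPU)
NPU
Cluster (OpenI/Intelligent Computing)
Intelligent Computing
Task type (debug/training/inference)
Training
Task name
relic202309272223433
Log description or screenshot of the problem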
Traceback (most recent call last):
File "/cache/code/tset/JTtest_mindspore_api.py", line 185, in <module>
model.train(epoch=epoches,train_dataset=train_dataset,callbacks=[loss_callback],dataset_sink_mode=False)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/train/model.py", line 1051, in train
initial_epoch=initial_epoch)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/train/model.py", line 98, in wrapper
func(self, *args, **kwargs)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/train/model.py", line 618, in _train
self._train_process(epoch, train_dataset, list_callback, cb_params, initial_epoch, valid_infos)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/train/model.py", line 909, in _train_process
outputs = self._train_network(*next_element)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/nn/cell.py", line 619, in __call__
out = self.compile_and_run(*args)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/nn/cell.py", line 1005, in compile_and_run
self.compile(*inputs)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/nn/cell.py", line 977, in compile
jit_config_dict=self._jit_config_dict)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/common/api.py", line 1131, in compile
result = self._graph_executor.compile(obj, args_list, phase, self._use_vm_mode())
RuntimeError: memcpy_s error, errorno 34, source size 3211264000dest size802816000
In file /home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/ops/_grad/grad_math_ops.py:532
mindspore/core/ir/pattern_matcher.h:782 CalcConstantTensors
time="2023-09-27T23:13:58+08:00" level=info msg="clean up child process succeed, pid=592, wstatus=0, exit_status=0" file="cleaner_unix.go:75" Command=bootstrap/run Component=ma-training-toolkit Platform=ModelArts-Service
INFO:root:List OBS time cost: 0.13 seconds.
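The attached file includes a screenshot from a debug task: when I ran the same code with a smaller batch size, an error appeared telling me to contact a support engineer.
Expected solution or suggestions
I would appreciate any suggestions or a solution so that training can run normally.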
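The log reports RuntimeError: memcpy_s error, errorno 34, source size 3211264000dest size802816000, which already states the problem. This resource specification has 32 GB of memory and the job exceeded that limit, so the code needs to be optimized.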
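To make the two sizes in that error line easier to compare, here is a minimal Python sketch that parses them and converts them to GiB. The helper name parse_memcpy_sizes is hypothetical and not part of MindSpore. The only input is the error line copied from the log above.

```python
import re

# Error line copied verbatim from the training log.
LOG_LINE = (
    "RuntimeError: memcpy_s error, errorno 34, "
    "source size 3211264000dest size802816000"
)


def parse_memcpy_sizes(line):
    """Return (source_bytes, dest_bytes) from a memcpy_s error line.

    Hypothetical helper for illustration only; it is not a MindSpore API.
    """
    match = re.search(r"source size\s*(\d+)\s*dest size\s*(\d+)", line)
    if match is None:
        raise ValueError("no memcpy_s sizes found in line")
    return int(match.group(1)), int(match.group(2))


source_bytes, dest_bytes = parse_memcpy_sizes(LOG_LINE)

GIB = 2 ** 30  # bytes per GiB
print(f"source: {source_bytes / GIB:.2f} GiB")
print(f"dest:   {dest_bytes / GIB:.2f} GiB")
print(f"source / dest = {source_bytes / dest_bytes:.1f}")
```

The source buffer is exactly four times the size of the destination buffer. The log alone does not say which tensor is being copied, so treat this only as a starting point for checking which shapes in the network scale with the batch size.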