HoratioJSY
  • Joined on Nov 24, 2021

HoratioJSY closed issue PCL-Platform.Inte.../PanGu-Alpha#23

Error when running custom training on Ascend chips

2 years ago

HoratioJSY commented on issue PCL-Platform.Inte.../PanGu-Alpha#23

Error when running custom training on Ascend chips

Okay, the API is aligned now, and it more or less runs.

2 years ago

HoratioJSY commented on issue PCL-Platform.Inte.../PanGu-Alpha#23

Error when running custom training on Ascend chips

The image should be Ascend-Powered-Engine 1.0, which ships MindSpore 1.3.0. Setting GLog=0 does print more detailed information, but it still doesn't reveal where the problem is; when the "training job" is submitted on ModelArts, the output from all nodes seems to be printed together. I'll see whether I can run it from the command line to get more of the log.
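For reference, this is roughly how I raise the verbosity in the entry script before MindSpore is imported. It is only a minimal sketch: I'm assuming the documented MindSpore `GLOG_*` environment variables here, and the log directory is just a placeholder.

```python
import os

# Assumed GLOG levels for MindSpore: 0 = DEBUG, 1 = INFO, 2 = WARNING, 3 = ERROR.
# The variables must be set before the first `import mindspore`.
os.environ["GLOG_v"] = "0"                  # most verbose
os.environ["GLOG_logtostderr"] = "0"        # write to files instead of stderr
os.environ["GLOG_log_dir"] = "/cache/glog"  # placeholder path, adjust as needed

import mindspore  # noqa: E402  (imported only after the env vars are in place)
print(mindspore.__version__)
```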

2 years ago

HoratioJSY commented on issue PCL-Platform.Inte.../PanGu-Alpha#23

Error when running custom training on Ascend chips

It seems I can't upload the log file, so I've pasted it below.

```
/home/work/user-job-dir
INFO:root:Using MoXing-v2.0.0.rc2.4b57a67b-4b57a67b
INFO:root:Using OBS-Python-SDK-3.20.9.1
[Modelarts Service Log]2021-11-26 12:08:20,076 - INFO - background upload stdout log to s3://aix-test/CodePanGu/trainlog/jobf43f4139-job-aix-train-test-0.log
[Modelarts Service Log]2021-11-26 12:08:20,085 - INFO - Ascend Driver: Version=21.0.2
[Modelarts Service Log]2021-11-26 12:08:20,086 - INFO - you are advised to use ASCEND_DEVICE_ID env instead of DEVICE_ID, as the DEVICE_ID env will be discarded in later versions
[Modelarts Service Log]2021-11-26 12:08:20,086 - INFO - particularly, ${ASCEND_DEVICE_ID} == ${DEVICE_ID}, it's the logical device id
[Modelarts Service Log]2021-11-26 12:08:20,086 - INFO - Davinci training command
[Modelarts Service Log]2021-11-26 12:08:20,086 - INFO - ['/usr/bin/python', '/home/work/user-job-dir/code/run_code_pangu_train.py', '--data_url=s3://aix-test/CodePanGu/data/', '--train_url=s3://aix-test/CodePanGu/model/V0017/']
[Modelarts Service Log]2021-11-26 12:08:20,086 - INFO - Wait for Rank table file ready
[Modelarts Service Log]2021-11-26 12:08:20,086 - INFO - Rank table file (K8S generated) is ready for read
[Modelarts Service Log]2021-11-26 12:08:20,087 - INFO - { "status": "completed", "group_count": "1", "group_list": [ { "group_name": "job-aix-train-test", "device_count": "8", "instance_count": "1", "instance_list": [ { "pod_name": "jobf43f4139-job-aix-train-test-0", "server_id": "192.168.27.42", "devices": [ { "device_id": "0", "device_ip": "192.1.99.66" }, { "device_id": "1", "device_ip": "192.2.41.45" }, { "device_id": "2", "device_ip": "192.3.174.48" }, { "device_id": "3", "device_ip": "192.4.84.85" }, { "device_id": "4", "device_ip": "192.1.149.53" }, { "device_id": "5", "device_ip": "192.2.234.20" }, { "device_id": "6", "device_ip": "192.3.16.143" }, { "device_id": "7", "device_ip": "192.4.39.162" } ] } ] } ] }
[Modelarts Service Log]2021-11-26 12:08:20,087 - INFO - Rank table file (C7x)
[Modelarts Service Log]2021-11-26 12:08:20,087 - INFO - { "status": "completed", "version": "1.0", "server_count": "1", "server_list": [ { "server_id": "192.168.27.42", "device": [ { "device_id": "0", "device_ip": "192.1.99.66", "rank_id": "0" }, { "device_id": "1", "device_ip": "192.2.41.45", "rank_id": "1" }, { "device_id": "2", "device_ip": "192.3.174.48", "rank_id": "2" }, { "device_id": "3", "device_ip": "192.4.84.85", "rank_id": "3" }, { "device_id": "4", "device_ip": "192.1.149.53", "rank_id": "4" }, { "device_id": "5", "device_ip": "192.2.234.20", "rank_id": "5" }, { "device_id": "6", "device_ip": "192.3.16.143", "rank_id": "6" }, { "device_id": "7", "device_ip": "192.4.39.162", "rank_id": "7" } ] } ] }
[Modelarts Service Log]2021-11-26 12:08:20,088 - INFO - Rank table file (C7x) is generated
[Modelarts Service Log]2021-11-26 12:08:20,088 - INFO - Current server
[Modelarts Service Log]2021-11-26 12:08:20,088 - INFO - { "server_id": "192.168.27.42", "device": [ { "device_id": "0", "device_ip": "192.1.99.66", "rank_id": "0" }, { "device_id": "1", "device_ip": "192.2.41.45", "rank_id": "1" }, { "device_id": "2", "device_ip": "192.3.174.48", "rank_id": "2" }, { "device_id": "3", "device_ip": "192.4.84.85", "rank_id": "3" }, { "device_id": "4", "device_ip": "192.1.149.53", "rank_id": "4" }, { "device_id": "5", "device_ip": "192.2.234.20", "rank_id": "5" }, { "device_id": "6", "device_ip": "192.3.16.143", "rank_id": "6" }, { "device_id": "7", "device_ip": "192.4.39.162", "rank_id": "7" } ] }
[Modelarts Service Log]2021-11-26 12:08:20,089 - INFO - bootstrap proc-rank-0-device-0
[Modelarts Service Log]2021-11-26 12:08:20,098 - INFO - proc-rank-0-device-0 (pid: 107)
tee: /var/log/batch-task/jobf43f4139/job-aix-train-test/jobf43f4139-proc-rank-0-device-0.txt: Permission denied
[Modelarts Service Log]2021-11-26 12:08:20,106 - INFO - bootstrap proc-rank-1-device-1
[Modelarts Service Log]2021-11-26 12:08:20,115 - INFO - proc-rank-1-device-1 (pid: 109)
tee: /var/log/batch-task/jobf43f4139/job-aix-train-test/jobf43f4139-proc-rank-1-device-1.txt: Permission denied
[Modelarts Service Log]2021-11-26 12:08:20,121 - INFO - bootstrap proc-rank-2-device-2
[Modelarts Service Log]2021-11-26 12:08:20,130 - INFO - proc-rank-2-device-2 (pid: 111)
[Modelarts Service Log]2021-11-26 12:08:20,136 - INFO - bootstrap proc-rank-3-device-3
tee: /var/log/batch-task/jobf43f4139/job-aix-train-test/jobf43f4139-proc-rank-2-device-2.txt: Permission denied
[Modelarts Service Log]2021-11-26 12:08:20,141 - INFO - proc-rank-3-device-3 (pid: 113)
[Modelarts Service Log]2021-11-26 12:08:20,147 - INFO - bootstrap proc-rank-4-device-4
tee: /var/log/batch-task/jobf43f4139/job-aix-train-test/jobf43f4139-proc-rank-3-device-3.txt: Permission denied
[Modelarts Service Log]2021-11-26 12:08:20,152 - INFO - proc-rank-4-device-4 (pid: 115)
[Modelarts Service Log]2021-11-26 12:08:20,156 - INFO - bootstrap proc-rank-5-device-5
tee: /var/log/batch-task/jobf43f4139/job-aix-train-test/jobf43f4139-proc-rank-4-device-4.txt: Permission denied
[Modelarts Service Log]2021-11-26 12:08:20,161 - INFO - proc-rank-5-device-5 (pid: 117)
[Modelarts Service Log]2021-11-26 12:08:20,164 - INFO - bootstrap proc-rank-6-device-6
tee: /var/log/batch-task/jobf43f4139/job-aix-train-test/jobf43f4139-proc-rank-5-device-5.txt: Permission denied
[Modelarts Service Log]2021-11-26 12:08:20,169 - INFO - proc-rank-6-device-6 (pid: 119)
[Modelarts Service Log]2021-11-26 12:08:20,172 - INFO - bootstrap proc-rank-7-device-7
tee: /var/log/batch-task/jobf43f4139/job-aix-train-test/jobf43f4139-proc-rank-6-device-6.txt: Permission denied
[Modelarts Service Log]2021-11-26 12:08:20,177 - INFO - proc-rank-7-device-7 (pid: 121)
tee: /var/log/batch-task/jobf43f4139/job-aix-train-test/jobf43f4139-proc-rank-7-device-7.txt: Permission denied
2021-11-26 12:08:25,781 - CodePanGu - INFO - Training model with no_pipeline mode:
2021-11-26 12:08:25,781 - CodePanGu - INFO - - local_rank:2, device id:2 start to run...
2021-11-26 12:08:25,781 - CodePanGu - INFO - Training model with no_pipeline mode:
2021-11-26 12:08:25,781 - CodePanGu - INFO - - local_rank:7, device id:7 start to run...
2021-11-26 12:08:25,781 - CodePanGu - INFO - Training model with no_pipeline mode:
2021-11-26 12:08:25,781 - CodePanGu - INFO - - local_rank:5, device id:5 start to run...
2021-11-26 12:08:25,782 - CodePanGu - INFO - Training model with no_pipeline mode:
2021-11-26 12:08:25,782 - CodePanGu - INFO - - local_rank:1, device id:1 start to run...
2021-11-26 12:08:25,782 - CodePanGu - INFO - Training model with no_pipeline mode:
2021-11-26 12:08:25,782 - CodePanGu - INFO - - local_rank:4, device id:4 start to run...
2021-11-26 12:08:25,782 - CodePanGu - INFO - Training model with no_pipeline mode:
2021-11-26 12:08:25,782 - CodePanGu - INFO - - local_rank:3, device id:3 start to run...
2021-11-26 12:08:25,782 - CodePanGu - INFO - Training model with no_pipeline mode:
2021-11-26 12:08:25,782 - CodePanGu - INFO - Training model with no_pipeline mode:
2021-11-26 12:08:25,783 - CodePanGu - INFO - - local_rank:6, device id:6 start to run...
2021-11-26 12:08:25,783 - CodePanGu - INFO - - local_rank:0, device id:0 start to run...
2021-11-26 12:08:26,810 - CodePanGu - INFO - Distributed Training: device_id is 0, rank_id is 0, device_num is 8
2021-11-26 12:08:26,921 - CodePanGu - INFO - Distributed Training: device_id is 4, rank_id is 4, device_num is 8
2021-11-26 12:08:26,977 - CodePanGu - INFO - Distributed Training: device_id is 3, rank_id is 3, device_num is 8
2021-11-26 12:08:26,980 - CodePanGu - INFO - Distributed Training: device_id is 1, rank_id is 1, device_num is 8
2021-11-26 12:08:27,005 - CodePanGu - INFO - Distributed Training: device_id is 2, rank_id is 2, device_num is 8
2021-11-26 12:08:27,078 - CodePanGu - INFO - Distributed Training: device_id is 5, rank_id is 5, device_num is 8
2021-11-26 12:08:27,088 - CodePanGu - INFO - Distributed Training: device_id is 7, rank_id is 7, device_num is 8
2021-11-26 12:08:27,117 - CodePanGu - INFO - Distributed Training: device_id is 6, rank_id is 6, device_num is 8
INFO:CodePanGu:data has been copied from s3://aix-test/CodePanGu/data/ to data/rank_4
INFO:CodePanGu:===config is: [PANGUALPHAConfig]============================== batch_size:16 seq_length:256 vocab_size:15928 embedding_size:1024 num_layers:3 num_heads:16 expand_ratio:4 post_layernorm_residual:False dropout_rate:0.1 compute_dtype:Float16 use_past:False dp:8 mp:1 self_layernorm:True forward_reduce_scatter:True stage_num:1 micro_size:1 word_emb_dp:True eod_reset:False load_ckpt_path:None ==========
INFO:CodePanGu:data has been copied from s3://aix-test/CodePanGu/data/ to data/rank_0
INFO:CodePanGu:===config is: [PANGUALPHAConfig]============================== batch_size:16 seq_length:256 vocab_size:15928 embedding_size:1024 num_layers:3 num_heads:16 expand_ratio:4 post_layernorm_residual:False dropout_rate:0.1 compute_dtype:Float16 use_past:False dp:8 mp:1 self_layernorm:True forward_reduce_scatter:True stage_num:1 micro_size:1 word_emb_dp:True eod_reset:False load_ckpt_path:None ==========
INFO:CodePanGu:data has been copied from s3://aix-test/CodePanGu/data/ to data/rank_3
INFO:CodePanGu:===config is: [PANGUALPHAConfig]============================== batch_size:16 seq_length:256 vocab_size:15928 embedding_size:1024 num_layers:3 num_heads:16 expand_ratio:4 post_layernorm_residual:False dropout_rate:0.1 compute_dtype:Float16 use_past:False dp:8 mp:1 self_layernorm:True forward_reduce_scatter:True stage_num:1 micro_size:1 word_emb_dp:True eod_reset:False load_ckpt_path:None ==========
INFO:CodePanGu:=====args_opt is: Namespace(data_url='s3://aix-test/CodePanGu/data/', decay_steps=1000, distribute='true', embedding_size=1024, end_lr=6e-06, epoch_size=1, local_data='data', micro_size=1, mode='self_define', num_heads=16, num_layers=3, optimizer='adam', optimizer_shard=0, per_batch_size=2, run_type='train', save_dir='saved_model', save_steps=1000, seq_length=256, sink_size=2, stage_num=1, start_lr=6e-05, tensor_model_parallel_num=1, train_url='s3://aix-test/CodePanGu/model/V0017/', vocab_size=15928, warmup_step=200, weight_decay=0.1)
INFO:CodePanGu:=====args_opt is: Namespace(data_url='s3://aix-test/CodePanGu/data/', decay_steps=1000, distribute='true', embedding_size=1024, end_lr=6e-06, epoch_size=1, local_data='data', micro_size=1, mode='self_define', num_heads=16, num_layers=3, optimizer='adam', optimizer_shard=0, per_batch_size=2, run_type='train', save_dir='saved_model', save_steps=1000, seq_length=256, sink_size=2, stage_num=1, start_lr=6e-05, tensor_model_parallel_num=1, train_url='s3://aix-test/CodePanGu/model/V0017/', vocab_size=15928, warmup_step=200, weight_decay=0.1)
INFO:CodePanGu:['/cache/user-job-dir/workspace/device4/data/rank_4/random_span_train_seq_256.mindrecords', '/cache/user-job-dir/workspace/device4/data/rank_4/random_span_valid_seq_256.mindrecords']
[WARNING] ME(115:281472947853904,MainProcess):2021-11-26-12:08:30.264.457 [mindspore/dataset/engine/datasets.py:3463] WARN: global shuffle is not used.
INFO:CodePanGu:data has been copied from s3://aix-test/CodePanGu/data/ to data/rank_5
INFO:CodePanGu:===config is: [PANGUALPHAConfig]============================== batch_size:16 seq_length:256 vocab_size:15928 embedding_size:1024 num_layers:3 num_heads:16 expand_ratio:4 post_layernorm_residual:False dropout_rate:0.1 compute_dtype:Float16 use_past:False dp:8 mp:1 self_layernorm:True forward_reduce_scatter:True stage_num:1 micro_size:1 word_emb_dp:True eod_reset:False load_ckpt_path:None ==========
before model.train() ready
INFO:CodePanGu:['/cache/user-job-dir/workspace/device0/data/rank_0/random_span_train_seq_256.mindrecords', '/cache/user-job-dir/workspace/device0/data/rank_0/random_span_valid_seq_256.mindrecords']
[WARNING] ME(107:281472880814672,MainProcess):2021-11-26-12:08:30.305.398 [mindspore/dataset/engine/datasets.py:3463] WARN: global shuffle is not used.
INFO:CodePanGu:data has been copied from s3://aix-test/CodePanGu/data/ to data/rank_1
INFO:CodePanGu:===config is: [PANGUALPHAConfig]============================== batch_size:16 seq_length:256 vocab_size:15928 embedding_size:1024 num_layers:3 num_heads:16 expand_ratio:4 post_layernorm_residual:False dropout_rate:0.1 compute_dtype:Float16 use_past:False dp:8 mp:1 self_layernorm:True forward_reduce_scatter:True stage_num:1 micro_size:1 word_emb_dp:True eod_reset:False load_ckpt_path:None ==========
before model.train() ready
INFO:CodePanGu:data has been copied from s3://aix-test/CodePanGu/data/ to data/rank_6
INFO:CodePanGu:===config is: [PANGUALPHAConfig]============================== batch_size:16 seq_length:256 vocab_size:15928 embedding_size:1024 num_layers:3 num_heads:16 expand_ratio:4 post_layernorm_residual:False dropout_rate:0.1 compute_dtype:Float16 use_past:False dp:8 mp:1 self_layernorm:True forward_reduce_scatter:True stage_num:1 micro_size:1 word_emb_dp:True eod_reset:False load_ckpt_path:None ==========
INFO:CodePanGu:data has been copied from s3://aix-test/CodePanGu/data/ to data/rank_7
INFO:CodePanGu:data has been copied from s3://aix-test/CodePanGu/data/ to data/rank_2
INFO:CodePanGu:===config is: [PANGUALPHAConfig]============================== batch_size:16 seq_length:256 vocab_size:15928 embedding_size:1024 num_layers:3 num_heads:16 expand_ratio:4 post_layernorm_residual:False dropout_rate:0.1 compute_dtype:Float16 use_past:False dp:8 mp:1 self_layernorm:True forward_reduce_scatter:True stage_num:1 micro_size:1 word_emb_dp:True eod_reset:False load_ckpt_path:None ==========
INFO:CodePanGu:=====args_opt is: Namespace(data_url='s3://aix-test/CodePanGu/data/', decay_steps=1000, distribute='true', embedding_size=1024, end_lr=6e-06, epoch_size=1, local_data='data', micro_size=1, mode='self_define', num_heads=16, num_layers=3, optimizer='adam', optimizer_shard=0, per_batch_size=2, run_type='train', save_dir='saved_model', save_steps=1000, seq_length=256, sink_size=2, stage_num=1, start_lr=6e-05, tensor_model_parallel_num=1, train_url='s3://aix-test/CodePanGu/model/V0017/', vocab_size=15928, warmup_step=200, weight_decay=0.1)
INFO:CodePanGu:===config is: [PANGUALPHAConfig]============================== batch_size:16 seq_length:256 vocab_size:15928 embedding_size:1024 num_layers:3 num_heads:16 expand_ratio:4 post_layernorm_residual:False dropout_rate:0.1 compute_dtype:Float16 use_past:False dp:8 mp:1 self_layernorm:True forward_reduce_scatter:True stage_num:1 micro_size:1 word_emb_dp:True eod_reset:False load_ckpt_path:None ==========
INFO:CodePanGu:['/cache/user-job-dir/workspace/device3/data/rank_3/random_span_train_seq_256.mindrecords', '/cache/user-job-dir/workspace/device3/data/rank_3/random_span_valid_seq_256.mindrecords']
[WARNING] ME(113:281473367132752,MainProcess):2021-11-26-12:08:30.435.622 [mindspore/dataset/engine/datasets.py:3463] WARN: global shuffle is not used.
before model.train() ready
INFO:CodePanGu:=====args_opt is: Namespace(data_url='s3://aix-test/CodePanGu/data/', decay_steps=1000, distribute='true', embedding_size=1024, end_lr=6e-06, epoch_size=1, local_data='data', micro_size=1, mode='self_define', num_heads=16, num_layers=3, optimizer='adam', optimizer_shard=0, per_batch_size=2, run_type='train', save_dir='saved_model', save_steps=1000, seq_length=256, sink_size=2, stage_num=1, start_lr=6e-05, tensor_model_parallel_num=1, train_url='s3://aix-test/CodePanGu/model/V0017/', vocab_size=15928, warmup_step=200, weight_decay=0.1)
INFO:CodePanGu:['/cache/user-job-dir/workspace/device5/data/rank_5/random_span_train_seq_256.mindrecords', '/cache/user-job-dir/workspace/device5/data/rank_5/random_span_valid_seq_256.mindrecords']
[WARNING] ME(117:281473076857424,MainProcess):2021-11-26-12:08:30.525.404 [mindspore/dataset/engine/datasets.py:3463] WARN: global shuffle is not used.
before model.train() ready
INFO:CodePanGu:=====args_opt is: Namespace(data_url='s3://aix-test/CodePanGu/data/', decay_steps=1000, distribute='true', embedding_size=1024, end_lr=6e-06, epoch_size=1, local_data='data', micro_size=1, mode='self_define', num_heads=16, num_layers=3, optimizer='adam', optimizer_shard=0, per_batch_size=2, run_type='train', save_dir='saved_model', save_steps=1000, seq_length=256, sink_size=2, stage_num=1, start_lr=6e-05, tensor_model_parallel_num=1, train_url='s3://aix-test/CodePanGu/model/V0017/', vocab_size=15928, warmup_step=200, weight_decay=0.1)
INFO:CodePanGu:=====args_opt is: Namespace(data_url='s3://aix-test/CodePanGu/data/', decay_steps=1000, distribute='true', embedding_size=1024, end_lr=6e-06, epoch_size=1, local_data='data', micro_size=1, mode='self_define', num_heads=16, num_layers=3, optimizer='adam', optimizer_shard=0, per_batch_size=2, run_type='train', save_dir='saved_model', save_steps=1000, seq_length=256, sink_size=2, stage_num=1, start_lr=6e-05, tensor_model_parallel_num=1, train_url='s3://aix-test/CodePanGu/model/V0017/', vocab_size=15928, warmup_step=200, weight_decay=0.1)
INFO:CodePanGu:=====args_opt is: Namespace(data_url='s3://aix-test/CodePanGu/data/', decay_steps=1000, distribute='true', embedding_size=1024, end_lr=6e-06, epoch_size=1, local_data='data', micro_size=1, mode='self_define', num_heads=16, num_layers=3, optimizer='adam', optimizer_shard=0, per_batch_size=2, run_type='train', save_dir='saved_model', save_steps=1000, seq_length=256, sink_size=2, stage_num=1, start_lr=6e-05, tensor_model_parallel_num=1, train_url='s3://aix-test/CodePanGu/model/V0017/', vocab_size=15928, warmup_step=200, weight_decay=0.1)
INFO:CodePanGu:=====args_opt is: Namespace(data_url='s3://aix-test/CodePanGu/data/', decay_steps=1000, distribute='true', embedding_size=1024, end_lr=6e-06, epoch_size=1, local_data='data', micro_size=1, mode='self_define', num_heads=16, num_layers=3, optimizer='adam', optimizer_shard=0, per_batch_size=2, run_type='train', save_dir='saved_model', save_steps=1000, seq_length=256, sink_size=2, stage_num=1, start_lr=6e-05, tensor_model_parallel_num=1, train_url='s3://aix-test/CodePanGu/model/V0017/', vocab_size=15928, warmup_step=200, weight_decay=0.1)
INFO:CodePanGu:['/cache/user-job-dir/workspace/device7/data/rank_7/random_span_train_seq_256.mindrecords', '/cache/user-job-dir/workspace/device7/data/rank_7/random_span_valid_seq_256.mindrecords']
INFO:CodePanGu:['/cache/user-job-dir/workspace/device1/data/rank_1/random_span_train_seq_256.mindrecords', '/cache/user-job-dir/workspace/device1/data/rank_1/random_span_valid_seq_256.mindrecords']
[WARNING] ME(121:281473738144336,MainProcess):2021-11-26-12:08:30.642.848 [mindspore/dataset/engine/datasets.py:3463] WARN: global shuffle is not used.
[WARNING] ME(109:281473274567248,MainProcess):2021-11-26-12:08:30.643.468 [mindspore/dataset/engine/datasets.py:3463] WARN: global shuffle is not used.
INFO:CodePanGu:['/cache/user-job-dir/workspace/device2/data/rank_2/random_span_train_seq_256.mindrecords', '/cache/user-job-dir/workspace/device2/data/rank_2/random_span_valid_seq_256.mindrecords']
INFO:CodePanGu:['/cache/user-job-dir/workspace/device6/data/rank_6/random_span_train_seq_256.mindrecords', '/cache/user-job-dir/workspace/device6/data/rank_6/random_span_valid_seq_256.mindrecords']
[WARNING] ME(111:281473143364176,MainProcess):2021-11-26-12:08:30.649.122 [mindspore/dataset/engine/datasets.py:3463] WARN: global shuffle is not used.
[WARNING] ME(119:281473708223056,MainProcess):2021-11-26-12:08:30.649.345 [mindspore/dataset/engine/datasets.py:3463] WARN: global shuffle is not used.
before model.train() ready
before model.train() ready
before model.train() ready
before model.train() ready
[Modelarts Service Log]2021-11-26 12:08:42,201 - ERROR - proc-rank-0-device-0 (pid: 107) has exited with non-zero code: -11
[Modelarts Service Log]2021-11-26 12:08:42,201 - INFO - Begin destroy training processes
[Modelarts Service Log]2021-11-26 12:08:42,201 - INFO - proc-rank-7-device-7 (pid: 121) has exited
[Modelarts Service Log]2021-11-26 12:08:42,202 - INFO - proc-rank-6-device-6 (pid: 119) has exited
[Modelarts Service Log]2021-11-26 12:08:42,202 - INFO - proc-rank-5-device-5 (pid: 117) has exited
[Modelarts Service Log]2021-11-26 12:08:42,202 - INFO - proc-rank-4-device-4 (pid: 115) has exited
[Modelarts Service Log]2021-11-26 12:08:42,203 - INFO - proc-rank-3-device-3 (pid: 113) has exited
[Modelarts Service Log]2021-11-26 12:08:42,203 - INFO - proc-rank-2-device-2 (pid: 111) has exited
[Modelarts Service Log]2021-11-26 12:08:42,203 - INFO - proc-rank-1-device-1 (pid: 109) has exited
[Modelarts Service Log]2021-11-26 12:08:42,203 - INFO - proc-rank-0-device-0 (pid: 107) has exited
[Modelarts Service Log]2021-11-26 12:08:42,203 - INFO - End destroy training processes
```
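For context, the per-device setup in my script is roughly the following. This is only a simplified sketch of the data-parallel initialization implied by the dp:8 / mp:1 config in the log above; the variable names are illustrative and it is not the exact code from run_code_pangu_train.py.

```python
import os
from mindspore import context
from mindspore.context import ParallelMode
from mindspore.communication.management import init, get_rank, get_group_size

# ModelArts exports DEVICE_ID (and RANK_TABLE_FILE) for every process it bootstraps.
device_id = int(os.getenv("DEVICE_ID", "0"))

# Graph mode on Ascend, one process per NPU.
context.set_context(mode=context.GRAPH_MODE,
                    device_target="Ascend",
                    device_id=device_id)

# HCCL initialization; this consumes the rank table file shown in the log above.
init()

# dp:8 / mp:1 amounts to pure data parallelism across the 8 devices.
context.set_auto_parallel_context(parallel_mode=ParallelMode.SEMI_AUTO_PARALLEL,
                                  device_num=get_group_size(),
                                  gradients_mean=True,
                                  full_batch=False)

rank_id = get_rank()
print(f"rank {rank_id} initialized on device {device_id}")
```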

2 years ago

HoratioJSY opened issue PCL-Platform.Inte.../PanGu-Alpha#23

Error when running custom training on Ascend chips

2 years ago

HoratioJSY commented on issue PCL-Platform.Inte.../PanGu-Alpha#7

Running inference on a single NVIDIA V100 card, AttributeError: 'Dropout' object has no attribute 'dropout_gen_mask'.

Hi, I hit the same error when running on Ascend: 'Dropout' object has no attribute 'dropout_gen_mask'.

```
INFO:Training model with standard mode:
INFO: - local_rank:0, device id:0 start to run...
INFO:Distributed Training: device_id is 0, rank_id is 0, device_num is 1
INFO:===config is: [PANGUALPHAConfig]============================== batch_size:4 seq_length:256 vocab_size:15928 embedding_size:512 num_layers:6 num_heads:8 expand_ratio:4 post_layernorm_residual:False dropout_rate:0.1 compute_dtype:Float16 use_past:False dp:0 mp:4 self_layernorm:True forward_reduce_scatter:True stage_num:1 micro_size:16 word_emb_dp:True eod_reset:False load_ckpt_path:None ==========
[WARNING] ME(850:281473686031968,MainProcess):2021-11-24-06:18:15.269.472 [mindspore/common/_decorator.py:33] 'GatherV2' is deprecated from version 1.1 and will be removed in a future version, use 'Gather' instead.
[WARNING] ME(850:281473686031968,MainProcess):2021-11-24-06:18:15.276.926 [mindspore/common/_decorator.py:33] 'TensorAdd' is deprecated from version 1.1 and will be removed in a future version, use 'Add' instead.
[WARNING] ME(850:281473686031968,MainProcess):2021-11-24-06:18:15.278.269 [mindspore/common/_decorator.py:33] 'TensorAdd' is deprecated from version 1.1 and will be removed in a future version, use 'Add' instead.
[WARNING] ME(850:281473686031968,MainProcess):2021-11-24-06:18:15.281.198 [mindspore/common/_decorator.py:33] 'TensorAdd' is deprecated from version 1.1 and will be removed in a future version, use 'Add' instead.
[WARNING] ME(850:281473686031968,MainProcess):2021-11-24-06:18:15.282.496 [mindspore/common/_decorator.py:33] 'TensorAdd' is deprecated from version 1.1 and will be removed in a future version, use 'Add' instead.
[WARNING] ME(850:281473686031968,MainProcess):2021-11-24-06:18:15.287.459 [mindspore/common/_decorator.py:33] 'TensorAdd' is deprecated from version 1.1 and will be removed in a future version, use 'Add' instead.
[WARNING] ME(850:281473686031968,MainProcess):2021-11-24-06:18:15.290.675 [mindspore/common/_decorator.py:33] 'TensorAdd' is deprecated from version 1.1 and will be removed in a future version, use 'Add' instead.
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-7-4e527a9a05d2> in <module>
----> 1 run_train(args_opt)

~/work/pangu_alpha_train.py in run_train(args_opt)
    341         run_train_pipeline(args_opt)
    342     else:
--> 343         run_train_no_pipeline(args_opt)
    344
    345

~/work/pangu_alpha_train.py in run_train_no_pipeline(args_opt)
    271         word_emb_dp=True)
    272     logger.info(f"===config is: {config}")
--> 273     pangu_alpha = PANGUALPHA(config)
    274     loss = CrossEntropyLoss(config)
    275     pangu_alpha_with_loss = PANGUALPHAWithLoss(config, pangu_alpha, loss)

~/work/pangu_alpha.py in __init__(self, config)
    897     def __init__(self, config):
    898         super(PANGUALPHA, self).__init__()
--> 899         self.backbone = PANGUALPHA_Model(config)
    900         self.head = PANGUALPHA_Head(config)
    901

~/work/pangu_alpha.py in __init__(self, config)
    732
    733         for i in range(num_layers):
--> 734             per_block = Block(config, i + 1).set_comm_fusion(int(i / fusion_group_size) + 2)
    735             per_block.recompute()
    736             per_block.attention.dropout.dropout_gen_mask.recompute(False)

~/work/pangu_alpha.py in __init__(self, config, layer_idx)
    458         self.layernorm2.layer_norm.shard(((config.dp, 1, 1), (1,), (1,)))
    459
--> 460         self.attention = Attention(config, scale, layer_idx)
    461         self.output = Output(config, scale)
    462         self.post_layernorm_residual = config.post_layernorm_residual

~/work/pangu_alpha.py in __init__(self, config, scale, layer_idx)
    284         self.use_past = config.use_past
    285         self.dropout = nn.Dropout(1 - config.dropout_rate)
--> 286         self.dropout.dropout_gen_mask.shard(((config.dp, 1, 1),))
    287         self.dropout.dropout_do_mask.shard(((config.dp, 1, 1),))
    288         self.prob_dropout = nn.Dropout(1 - config.dropout_rate)

~/miniconda3/envs/MindSpore-python3.7-aarch64/lib/python3.7/site-packages/mindspore/nn/cell.py in __getattr__(self, name)
    286             para_list = ParameterTuple(cast_list)
    287             return para_list
--> 288         raise AttributeError("'{}' object has no attribute '{}'.".format(type(self).__name__, name))
    289
    290     def __del__(self):

AttributeError: 'Dropout' object has no attribute 'dropout_gen_mask'.
```
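It looks like the internals of `nn.Dropout` changed between MindSpore versions, so the direct access to `dropout_gen_mask` in pangu_alpha.py fails on newer releases. A possible version-tolerant workaround is sketched below; I haven't verified it against this repo, the helper name is made up, and the assumption that newer versions expose a single `dropout` primitive should be checked against your installed MindSpore.

```python
import mindspore.nn as nn


def shard_dropout(dropout: nn.Dropout, dp: int) -> None:
    """Apply the (dp, 1, 1) shard strategy used in pangu_alpha.py, but only to
    the primitives that the installed nn.Dropout actually exposes."""
    strategy = ((dp, 1, 1),)
    if hasattr(dropout, "dropout_gen_mask"):
        # Older MindSpore: nn.Dropout wraps DropoutGenMask / DropoutDoMask.
        dropout.dropout_gen_mask.shard(strategy)
        dropout.dropout_do_mask.shard(strategy)
    elif hasattr(dropout, "dropout"):
        # Newer MindSpore (assumption): a single Dropout primitive instead.
        dropout.dropout.shard(strategy)


# Inside Attention.__init__ the two failing lines would then become roughly:
#     shard_dropout(self.dropout, config.dp)
```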

2 years ago