HoratioJSY
commented on issue PCL-Platform.Inte.../PanGu-Alpha#23
Custom training fails with an error on Ascend chips
It looks like I can't upload the log file, so I've pasted it below.
```
/home/work/user-job-dir
INFO:root:Using MoXing-v2.0.0.rc2.4b57a67b-4b57a67b
INFO:root:Using OBS-Python-SDK-3.20.9.1
[Modelarts Service Log]2021-11-26 12:08:20,076 - INFO - background upload stdout log to s3://aix-test/CodePanGu/trainlog/jobf43f4139-job-aix-train-test-0.log
[Modelarts Service Log]2021-11-26 12:08:20,085 - INFO - Ascend Driver: Version=21.0.2
[Modelarts Service Log]2021-11-26 12:08:20,086 - INFO - you are advised to use ASCEND_DEVICE_ID env instead of DEVICE_ID, as the DEVICE_ID env will be discarded in later versions
[Modelarts Service Log]2021-11-26 12:08:20,086 - INFO - particularly, ${ASCEND_DEVICE_ID} == ${DEVICE_ID}, it's the logical device id
[Modelarts Service Log]2021-11-26 12:08:20,086 - INFO - Davinci training command
[Modelarts Service Log]2021-11-26 12:08:20,086 - INFO - ['/usr/bin/python', '/home/work/user-job-dir/code/run_code_pangu_train.py', '--data_url=s3://aix-test/CodePanGu/data/', '--train_url=s3://aix-test/CodePanGu/model/V0017/']
[Modelarts Service Log]2021-11-26 12:08:20,086 - INFO - Wait for Rank table file ready
[Modelarts Service Log]2021-11-26 12:08:20,086 - INFO - Rank table file (K8S generated) is ready for read
[Modelarts Service Log]2021-11-26 12:08:20,087 - INFO -
{
"status": "completed",
"group_count": "1",
"group_list": [
{
"group_name": "job-aix-train-test",
"device_count": "8",
"instance_count": "1",
"instance_list": [
{
"pod_name": "jobf43f4139-job-aix-train-test-0",
"server_id": "192.168.27.42",
"devices": [
{
"device_id": "0",
"device_ip": "192.1.99.66"
},
{
"device_id": "1",
"device_ip": "192.2.41.45"
},
{
"device_id": "2",
"device_ip": "192.3.174.48"
},
{
"device_id": "3",
"device_ip": "192.4.84.85"
},
{
"device_id": "4",
"device_ip": "192.1.149.53"
},
{
"device_id": "5",
"device_ip": "192.2.234.20"
},
{
"device_id": "6",
"device_ip": "192.3.16.143"
},
{
"device_id": "7",
"device_ip": "192.4.39.162"
}
]
}
]
}
]
}
[Modelarts Service Log]2021-11-26 12:08:20,087 - INFO - Rank table file (C7x)
[Modelarts Service Log]2021-11-26 12:08:20,087 - INFO -
{
"status": "completed",
"version": "1.0",
"server_count": "1",
"server_list": [
{
"server_id": "192.168.27.42",
"device": [
{
"device_id": "0",
"device_ip": "192.1.99.66",
"rank_id": "0"
},
{
"device_id": "1",
"device_ip": "192.2.41.45",
"rank_id": "1"
},
{
"device_id": "2",
"device_ip": "192.3.174.48",
"rank_id": "2"
},
{
"device_id": "3",
"device_ip": "192.4.84.85",
"rank_id": "3"
},
{
"device_id": "4",
"device_ip": "192.1.149.53",
"rank_id": "4"
},
{
"device_id": "5",
"device_ip": "192.2.234.20",
"rank_id": "5"
},
{
"device_id": "6",
"device_ip": "192.3.16.143",
"rank_id": "6"
},
{
"device_id": "7",
"device_ip": "192.4.39.162",
"rank_id": "7"
}
]
}
]
}
[Modelarts Service Log]2021-11-26 12:08:20,088 - INFO - Rank table file (C7x) is generated
[Modelarts Service Log]2021-11-26 12:08:20,088 - INFO - Current server
[Modelarts Service Log]2021-11-26 12:08:20,088 - INFO -
{
"server_id": "192.168.27.42",
"device": [
{
"device_id": "0",
"device_ip": "192.1.99.66",
"rank_id": "0"
},
{
"device_id": "1",
"device_ip": "192.2.41.45",
"rank_id": "1"
},
{
"device_id": "2",
"device_ip": "192.3.174.48",
"rank_id": "2"
},
{
"device_id": "3",
"device_ip": "192.4.84.85",
"rank_id": "3"
},
{
"device_id": "4",
"device_ip": "192.1.149.53",
"rank_id": "4"
},
{
"device_id": "5",
"device_ip": "192.2.234.20",
"rank_id": "5"
},
{
"device_id": "6",
"device_ip": "192.3.16.143",
"rank_id": "6"
},
{
"device_id": "7",
"device_ip": "192.4.39.162",
"rank_id": "7"
}
]
}
[Modelarts Service Log]2021-11-26 12:08:20,089 - INFO - bootstrap proc-rank-0-device-0
[Modelarts Service Log]2021-11-26 12:08:20,098 - INFO - proc-rank-0-device-0 (pid: 107)
tee: /var/log/batch-task/jobf43f4139/job-aix-train-test/jobf43f4139-proc-rank-0-device-0.txt: Permission denied
[Modelarts Service Log]2021-11-26 12:08:20,106 - INFO - bootstrap proc-rank-1-device-1
[Modelarts Service Log]2021-11-26 12:08:20,115 - INFO - proc-rank-1-device-1 (pid: 109)
tee: /var/log/batch-task/jobf43f4139/job-aix-train-test/jobf43f4139-proc-rank-1-device-1.txt: Permission denied
[Modelarts Service Log]2021-11-26 12:08:20,121 - INFO - bootstrap proc-rank-2-device-2
[Modelarts Service Log]2021-11-26 12:08:20,130 - INFO - proc-rank-2-device-2 (pid: 111)
[Modelarts Service Log]2021-11-26 12:08:20,136 - INFO - bootstrap proc-rank-3-device-3
tee: /var/log/batch-task/jobf43f4139/job-aix-train-test/jobf43f4139-proc-rank-2-device-2.txt: Permission denied
[Modelarts Service Log]2021-11-26 12:08:20,141 - INFO - proc-rank-3-device-3 (pid: 113)
[Modelarts Service Log]2021-11-26 12:08:20,147 - INFO - bootstrap proc-rank-4-device-4
tee: /var/log/batch-task/jobf43f4139/job-aix-train-test/jobf43f4139-proc-rank-3-device-3.txt: Permission denied
[Modelarts Service Log]2021-11-26 12:08:20,152 - INFO - proc-rank-4-device-4 (pid: 115)
[Modelarts Service Log]2021-11-26 12:08:20,156 - INFO - bootstrap proc-rank-5-device-5
tee: /var/log/batch-task/jobf43f4139/job-aix-train-test/jobf43f4139-proc-rank-4-device-4.txt: Permission denied
[Modelarts Service Log]2021-11-26 12:08:20,161 - INFO - proc-rank-5-device-5 (pid: 117)
[Modelarts Service Log]2021-11-26 12:08:20,164 - INFO - bootstrap proc-rank-6-device-6
tee: /var/log/batch-task/jobf43f4139/job-aix-train-test/jobf43f4139-proc-rank-5-device-5.txt: Permission denied
[Modelarts Service Log]2021-11-26 12:08:20,169 - INFO - proc-rank-6-device-6 (pid: 119)
[Modelarts Service Log]2021-11-26 12:08:20,172 - INFO - bootstrap proc-rank-7-device-7
tee: /var/log/batch-task/jobf43f4139/job-aix-train-test/jobf43f4139-proc-rank-6-device-6.txt: Permission denied
[Modelarts Service Log]2021-11-26 12:08:20,177 - INFO - proc-rank-7-device-7 (pid: 121)
tee: /var/log/batch-task/jobf43f4139/job-aix-train-test/jobf43f4139-proc-rank-7-device-7.txt: Permission denied
2021-11-26 12:08:25,781 - CodePanGu - INFO - Training model with no_pipeline mode:
2021-11-26 12:08:25,781 - CodePanGu - INFO - - local_rank:2, device id:2 start to run...
2021-11-26 12:08:25,781 - CodePanGu - INFO - Training model with no_pipeline mode:
2021-11-26 12:08:25,781 - CodePanGu - INFO - - local_rank:7, device id:7 start to run...
2021-11-26 12:08:25,781 - CodePanGu - INFO - Training model with no_pipeline mode:
2021-11-26 12:08:25,781 - CodePanGu - INFO - - local_rank:5, device id:5 start to run...
2021-11-26 12:08:25,782 - CodePanGu - INFO - Training model with no_pipeline mode:
2021-11-26 12:08:25,782 - CodePanGu - INFO - - local_rank:1, device id:1 start to run...
2021-11-26 12:08:25,782 - CodePanGu - INFO - Training model with no_pipeline mode:
2021-11-26 12:08:25,782 - CodePanGu - INFO - - local_rank:4, device id:4 start to run...
2021-11-26 12:08:25,782 - CodePanGu - INFO - Training model with no_pipeline mode:
2021-11-26 12:08:25,782 - CodePanGu - INFO - - local_rank:3, device id:3 start to run...
2021-11-26 12:08:25,782 - CodePanGu - INFO - Training model with no_pipeline mode:
2021-11-26 12:08:25,782 - CodePanGu - INFO - Training model with no_pipeline mode:
2021-11-26 12:08:25,783 - CodePanGu - INFO - - local_rank:6, device id:6 start to run...
2021-11-26 12:08:25,783 - CodePanGu - INFO - - local_rank:0, device id:0 start to run...
2021-11-26 12:08:26,810 - CodePanGu - INFO - Distributed Training: device_id is 0, rank_id is 0, device_num is 8
2021-11-26 12:08:26,921 - CodePanGu - INFO - Distributed Training: device_id is 4, rank_id is 4, device_num is 8
2021-11-26 12:08:26,977 - CodePanGu - INFO - Distributed Training: device_id is 3, rank_id is 3, device_num is 8
2021-11-26 12:08:26,980 - CodePanGu - INFO - Distributed Training: device_id is 1, rank_id is 1, device_num is 8
2021-11-26 12:08:27,005 - CodePanGu - INFO - Distributed Training: device_id is 2, rank_id is 2, device_num is 8
2021-11-26 12:08:27,078 - CodePanGu - INFO - Distributed Training: device_id is 5, rank_id is 5, device_num is 8
2021-11-26 12:08:27,088 - CodePanGu - INFO - Distributed Training: device_id is 7, rank_id is 7, device_num is 8
2021-11-26 12:08:27,117 - CodePanGu - INFO - Distributed Training: device_id is 6, rank_id is 6, device_num is 8
INFO:CodePanGu:data has been copied from s3://aix-test/CodePanGu/data/ to data/rank_4
INFO:CodePanGu:===config is: [PANGUALPHAConfig]==============================
batch_size:16
seq_length:256
vocab_size:15928
embedding_size:1024
num_layers:3
num_heads:16
expand_ratio:4
post_layernorm_residual:False
dropout_rate:0.1
compute_dtype:Float16
use_past:False
dp:8
mp:1
self_layernorm:True
forward_reduce_scatter:True
stage_num:1
micro_size:1
word_emb_dp:True
eod_reset:False
load_ckpt_path:None
==========
INFO:CodePanGu:data has been copied from s3://aix-test/CodePanGu/data/ to data/rank_0
INFO:CodePanGu:===config is: [PANGUALPHAConfig]==============================
batch_size:16
seq_length:256
vocab_size:15928
embedding_size:1024
num_layers:3
num_heads:16
expand_ratio:4
post_layernorm_residual:False
dropout_rate:0.1
compute_dtype:Float16
use_past:False
dp:8
mp:1
self_layernorm:True
forward_reduce_scatter:True
stage_num:1
micro_size:1
word_emb_dp:True
eod_reset:False
load_ckpt_path:None
==========
INFO:CodePanGu:data has been copied from s3://aix-test/CodePanGu/data/ to data/rank_3
INFO:CodePanGu:===config is: [PANGUALPHAConfig]==============================
batch_size:16
seq_length:256
vocab_size:15928
embedding_size:1024
num_layers:3
num_heads:16
expand_ratio:4
post_layernorm_residual:False
dropout_rate:0.1
compute_dtype:Float16
use_past:False
dp:8
mp:1
self_layernorm:True
forward_reduce_scatter:True
stage_num:1
micro_size:1
word_emb_dp:True
eod_reset:False
load_ckpt_path:None
==========
INFO:CodePanGu:=====args_opt is: Namespace(data_url='s3://aix-test/CodePanGu/data/', decay_steps=1000, distribute='true', embedding_size=1024, end_lr=6e-06, epoch_size=1, local_data='data', micro_size=1, mode='self_define', num_heads=16, num_layers=3, optimizer='adam', optimizer_shard=0, per_batch_size=2, run_type='train', save_dir='saved_model', save_steps=1000, seq_length=256, sink_size=2, stage_num=1, start_lr=6e-05, tensor_model_parallel_num=1, train_url='s3://aix-test/CodePanGu/model/V0017/', vocab_size=15928, warmup_step=200, weight_decay=0.1)
INFO:CodePanGu:=====args_opt is: Namespace(data_url='s3://aix-test/CodePanGu/data/', decay_steps=1000, distribute='true', embedding_size=1024, end_lr=6e-06, epoch_size=1, local_data='data', micro_size=1, mode='self_define', num_heads=16, num_layers=3, optimizer='adam', optimizer_shard=0, per_batch_size=2, run_type='train', save_dir='saved_model', save_steps=1000, seq_length=256, sink_size=2, stage_num=1, start_lr=6e-05, tensor_model_parallel_num=1, train_url='s3://aix-test/CodePanGu/model/V0017/', vocab_size=15928, warmup_step=200, weight_decay=0.1)
INFO:CodePanGu:['/cache/user-job-dir/workspace/device4/data/rank_4/random_span_train_seq_256.mindrecords', '/cache/user-job-dir/workspace/device4/data/rank_4/random_span_valid_seq_256.mindrecords']
[WARNING] ME(115:281472947853904,MainProcess):2021-11-26-12:08:30.264.457 [mindspore/dataset/engine/datasets.py:3463] WARN: global shuffle is not used.
INFO:CodePanGu:data has been copied from s3://aix-test/CodePanGu/data/ to data/rank_5
INFO:CodePanGu:===config is: [PANGUALPHAConfig]==============================
batch_size:16
seq_length:256
vocab_size:15928
embedding_size:1024
num_layers:3
num_heads:16
expand_ratio:4
post_layernorm_residual:False
dropout_rate:0.1
compute_dtype:Float16
use_past:False
dp:8
mp:1
self_layernorm:True
forward_reduce_scatter:True
stage_num:1
micro_size:1
word_emb_dp:True
eod_reset:False
load_ckpt_path:None
==========
before model.train() ready
INFO:CodePanGu:['/cache/user-job-dir/workspace/device0/data/rank_0/random_span_train_seq_256.mindrecords', '/cache/user-job-dir/workspace/device0/data/rank_0/random_span_valid_seq_256.mindrecords']
[WARNING] ME(107:281472880814672,MainProcess):2021-11-26-12:08:30.305.398 [mindspore/dataset/engine/datasets.py:3463] WARN: global shuffle is not used.
INFO:CodePanGu:data has been copied from s3://aix-test/CodePanGu/data/ to data/rank_1
INFO:CodePanGu:===config is: [PANGUALPHAConfig]==============================
batch_size:16
seq_length:256
vocab_size:15928
embedding_size:1024
num_layers:3
num_heads:16
expand_ratio:4
post_layernorm_residual:False
dropout_rate:0.1
compute_dtype:Float16
use_past:False
dp:8
mp:1
self_layernorm:True
forward_reduce_scatter:True
stage_num:1
micro_size:1
word_emb_dp:True
eod_reset:False
load_ckpt_path:None
==========
before model.train() ready
INFO:CodePanGu:data has been copied from s3://aix-test/CodePanGu/data/ to data/rank_6
INFO:CodePanGu:===config is: [PANGUALPHAConfig]==============================
batch_size:16
seq_length:256
vocab_size:15928
embedding_size:1024
num_layers:3
num_heads:16
expand_ratio:4
post_layernorm_residual:False
dropout_rate:0.1
compute_dtype:Float16
use_past:False
dp:8
mp:1
self_layernorm:True
forward_reduce_scatter:True
stage_num:1
micro_size:1
word_emb_dp:True
eod_reset:False
load_ckpt_path:None
==========
INFO:CodePanGu:data has been copied from s3://aix-test/CodePanGu/data/ to data/rank_7
INFO:CodePanGu:data has been copied from s3://aix-test/CodePanGu/data/ to data/rank_2
INFO:CodePanGu:===config is: [PANGUALPHAConfig]==============================
batch_size:16
seq_length:256
vocab_size:15928
embedding_size:1024
num_layers:3
num_heads:16
expand_ratio:4
post_layernorm_residual:False
dropout_rate:0.1
compute_dtype:Float16
use_past:False
dp:8
mp:1
self_layernorm:True
forward_reduce_scatter:True
stage_num:1
micro_size:1
word_emb_dp:True
eod_reset:False
load_ckpt_path:None
==========
INFO:CodePanGu:=====args_opt is: Namespace(data_url='s3://aix-test/CodePanGu/data/', decay_steps=1000, distribute='true', embedding_size=1024, end_lr=6e-06, epoch_size=1, local_data='data', micro_size=1, mode='self_define', num_heads=16, num_layers=3, optimizer='adam', optimizer_shard=0, per_batch_size=2, run_type='train', save_dir='saved_model', save_steps=1000, seq_length=256, sink_size=2, stage_num=1, start_lr=6e-05, tensor_model_parallel_num=1, train_url='s3://aix-test/CodePanGu/model/V0017/', vocab_size=15928, warmup_step=200, weight_decay=0.1)
INFO:CodePanGu:===config is: [PANGUALPHAConfig]==============================
batch_size:16
seq_length:256
vocab_size:15928
embedding_size:1024
num_layers:3
num_heads:16
expand_ratio:4
post_layernorm_residual:False
dropout_rate:0.1
compute_dtype:Float16
use_past:False
dp:8
mp:1
self_layernorm:True
forward_reduce_scatter:True
stage_num:1
micro_size:1
word_emb_dp:True
eod_reset:False
load_ckpt_path:None
==========
INFO:CodePanGu:['/cache/user-job-dir/workspace/device3/data/rank_3/random_span_train_seq_256.mindrecords', '/cache/user-job-dir/workspace/device3/data/rank_3/random_span_valid_seq_256.mindrecords']
[WARNING] ME(113:281473367132752,MainProcess):2021-11-26-12:08:30.435.622 [mindspore/dataset/engine/datasets.py:3463] WARN: global shuffle is not used.
before model.train() ready
INFO:CodePanGu:=====args_opt is: Namespace(data_url='s3://aix-test/CodePanGu/data/', decay_steps=1000, distribute='true', embedding_size=1024, end_lr=6e-06, epoch_size=1, local_data='data', micro_size=1, mode='self_define', num_heads=16, num_layers=3, optimizer='adam', optimizer_shard=0, per_batch_size=2, run_type='train', save_dir='saved_model', save_steps=1000, seq_length=256, sink_size=2, stage_num=1, start_lr=6e-05, tensor_model_parallel_num=1, train_url='s3://aix-test/CodePanGu/model/V0017/', vocab_size=15928, warmup_step=200, weight_decay=0.1)
INFO:CodePanGu:['/cache/user-job-dir/workspace/device5/data/rank_5/random_span_train_seq_256.mindrecords', '/cache/user-job-dir/workspace/device5/data/rank_5/random_span_valid_seq_256.mindrecords']
[WARNING] ME(117:281473076857424,MainProcess):2021-11-26-12:08:30.525.404 [mindspore/dataset/engine/datasets.py:3463] WARN: global shuffle is not used.
before model.train() ready
INFO:CodePanGu:=====args_opt is: Namespace(data_url='s3://aix-test/CodePanGu/data/', decay_steps=1000, distribute='true', embedding_size=1024, end_lr=6e-06, epoch_size=1, local_data='data', micro_size=1, mode='self_define', num_heads=16, num_layers=3, optimizer='adam', optimizer_shard=0, per_batch_size=2, run_type='train', save_dir='saved_model', save_steps=1000, seq_length=256, sink_size=2, stage_num=1, start_lr=6e-05, tensor_model_parallel_num=1, train_url='s3://aix-test/CodePanGu/model/V0017/', vocab_size=15928, warmup_step=200, weight_decay=0.1)
INFO:CodePanGu:=====args_opt is: Namespace(data_url='s3://aix-test/CodePanGu/data/', decay_steps=1000, distribute='true', embedding_size=1024, end_lr=6e-06, epoch_size=1, local_data='data', micro_size=1, mode='self_define', num_heads=16, num_layers=3, optimizer='adam', optimizer_shard=0, per_batch_size=2, run_type='train', save_dir='saved_model', save_steps=1000, seq_length=256, sink_size=2, stage_num=1, start_lr=6e-05, tensor_model_parallel_num=1, train_url='s3://aix-test/CodePanGu/model/V0017/', vocab_size=15928, warmup_step=200, weight_decay=0.1)
INFO:CodePanGu:=====args_opt is: Namespace(data_url='s3://aix-test/CodePanGu/data/', decay_steps=1000, distribute='true', embedding_size=1024, end_lr=6e-06, epoch_size=1, local_data='data', micro_size=1, mode='self_define', num_heads=16, num_layers=3, optimizer='adam', optimizer_shard=0, per_batch_size=2, run_type='train', save_dir='saved_model', save_steps=1000, seq_length=256, sink_size=2, stage_num=1, start_lr=6e-05, tensor_model_parallel_num=1, train_url='s3://aix-test/CodePanGu/model/V0017/', vocab_size=15928, warmup_step=200, weight_decay=0.1)
INFO:CodePanGu:=====args_opt is: Namespace(data_url='s3://aix-test/CodePanGu/data/', decay_steps=1000, distribute='true', embedding_size=1024, end_lr=6e-06, epoch_size=1, local_data='data', micro_size=1, mode='self_define', num_heads=16, num_layers=3, optimizer='adam', optimizer_shard=0, per_batch_size=2, run_type='train', save_dir='saved_model', save_steps=1000, seq_length=256, sink_size=2, stage_num=1, start_lr=6e-05, tensor_model_parallel_num=1, train_url='s3://aix-test/CodePanGu/model/V0017/', vocab_size=15928, warmup_step=200, weight_decay=0.1)
INFO:CodePanGu:['/cache/user-job-dir/workspace/device7/data/rank_7/random_span_train_seq_256.mindrecords', '/cache/user-job-dir/workspace/device7/data/rank_7/random_span_valid_seq_256.mindrecords']
INFO:CodePanGu:['/cache/user-job-dir/workspace/device1/data/rank_1/random_span_train_seq_256.mindrecords', '/cache/user-job-dir/workspace/device1/data/rank_1/random_span_valid_seq_256.mindrecords']
[WARNING] ME(121:281473738144336,MainProcess):2021-11-26-12:08:30.642.848 [mindspore/dataset/engine/datasets.py:3463] WARN: global shuffle is not used.
[WARNING] ME(109:281473274567248,MainProcess):2021-11-26-12:08:30.643.468 [mindspore/dataset/engine/datasets.py:3463] WARN: global shuffle is not used.
INFO:CodePanGu:['/cache/user-job-dir/workspace/device2/data/rank_2/random_span_train_seq_256.mindrecords', '/cache/user-job-dir/workspace/device2/data/rank_2/random_span_valid_seq_256.mindrecords']
INFO:CodePanGu:['/cache/user-job-dir/workspace/device6/data/rank_6/random_span_train_seq_256.mindrecords', '/cache/user-job-dir/workspace/device6/data/rank_6/random_span_valid_seq_256.mindrecords']
[WARNING] ME(111:281473143364176,MainProcess):2021-11-26-12:08:30.649.122 [mindspore/dataset/engine/datasets.py:3463] WARN: global shuffle is not used.
[WARNING] ME(119:281473708223056,MainProcess):2021-11-26-12:08:30.649.345 [mindspore/dataset/engine/datasets.py:3463] WARN: global shuffle is not used.
before model.train() ready
before model.train() ready
before model.train() ready
before model.train() ready
[Modelarts Service Log]2021-11-26 12:08:42,201 - ERROR - proc-rank-0-device-0 (pid: 107) has exited with non-zero code: -11
[Modelarts Service Log]2021-11-26 12:08:42,201 - INFO - Begin destroy training processes
[Modelarts Service Log]2021-11-26 12:08:42,201 - INFO - proc-rank-7-device-7 (pid: 121) has exited
[Modelarts Service Log]2021-11-26 12:08:42,202 - INFO - proc-rank-6-device-6 (pid: 119) has exited
[Modelarts Service Log]2021-11-26 12:08:42,202 - INFO - proc-rank-5-device-5 (pid: 117) has exited
[Modelarts Service Log]2021-11-26 12:08:42,202 - INFO - proc-rank-4-device-4 (pid: 115) has exited
[Modelarts Service Log]2021-11-26 12:08:42,203 - INFO - proc-rank-3-device-3 (pid: 113) has exited
[Modelarts Service Log]2021-11-26 12:08:42,203 - INFO - proc-rank-2-device-2 (pid: 111) has exited
[Modelarts Service Log]2021-11-26 12:08:42,203 - INFO - proc-rank-1-device-1 (pid: 109) has exited
[Modelarts Service Log]2021-11-26 12:08:42,203 - INFO - proc-rank-0-device-0 (pid: 107) has exited
[Modelarts Service Log]2021-11-26 12:08:42,203 - INFO - End destroy training processes
```
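The only failure recorded above is `proc-rank-0-device-0 (pid: 107) has exited with non-zero code: -11`; no Python traceback was printed before it. Here is a minimal sketch for pulling such lines out of the log and decoding the code. It assumes the launcher reports exit codes the way Python's `subprocess` does, where a negative value `-N` means the process was terminated by signal `N` (on Linux, signal 11 is SIGSEGV, i.e. a crash in native code). `find_failures` and `describe_exit` are hypothetical helper names, not part of ModelArts or MindSpore.

```python
import re
import signal

# Matches the ModelArts service-log line that reports a failed rank process.
FAILURE_RE = re.compile(
    r"(?P<proc>proc-rank-\d+-device-\d+) \(pid: \d+\) "
    r"has exited with non-zero code: (?P<code>-?\d+)"
)


def describe_exit(code: int) -> str:
    """Explain an exit code, assuming subprocess-style negative signal codes."""
    if code < 0:
        try:
            return f"killed by {signal.Signals(-code).name}"
        except ValueError:
            # Signal number not known on this platform.
            return f"killed by unknown signal {-code}"
    return f"exited with status {code}"


def find_failures(log_text: str) -> list[tuple[str, int, str]]:
    """Return (process name, raw code, explanation) for each failure line."""
    failures = []
    for match in FAILURE_RE.finditer(log_text):
        code = int(match.group("code"))
        failures.append((match.group("proc"), code, describe_exit(code)))
    return failures
```

Running `find_failures` over the full log text would single out rank 0 as the process that died first; the other ranks were then torn down by the service ("Begin destroy training processes").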
2 years ago