Running on Ascend processors
# Run the single-device training script from the python command line.
python train.py \
--train_data=xxx/dataset/waterloo5050step40colorimage/ \
--sigma=15 \
--channel=3 \
--batch_size=32 \
--lr=0.001 \
--use_modelarts=0 \
--output_path=./output/ \
--is_distributed=0 \
--epoch=50 > log.txt 2>&1 &
# Start single-device training with bash.
bash ./scripts/run_train.sh [train_code_path] [train_data] [batch_size] [sigma] [channel] [epoch] [lr]
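# Both commands above run the script in the background and write logs to log.txt; check that file to follow training progress.
# Ascend multi-device training (8 devices by default; for 2 or 4 devices, edit run_distribute_train.sh yourself)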
bash run_distribute_train.sh [train_code_path] [train_data] [batch_size] [sigma] [channel] [epoch] [lr] [rank_table_file_path]
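For readers who prefer to start the single-device run from Python instead of a shell, the following is a minimal sketch of the same invocation. The helper names `build_train_command` and `launch_in_background` are hypothetical and not part of the repository; the flags and default values are copied from the command shown above, and the `xxx/` path prefix is left as elided in the original.

```python
import subprocess


def build_train_command(train_data, sigma=15, channel=3, batch_size=32,
                        lr=0.001, epoch=50, output_path="./output/"):
    """Assemble the single-device train.py invocation as an argument list.

    Hypothetical helper: it only mirrors the flags shown in the shell command.
    """
    return [
        "python", "train.py",
        f"--train_data={train_data}",
        f"--sigma={sigma}",
        f"--channel={channel}",
        f"--batch_size={batch_size}",
        f"--lr={lr}",
        "--use_modelarts=0",
        f"--output_path={output_path}",
        "--is_distributed=0",
        f"--epoch={epoch}",
    ]


def launch_in_background(command, log_path="log.txt"):
    """Start the command without waiting for it to finish.

    stdout goes to log_path and stderr is merged into it, which mirrors
    the shell redirection `> log.txt 2>&1 &`.
    """
    with open(log_path, "w") as log_file:
        # The child process keeps its own copy of the file descriptor,
        # so closing log_file in the parent does not interrupt logging.
        return subprocess.Popen(command, stdout=log_file,
                                stderr=subprocess.STDOUT)


command = build_train_command("xxx/dataset/waterloo5050step40colorimage/")
print(" ".join(command))
```

Calling `launch_in_background(command)` from the directory that contains `train.py` would then start training and return immediately, leaving progress in `log.txt`.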
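Note: on the first run, the output may stay on the screen shown below for a long time, because logs are printed only after an epoch has finished. Please be patient.
On a single device, the first epoch is expected to take 20 to 30 minutes.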
2021-05-16 20:12:17,888:INFO:Args:
2021-05-16 20:12:17,888:INFO:--> batch_size: 32
2021-05-16 20:12:17,888:INFO:--> train_data: ../dataset/waterloo5050step40colorimage/
2021-05-16 20:12:17,889:INFO:--> sigma: 15
2021-05-16 20:12:17,889:INFO:--> channel: 3
2021-05-16 20:12:17,889:INFO:--> epoch: 50
2021-05-16 20:12:17,889:INFO:--> lr: 0.001
2021-05-16 20:12:17,889:INFO:--> save_every: 1
2021-05-16 20:12:17,889:INFO:--> pretrain: None
2021-05-16 20:12:17,889:INFO:--> use_modelarts: False
2021-05-16 20:12:17,889:INFO:--> train_url: train_url/
2021-05-16 20:12:17,889:INFO:--> data_url: data_url/
2021-05-16 20:12:17,889:INFO:--> output_path: ./output/
2021-05-16 20:12:17,889:INFO:--> outer_path: s3://output/
2021-05-16 20:12:17,889:INFO:--> device_target: Ascend
2021-05-16 20:12:17,890:INFO:--> is_distributed: 0
2021-05-16 20:12:17,890:INFO:--> rank: 0
2021-05-16 20:12:17,890:INFO:--> group_size: 1
2021-05-16 20:12:17,890:INFO:--> is_save_on_master: 1
2021-05-16 20:12:17,890:INFO:--> ckpt_save_max: 5
2021-05-16 20:12:17,890:INFO:--> rank_save_ckpt_flag: 1
2021-05-16 20:12:17,890:INFO:--> logger: <LOGGER BRDNet (NOTSET)>
2021-05-16 20:12:17,890:INFO:
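After training finishes, the saved weight files can be found in the directory specified by the --output_path parameter. Part of the loss convergence during training is shown below: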
# grep "epoch time:" log.txt
epoch time: 1197471.061 ms, per step time: 32.853 ms
epoch time: 1136826.065 ms, per step time: 31.189 ms
epoch time: 1136840.334 ms, per step time: 31.190 ms
epoch time: 1136837.709 ms, per step time: 31.190 ms
epoch time: 1137081.757 ms, per step time: 31.197 ms
epoch time: 1136830.581 ms, per step time: 31.190 ms
epoch time: 1136845.253 ms, per step time: 31.190 ms
epoch time: 1136881.960 ms, per step time: 31.191 ms
epoch time: 1136850.673 ms, per step time: 31.190 ms
epoch: 10 step: 36449, loss is 103.104095
epoch time: 1137098.407 ms, per step time: 31.197 ms
epoch time: 1136794.613 ms, per step time: 31.189 ms
epoch time: 1136742.922 ms, per step time: 31.187 ms
epoch time: 1136842.009 ms, per step time: 31.190 ms
epoch time: 1136792.705 ms, per step time: 31.189 ms
epoch time: 1137056.362 ms, per step time: 31.196 ms
epoch time: 1136863.373 ms, per step time: 31.191 ms
epoch time: 1136842.938 ms, per step time: 31.190 ms
epoch time: 1136839.011 ms, per step time: 31.190 ms
epoch time: 1136879.794 ms, per step time: 31.191 ms
epoch: 20 step: 36449, loss is 61.104546
epoch time: 1137035.395 ms, per step time: 31.195 ms
epoch time: 1136830.626 ms, per step time: 31.190 ms
epoch time: 1136862.117 ms, per step time: 31.190 ms
epoch time: 1136812.265 ms, per step time: 31.189 ms
epoch time: 1136821.096 ms, per step time: 31.189 ms
epoch time: 1137050.310 ms, per step time: 31.196 ms
epoch time: 1136815.292 ms, per step time: 31.189 ms
epoch time: 1136817.757 ms, per step time: 31.189 ms
epoch time: 1136876.477 ms, per step time: 31.191 ms
epoch time: 1136798.538 ms, per step time: 31.189 ms
epoch: 30 step: 36449, loss is 116.179596
epoch time: 1136972.930 ms, per step time: 31.194 ms
epoch time: 1136825.174 ms, per step time: 31.189 ms
epoch time: 1136798.900 ms, per step time: 31.189 ms
epoch time: 1136828.101 ms, per step time: 31.190 ms
epoch time: 1136862.983 ms, per step time: 31.191 ms
epoch time: 1136989.445 ms, per step time: 31.194 ms
epoch time: 1136688.820 ms, per step time: 31.186 ms
epoch time: 1136858.111 ms, per step time: 31.190 ms
epoch time: 1136822.853 ms, per step time: 31.189 ms
epoch time: 1136782.455 ms, per step time: 31.188 ms
epoch: 40 step: 36449, loss is 70.95368
epoch time: 1137042.689 ms, per step time: 31.195 ms
epoch time: 1136797.706 ms, per step time: 31.189 ms
epoch time: 1136817.007 ms, per step time: 31.189 ms
epoch time: 1136861.577 ms, per step time: 31.190 ms
epoch time: 1136698.149 ms, per step time: 31.186 ms
epoch time: 1137052.034 ms, per step time: 31.196 ms
epoch time: 1136809.339 ms, per step time: 31.189 ms
epoch time: 1136851.343 ms, per step time: 31.190 ms
epoch time: 1136761.354 ms, per step time: 31.188 ms
epoch time: 1136837.762 ms, per step time: 31.190 ms
epoch: 50 step: 36449, loss is 87.13184
epoch time: 1137022.554 ms, per step time: 31.195 ms
2021-05-19 14:24:52,695:INFO:training finished....
...
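The epoch times in the single-device log can be cross-checked against the step count and per-step time it reports. The sketch below assumes that an epoch's duration is simply the number of steps multiplied by the average per-step time; the function name `epoch_time_ms` is illustrative, and the constants are taken from the log lines above.

```python
def epoch_time_ms(steps_per_epoch, per_step_ms):
    """Estimate one epoch's duration in milliseconds."""
    return steps_per_epoch * per_step_ms


STEPS_PER_EPOCH = 36449  # from "epoch: 10 step: 36449" in the log

first_epoch = epoch_time_ms(STEPS_PER_EPOCH, 32.853)   # first logged epoch
steady_epoch = epoch_time_ms(STEPS_PER_EPOCH, 31.189)  # typical later epoch

for name, value in [("first", first_epoch), ("steady", steady_epoch)]:
    # 60000 ms per minute
    print(f"{name} epoch: {value:.0f} ms, about {value / 60000:.1f} min")
```

Both estimates land within a fraction of a percent of the logged epoch times (1197471.061 ms and 1136826.065 ms), that is, roughly 19 to 20 minutes per epoch on a single device.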
Loss convergence with 8 devices in parallel:
epoch time: 217708.130 ms, per step time: 47.785 ms
epoch time: 144899.598 ms, per step time: 31.804 ms
epoch time: 144736.054 ms, per step time: 31.768 ms
epoch time: 144737.085 ms, per step time: 31.768 ms
epoch time: 144738.102 ms, per step time: 31.769 ms
epoch: 5 step: 4556, loss is 106.67432
epoch time: 144905.830 ms, per step time: 31.805 ms
epoch time: 144736.539 ms, per step time: 31.768 ms
epoch time: 144734.210 ms, per step time: 31.768 ms
epoch time: 144734.415 ms, per step time: 31.768 ms
epoch time: 144736.405 ms, per step time: 31.768 ms
epoch: 10 step: 4556, loss is 94.092865
epoch time: 144921.081 ms, per step time: 31.809 ms
epoch time: 144735.718 ms, per step time: 31.768 ms
epoch time: 144737.036 ms, per step time: 31.768 ms
epoch time: 144737.733 ms, per step time: 31.769 ms
epoch time: 144738.251 ms, per step time: 31.769 ms
epoch: 15 step: 4556, loss is 99.18075
epoch time: 144921.945 ms, per step time: 31.809 ms
epoch time: 144734.948 ms, per step time: 31.768 ms
epoch time: 144735.662 ms, per step time: 31.768 ms
epoch time: 144733.871 ms, per step time: 31.768 ms
epoch time: 144734.722 ms, per step time: 31.768 ms
epoch: 20 step: 4556, loss is 92.54497
epoch time: 144907.430 ms, per step time: 31.806 ms
epoch time: 144735.713 ms, per step time: 31.768 ms
epoch time: 144733.781 ms, per step time: 31.768 ms
epoch time: 144736.005 ms, per step time: 31.768 ms
epoch time: 144734.331 ms, per step time: 31.768 ms
epoch: 25 step: 4556, loss is 90.98991
epoch time: 144911.420 ms, per step time: 31.807 ms
epoch time: 144734.535 ms, per step time: 31.768 ms
epoch time: 144734.851 ms, per step time: 31.768 ms
epoch time: 144736.346 ms, per step time: 31.768 ms
epoch time: 144734.939 ms, per step time: 31.768 ms
epoch: 30 step: 4556, loss is 114.33954
epoch time: 144915.434 ms, per step time: 31.808 ms
epoch time: 144737.336 ms, per step time: 31.769 ms
epoch time: 144733.943 ms, per step time: 31.768 ms
epoch time: 144734.587 ms, per step time: 31.768 ms
epoch time: 144735.043 ms, per step time: 31.768 ms
epoch: 35 step: 4556, loss is 97.21166
epoch time: 144912.719 ms, per step time: 31.807 ms
epoch time: 144734.795 ms, per step time: 31.768 ms
epoch time: 144733.824 ms, per step time: 31.768 ms
epoch time: 144735.946 ms, per step time: 31.768 ms
epoch time: 144734.930 ms, per step time: 31.768 ms
epoch: 40 step: 4556, loss is 82.41978
epoch time: 144901.017 ms, per step time: 31.804 ms
epoch time: 144735.060 ms, per step time: 31.768 ms
epoch time: 144733.657 ms, per step time: 31.768 ms
epoch time: 144732.592 ms, per step time: 31.767 ms
epoch time: 144731.292 ms, per step time: 31.767 ms
epoch: 45 step: 4556, loss is 77.92129
epoch time: 144909.250 ms, per step time: 31.806 ms
epoch time: 144732.944 ms, per step time: 31.768 ms
epoch time: 144733.161 ms, per step time: 31.768 ms
epoch time: 144732.912 ms, per step time: 31.768 ms
epoch time: 144733.709 ms, per step time: 31.768 ms
epoch: 50 step: 4556, loss is 85.499596
2021-05-19 02:44:44,219:INFO:training finished....
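To track convergence without reading the log by eye, the two line formats shown above can be extracted with regular expressions. This is a minimal sketch that assumes the log lines look exactly like the samples in this section; `parse_training_log` is a hypothetical helper and not part of the repository.

```python
import re

# Patterns modelled on the log lines shown in this section.
EPOCH_TIME_RE = re.compile(
    r"epoch time: ([\d.]+) ms, per step time: ([\d.]+) ms")
LOSS_RE = re.compile(r"epoch: (\d+) step: (\d+), loss is ([\d.]+)")


def parse_training_log(lines):
    """Return (epoch_times_ms, losses) parsed from an iterable of log lines.

    epoch_times_ms is a list of floats in log order;
    losses maps epoch number to the reported loss value.
    """
    epoch_times_ms = []
    losses = {}
    for line in lines:
        match = EPOCH_TIME_RE.search(line)
        if match:
            epoch_times_ms.append(float(match.group(1)))
            continue
        match = LOSS_RE.search(line)
        if match:
            losses[int(match.group(1))] = float(match.group(3))
    return epoch_times_ms, losses


# Sample lines copied from the 8-device log above.
sample = [
    "epoch time: 144734.930 ms, per step time: 31.768 ms",
    "epoch: 40 step: 4556, loss is 82.41978",
    "epoch time: 144901.017 ms, per step time: 31.804 ms",
    "2021-05-19 02:44:44,219:INFO:training finished....",
]
times, losses = parse_training_log(sample)
print(times)
print(losses)
```

For a full run, the same function can be fed an open file, for example `parse_training_log(open("log.txt"))`. Lines that match neither pattern, such as the timestamped INFO lines, are ignored.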