#1 master

Merged
avadesian merged 35 commits from OpenIOSSG/MNIST_PytorchExample_GPU:master into master 1 year ago
  1. Example_picture/基础镜像.png (BIN)
  2. Example_picture/适用A100的基础镜像.png (BIN)
  3. README.md (+92, -53)
  4. inference.py (+72, -0)
  5. train.py (+24, -9)
  6. train_for_c2net.py (+78, -0)
  7. train_for_multidataset.py (+113, -0)

BIN
Example_picture/基础镜像.png

Before: Width: 2052  |  Height: 1282  |  Size: 106 KiB
After: Width: 2071  |  Height: 1265  |  Size: 115 KiB

BIN
Example_picture/适用A100的基础镜像.png

Width: 1499  |  Height: 676  |  Size: 108 KiB

+ 92
- 53
README.md

@@ -1,53 +1,92 @@




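# How to Train Models on the OpenI (Qizhi) Platform - GPU Version

## 1 Overview
- Using #LeNet5-MNIST-PyTorch as an example, this project briefly shows how to run a training task with PyTorch on the OpenI AI collaboration platform. It is intended as a Cloud Brain training example for AI beginners.
- Users can create their own training tasks directly from the dataset and code files provided in this project.

## 2 Preparation
- Platform setup: you need to create an OpenI platform account, clone the code into your own account, and upload the dataset. For step-by-step instructions, see the beginner training camp course series in the [OpenI_Learning](https://git.openi.org.cn/zeizei/OpenI_Learning) project.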

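### 2.1 Data preparation
#### Obtaining the dataset
- To try out this example you do not need to upload the dataset again: MnistDataset_torch.zip is already published as a public dataset and can be referenced directly. You can also download it from this project's dataset directory to inspect its structure: [MNISTDataset_torch.zip download](https://git.openi.org.cn/OpenIOSSG/MNIST_PytorchExample_GPU/datasets?type=0).
- Data file description
- The MNISTData dataset consists of 28×28 grayscale images in 10 classes. The training set contains 60,000 images and the test set contains 10,000 images.

#### Uploading the dataset
- GPU training runs on GPU chips, so the dataset must be uploaded on the GPU page. (This step is not needed for this example; simply select the public dataset MNISTDataset_torch.zip.)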
### 2.2 Preparing the training script
#### Example code
- The example code can be downloaded from this repository: [code download](https://git.openi.org.cn/OpenIOSSG/MNIST_PytorchExample_GPU)
- Code file description
- [train.py](https://git.openi.org.cn/OpenIOSSG/MNIST_PytorchExample_GPU/src/branch/master/train.py): the training script. For details, see [train.py](https://git.openi.org.cn/OpenIOSSG/MNIST_PytorchExample_GPU/src/branch/master/train.py)
- [model.py](https://git.openi.org.cn/OpenIOSSG/MNIST_PytorchExample_GPU/src/branch/master/model.py): the network used for training; it is used by train.py.

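## 3 Creating a training task
- In the current version, select the base image for training tasks; other images may fail to run because of A100 compatibility issues. The base image is dockerhub.pcl.ac.cn:5000/user-images/openi:ssbai_torch1.9, which includes pytorch1.9, python3.8, and cuda11.1.
- Once the data and the script are ready, create a training task to actually run the PyTorch script. First-time users can follow this example code.

### Use an image that includes PyTorch 1.9 and CUDA 11; the page is shown in the screenshot below.
![avatar](Example_picture/基础镜像.png)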


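Table 1. Parameters on the training job creation page

| Parameter | Description |
| ----------------- | ----------- |
| Compute resource | Select CPU/GPU |
| Code branch | Select the branch of the repository to use; the master branch can be selected by default |
| Image | Select an image that has already been debugged in the debugging environment. For the current version, select the base image: dockerhub.pcl.ac.cn:5000/user-images/openi:ssbai_torch1.9 |
| Startup file | Select the startup script train.py in the code directory |
| Dataset | Select the public dataset MnistDataset_torch.zip already uploaded to the OpenI platform |
| Run parameters | Add run parameters to pass values to other arguments of the script, such as epoch_size |
| Resource specification | Select a specification that includes the required number of GPUs |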

## 4 Viewing the results
### 4.1 The run log can be viewed on the training job page
### 4.2 The model file can be downloaded after training finishes
![avatar](Example_picture/结果下载.png)
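# How to Train Models on the OpenI (Qizhi) Platform - GPU Version
- Single-dataset training, multi-dataset training, and C2Net (intelligent computing network) training are each used differently; take care to distinguish them:
- For single-dataset training, see the code comments in [train.py](https://git.openi.org.cn/OpenIOSSG/MNIST_PytorchExample_GPU/src/branch/master/train.py)
- For multi-dataset training, see the code comments in [train_for_multidataset.py](https://git.openi.org.cn/OpenIOSSG/MNIST_PytorchExample_GPU/src/branch/master/train_for_multidataset.py)
- For C2Net training, see the code comments in [train_for_c2net.py](https://git.openi.org.cn/OpenIOSSG/MNIST_PytorchExample_GPU/src/branch/master/train_for_c2net.py)
- Single-dataset and multi-dataset training differ in where the dataset is placed:
With a single dataset, as in this example, the contents of MNISTDataset_torch.zip are placed under /dataset/
With multiple datasets, the contents of MNISTDataset_torch.zip are placed under /dataset/MNISTDataset_torch/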
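The path difference described above can be sketched as a small helper. This is a minimal illustration that assumes the layout stated in this README (a single dataset is unpacked directly under /dataset, while each of several datasets is unpacked into a subdirectory named after its archive). The function name `dataset_root` and its parameters are hypothetical and are not part of the platform.

```python
from pathlib import PurePosixPath


def dataset_root(archive_name, multi_dataset, base="/dataset"):
    """Return the directory where an archive's contents are expected to appear.

    Assumption (taken from the README text): a single dataset lands directly
    under `base`; with multiple datasets, each one lands in a subdirectory
    named after the archive without its .zip extension.
    """
    root = PurePosixPath(base)
    if not multi_dataset:
        return str(root)
    # Strip only the trailing ".zip" so names such as
    # "checkpoint_epoch1_0.73.zip" keep their inner dots.
    stem = archive_name[:-4] if archive_name.endswith(".zip") else archive_name
    return str(root / stem)


print(dataset_root("MNISTDataset_torch.zip", multi_dataset=False))
print(dataset_root("MNISTDataset_torch.zip", multi_dataset=True))
```

A script could use such a helper to build its `--traindata` and `--testdata` defaults, so that switching between one and several datasets changes a single flag instead of several hard-coded paths.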
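## 1 Overview
- Using #LeNet5-MNIST-PyTorch as an example, this project briefly shows how to run training tasks with PyTorch on the OpenI AI collaboration platform, covering single-dataset training, multi-dataset training, and C2Net training. It is intended as an OpenI training example for AI developers.
- Users can create their own training tasks directly from the dataset and code files provided in this project.
## 2 Preparation
- Platform setup: you need to create an OpenI platform account, clone the code into your own account, and upload the dataset. For step-by-step instructions, see the beginner training camp course series in the [OpenI_Learning](https://git.openi.org.cn/zeizei/OpenI_Learning) project.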
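### 2.1 Data preparation
#### Obtaining the dataset
- To try out this example you do not need to upload the dataset again: MnistDataset_torch.zip is already published as a public dataset and can be referenced directly. You can also download the datasets from this project's dataset directory to inspect their structure: [MNISTDataset_torch.zip download](https://git.openi.org.cn/OpenIOSSG/MNIST_PytorchExample_GPU/datasets?type=0), [mnist_epoch1_0.73.pkl.zip download](https://git.openi.org.cn/OpenIOSSG/MNIST_PytorchExample_GPU/datasets?type=0).
- Data file description
- The MNISTData dataset consists of 28×28 grayscale images in 10 classes. The training set contains 60,000 images and the test set contains 10,000 images.
- The directory structure of the dataset archives is as follows: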
> MNISTDataset_torch.zip
>  ├── test
>  │   └── MNIST
>  │       ├── raw
>  │       │   ├── t10k-images-idx3-ubyte
>  │       │   ├── t10k-labels-idx1-ubyte
>  │       │   ├── train-images-idx3-ubyte
>  │       │   └── train-labels-idx1-ubyte
>  │       └── processed
>  │           ├── test.pt
>  │           └── training.pt
>  └── train
>      └── MNIST
>          ├── raw
>          │   ├── t10k-images-idx3-ubyte
>          │   ├── t10k-labels-idx1-ubyte
>          │   ├── train-images-idx3-ubyte
>          │   └── train-labels-idx1-ubyte
>          └── processed
>              ├── test.pt
>              └── training.pt
> mnist_epoch1_0.73.pkl.zip
>  └── mnist_epoch1_0.73.pkl
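#### Uploading the dataset
GPU training runs on GPU chips, so the dataset must be uploaded on the GPU page. (This step is not needed for this example; simply select the public dataset MNISTDataset_torch.zip.)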
### 2.2 Preparing the training scripts
#### Example code
- The example code can be downloaded from this repository: [code download](https://git.openi.org.cn/OpenIOSSG/MNIST_PytorchExample_GPU)
- Code file description
- [train.py](https://git.openi.org.cn/OpenIOSSG/MNIST_PytorchExample_GPU/src/branch/master/train.py): the script for single-dataset training. For details, see [train.py](https://git.openi.org.cn/OpenIOSSG/MNIST_PytorchExample_GPU/src/branch/master/train.py)
- [train_for_multidataset.py](https://git.openi.org.cn/OpenIOSSG/MNIST_PytorchExample_GPU/src/branch/master/train_for_multidataset.py): the script for multi-dataset training. For details, see [train_for_multidataset.py](https://git.openi.org.cn/OpenIOSSG/MNIST_PytorchExample_GPU/src/branch/master/train_for_multidataset.py)
- [train_for_c2net.py](https://git.openi.org.cn/OpenIOSSG/MNIST_PytorchExample_GPU/src/branch/master/train_for_c2net.py): the script for C2Net training. For details, see [train_for_c2net.py](https://git.openi.org.cn/OpenIOSSG/MNIST_PytorchExample_GPU/src/branch/master/train_for_c2net.py)
- [model.py](https://git.openi.org.cn/OpenIOSSG/MNIST_PytorchExample_GPU/src/branch/master/model.py): the network used for training; it is used by single-dataset, multi-dataset, and C2Net training.
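## 3 Creating a training task
Once the data and the script are ready, create a training task to run the PyTorch script. First-time users can follow this example code.
### Training page example
Because of A100 compatibility issues, the A100 requires CUDA 11 or later. The platform now provides A100-compatible CUDA base images; simply select the corresponding public image:
![avatar](Example_picture/适用A100的基础镜像.png)
Refer to the following for the training page parameters:
![avatar](Example_picture/基础镜像.png)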
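Table 1. Parameters on the training job creation page
| Parameter | Description |
| ----------------- | ----------- |
| Compute resource | Select CPU/GPU |
| Code branch | Select the branch of the repository to use; the master branch can be selected by default |
| Image | Select an image that has already been debugged in the debugging environment. For the current version, select one of the A100-compatible CUDA base images provided by the platform, such as dockerhub.pcl.ac.cn:5000/user-images/openi:cuda111_python37_pytorch191 |
| Startup file | Select the startup script train.py in the code directory |
| Dataset | Select the public dataset MnistDataset_torch.zip already uploaded to the OpenI platform |
| Run parameters | Add run parameters to pass values to other arguments of the script, such as epoch_size |
| Resource specification | Select a specification that includes the required number of GPUs |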
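
Run parameters entered on the job page reach the script as command-line arguments. The sketch below shows one way to read them with `argparse`, using `parse_known_args` as the example scripts in this repository do, so that unrecognized arguments do not abort the run. The flag `--some_other_flag` is invented here purely to illustrate that behavior; it is not a documented platform argument.

```python
import argparse


def build_parser():
    """Build a parser for the run parameters used by the example scripts."""
    parser = argparse.ArgumentParser(description="PyTorch MNIST Example")
    parser.add_argument("--epoch_size", type=int, default=1,
                        help="number of epochs to train")
    parser.add_argument("--batch_size", type=int, default=256,
                        help="number of samples per batch")
    return parser


# Simulated command line: one known run parameter plus one unknown flag.
argv = ["--epoch_size", "3", "--some_other_flag", "x"]
args, unknown = build_parser().parse_known_args(argv)

print(args.epoch_size, args.batch_size)  # 3 256
print(unknown)  # ['--some_other_flag', 'x']
```

With `parse_args` the unknown flag would terminate the script with a usage error, which is presumably why this pull request switches train.py to `parse_known_args`.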
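## 4 Viewing the results
### 4.1 The run log can be viewed on the training job page
Currently, training task logs can only be produced with print statements in the code; see the print calls in the example train.py.
### 4.2 The model file can be downloaded after training finishes
![avatar](Example_picture/结果下载.png)
## If you have any questions about the example code, feel free to open an issue in this project.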

+ 72
- 0
inference.py

@@ -0,0 +1,72 @@
#!/usr/bin/python
#coding=utf-8
'''
GPU INFERENCE INSTANCE

If there are Chinese comments in the code, add the following at the beginning of the file:
#!/usr/bin/python
#coding=utf-8
Because of A100 compatibility, use the platform's recommended image with CUDA 11,
then adjust the code and submit the image.
The image used in this example is: dockerhub.pcl.ac.cn:5000/user-images/openi:cuda111_python37_pytorch191
In the environment, the selected dataset is automatically placed in the /dataset directory.
If MnistDataset_torch.zip is selected, the test data directory is /dataset/test.

The selected model file is placed in the /model directory.
Results are written to the /result directory, and the Qizhi platform provides downloads of the files under /result.

'''


import argparse  # required by the parser below; missing in the original file
import numpy as np
import torch
from torchvision.datasets import mnist
from torch.utils.data import DataLoader
from torchvision.transforms import ToTensor
import os


# Inference settings
parser = argparse.ArgumentParser(description='PyTorch MNIST Example')
# Name of the model file
parser.add_argument('--modelname', help='model name')



if __name__ == '__main__':
    args, unknown = parser.parse_known_args()
    print('cuda is available:{}'.format(torch.cuda.is_available()))
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

    test_dataset = mnist.MNIST(root='/dataset/test', train=False, transform=ToTensor(),
                               download=False)
    test_loader = DataLoader(test_dataset, batch_size=256)
    # If the file name is fixed, model_path can be hard-coded
    model_path = '/model/'+args.modelname

    model = torch.load(model_path).to(device)
    model.eval()

    correct = 0
    _sum = 0

    for idx, (test_x, test_label) in enumerate(test_loader):
        test_x = test_x
        test_label = test_label
        predict_y = model(test_x.to(device).float()).detach()
        predict_ys = np.argmax(predict_y.cpu(), axis=-1)
        label_np = test_label.numpy()
        _ = predict_ys == test_label
        correct += np.sum(_.numpy(), axis=-1)
        _sum += _.shape[0]
    print('accuracy: {:.2f}'.format(correct / _sum))
    # Write the result to /result
    filename = 'result.txt'
    file_path = os.path.join('/result', filename)
    with open(file_path, 'w') as file:
        file.write('accuracy: {:.2f}'.format(correct / _sum))
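
The final step of inference.py writes the accuracy to a fixed /result path. A minimal sketch of the same step as a reusable helper follows. The function name `write_accuracy` and its `result_dir` parameter are hypothetical, introduced so the logic can be exercised outside the platform, for instance against a temporary directory.

```python
import os
import tempfile


def write_accuracy(correct, total, result_dir="/result", filename="result.txt"):
    """Write 'accuracy: X.XX' to result_dir/filename and return the file path."""
    # Create the directory if needed so the helper also works off-platform.
    os.makedirs(result_dir, exist_ok=True)
    # Guard against an empty test set instead of dividing by zero.
    accuracy = correct / total if total else 0.0
    file_path = os.path.join(result_dir, filename)
    with open(file_path, "w") as f:
        f.write("accuracy: {:.2f}".format(accuracy))
    return file_path


# Exercise the helper against a temporary directory instead of /result.
with tempfile.TemporaryDirectory() as tmp:
    path = write_accuracy(73, 100, result_dir=tmp)
    with open(path) as f:
        print(f.read())  # accuracy: 0.73
```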

+ 24
- 9
train.py

@@ -1,8 +1,23 @@
+#!/usr/bin/python
+#coding=utf-8
 '''
-Because of A100 compatibility issues, before using the training environment, debug your own code in the debugging
-environment with the base image dockerhub.pcl.ac.cn:5000/user-images/openi:ssbai_torch1.9, submit the image, and then switch to the training environment to train the code that already runs.
-In the training environment, the uploaded dataset is automatically placed under /dataset and the model download path defaults to /model. Set the model output location to /model so that the Qizhi platform page
-offers it for download.
+If there are Chinese comments in the code, add the following at the beginning of the file:
+#!/usr/bin/python
+#coding=utf-8
+
+Because of A100 compatibility, before using the training environment, use the platform's recommended
+image with CUDA 11, then adjust the code and submit the image.
+The image used in this example is: dockerhub.pcl.ac.cn:5000/user-images/openi:cuda111_python37_pytorch191
+In the training environment, the uploaded dataset is automatically placed in the /dataset directory.
+For a single dataset:
+    if MnistDataset_torch.zip is selected, the dataset directories are /dataset/train and /dataset/test;
+For multiple datasets:
+    if MnistDataset_torch.zip and checkpoint_epoch1_0.73.zip are selected,
+    the dataset directories are /dataset/MnistDataset_torch/train and /dataset/MnistDataset_torch/test,
+    and the checkpoint is at /dataset/checkpoint_epoch1_0.73/mnist_epoch1_0.73.pkl
+
+The model download path is /model by default. Write the model output to /model,
+and the Qizhi platform will provide downloads of the files under /model.
 '''


@@ -18,15 +33,16 @@ import argparse

 # Training settings
 parser = argparse.ArgumentParser(description='PyTorch MNIST Example')
-#数据集位置放在/dataset下
+#The dataset location is placed under /dataset
 parser.add_argument('--traindata', default="/dataset/train" ,help='path to train dataset')
 parser.add_argument('--testdata', default="/dataset/test" ,help='path to test dataset')
 parser.add_argument('--epoch_size', type=int, default=1, help='how much epoch to train')
 parser.add_argument('--batch_size', type=int, default=256, help='how much batch_size in epoch')
 
 if __name__ == '__main__':
-    args = parser.parse_args()
-    print('cuda is available:{}'.format(torch.cuda.is_available()))
+    args, unknown = parser.parse_known_args()
+    #log output
+    print('cuda is available:{}'.format(torch.cuda.is_available()))
     device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
     batch_size = args.batch_size
     train_dataset = mnist.MNIST(root=args.traindata, train=True, transform=ToTensor(),download=False)
@@ -65,7 +81,6 @@ if __name__ == '__main__':
             _ = predict_ys == test_label
             correct += np.sum(_.numpy(), axis=-1)
             _sum += _.shape[0]
-
         print('accuracy: {:.2f}'.format(correct / _sum))
-        #模型输出位置放在/model下
+        #The model output location is placed under /model
         torch.save(model, '/model/mnist_epoch{}_{:.2f}.pkl'.format(_epoch+1, correct / _sum))
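
The checkpoint name built on the last line encodes the one-based epoch number and the accuracy rounded to two decimals, which is how a file such as mnist_epoch1_0.73.pkl comes about. A small sketch of that naming rule follows; the helper name `checkpoint_path` and its `output_dir` parameter are hypothetical and only restate the format string used in train.py.

```python
import posixpath


def checkpoint_path(epoch_index, accuracy, output_dir="/model"):
    """Build the checkpoint path for a zero-based epoch index and an accuracy in [0, 1]."""
    # Same format string as train.py: epoch is one-based, accuracy has 2 decimals.
    name = "mnist_epoch{}_{:.2f}.pkl".format(epoch_index + 1, accuracy)
    return posixpath.join(output_dir, name)


print(checkpoint_path(0, 0.7312))  # /model/mnist_epoch1_0.73.pkl
print(checkpoint_path(0, 0.7312, output_dir="/tmp/output"))  # /tmp/output/mnist_epoch1_0.73.pkl
```

Because the accuracy is part of the name, each epoch typically produces a new file instead of overwriting the previous one.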

+ 78
- 0
train_for_c2net.py

@@ -0,0 +1,78 @@
#!/usr/bin/python
#coding=utf-8
'''
If there are Chinese comments in the code, add the following at the beginning of the file:
#!/usr/bin/python
#coding=utf-8

In the training environment,
the code is automatically placed in the /tmp/code directory,
the uploaded dataset is automatically placed in the /tmp/dataset directory, and
the model download path is /tmp/output by default. Write the model output to /tmp/output;
the Qizhi platform provides downloads of the files under /tmp/output.
'''


from model import Model
import numpy as np
import torch
from torchvision.datasets import mnist
from torch.nn import CrossEntropyLoss
from torch.optim import SGD
from torch.utils.data import DataLoader
from torchvision.transforms import ToTensor
import argparse

# Training settings
parser = argparse.ArgumentParser(description='PyTorch MNIST Example')
#The dataset location is placed under /tmp/dataset
parser.add_argument('--traindata', default="/tmp/dataset/train" ,help='path to train dataset')
parser.add_argument('--testdata', default="/tmp/dataset/test" ,help='path to test dataset')
parser.add_argument('--epoch_size', type=int, default=1, help='how much epoch to train')
parser.add_argument('--batch_size', type=int, default=256, help='how much batch_size in epoch')

if __name__ == '__main__':
    args, unknown = parser.parse_known_args()
    #log output
    print('cuda is available:{}'.format(torch.cuda.is_available()))
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    batch_size = args.batch_size
    train_dataset = mnist.MNIST(root=args.traindata, train=True, transform=ToTensor(),download=False)
    test_dataset = mnist.MNIST(root=args.testdata, train=False, transform=ToTensor(),download=False)
    train_loader = DataLoader(train_dataset, batch_size=batch_size)
    test_loader = DataLoader(test_dataset, batch_size=batch_size)
    model = Model().to(device)
    sgd = SGD(model.parameters(), lr=1e-1)
    cost = CrossEntropyLoss()
    epoch = args.epoch_size
    print('epoch_size is:{}'.format(epoch))
    for _epoch in range(epoch):
        print('the {} epoch_size begin'.format(_epoch + 1))
        model.train()
        for idx, (train_x, train_label) in enumerate(train_loader):
            train_x = train_x.to(device)
            train_label = train_label.to(device)
            label_np = np.zeros((train_label.shape[0], 10))
            sgd.zero_grad()
            predict_y = model(train_x.float())
            loss = cost(predict_y, train_label.long())
            if idx % 10 == 0:
                print('idx: {}, loss: {}'.format(idx, loss.sum().item()))
            loss.backward()
            sgd.step()

        correct = 0
        _sum = 0
        model.eval()
        for idx, (test_x, test_label) in enumerate(test_loader):
            test_x = test_x
            test_label = test_label
            predict_y = model(test_x.to(device).float()).detach()
            predict_ys = np.argmax(predict_y.cpu(), axis=-1)
            label_np = test_label.numpy()
            _ = predict_ys == test_label
            correct += np.sum(_.numpy(), axis=-1)
            _sum += _.shape[0]
        print('accuracy: {:.2f}'.format(correct / _sum))
        #The model output location is placed under /tmp/output
        torch.save(model, '/tmp/output/mnist_epoch{}_{:.2f}.pkl'.format(_epoch+1, correct / _sum))
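
The docstrings in this pull request describe different directories for regular training jobs and for C2Net jobs. The sketch below restates them as a small lookup so the difference is visible in one place. The dictionary name, its keys, and the helper `default_data_dirs` are illustrative choices, not platform API; the directory values are the ones given in train.py and train_for_c2net.py.

```python
# Directories as described in the script docstrings of this pull request.
PLATFORM_DIRS = {
    "train": {"dataset": "/dataset", "output": "/model"},
    "c2net": {"code": "/tmp/code", "dataset": "/tmp/dataset", "output": "/tmp/output"},
}


def default_data_dirs(environment):
    """Return the (train, test) dataset directories for a single-dataset job."""
    base = PLATFORM_DIRS[environment]["dataset"]
    return base + "/train", base + "/test"


print(default_data_dirs("train"))  # ('/dataset/train', '/dataset/test')
print(default_data_dirs("c2net"))  # ('/tmp/dataset/train', '/tmp/dataset/test')
```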

+ 113
- 0
train_for_multidataset.py

@@ -0,0 +1,113 @@
#!/usr/bin/python
#coding=utf-8
'''
If there are Chinese comments in the code, add the following at the beginning of the file:
#!/usr/bin/python
#coding=utf-8

1. The structure of the datasets used in this multi-dataset example:
 MnistDataset_torch.zip
  ├── test
  └── train
 checkpoint_epoch1_0.73.zip
  └── mnist_epoch1_0.73.pkl

2. Because of A100 compatibility, before using the training environment, use the platform's recommended
image with CUDA 11, then adjust the code and submit the image.
The image used in this example is: dockerhub.pcl.ac.cn:5000/user-images/openi:cuda111_python37_pytorch191
In the training environment, the uploaded dataset is automatically placed in the /dataset directory.
Note: the paths differ between a single dataset and multiple datasets.
(1) Single dataset: if MnistDataset_torch.zip is selected,
the dataset directories are /dataset/train and /dataset/test.

The dataset structure in the training image for a single dataset in this example:
 dataset
  ├── test
  └── train
(2) Multiple datasets: if MnistDataset_torch.zip and checkpoint_epoch1_0.73.zip are selected,
the dataset directories are /dataset/MnistDataset_torch/train and /dataset/MnistDataset_torch/test,
and the checkpoint is at /dataset/checkpoint_epoch1_0.73/mnist_epoch1_0.73.pkl

The dataset structure in the training image for multiple datasets in this example:
 dataset
  ├── MnistDataset_torch
  │   ├── test
  │   └── train
  └── checkpoint_epoch1_0.73
      └── mnist_epoch1_0.73.pkl


The model download path is /model by default. Write the model output to /model,
and the Qizhi platform will provide downloads of the files under /model.
'''


from model import Model
import numpy as np
import torch
from torchvision.datasets import mnist
from torch.nn import CrossEntropyLoss
from torch.optim import SGD
from torch.utils.data import DataLoader
from torchvision.transforms import ToTensor
import argparse

# Training settings
parser = argparse.ArgumentParser(description='PyTorch MNIST Example')
#The dataset location is placed under /dataset
parser.add_argument('--traindata', default="/dataset/MnistDataset_torch/train" ,help='path to train dataset')
parser.add_argument('--testdata', default="/dataset/MnistDataset_torch/test" ,help='path to test dataset')
parser.add_argument('--checkpoint', default="/dataset/checkpoint_epoch1_0.73/mnist_epoch1_0.73.pkl" ,help='checkpoint file')
parser.add_argument('--epoch_size', type=int, default=1, help='how much epoch to train')
parser.add_argument('--batch_size', type=int, default=256, help='how much batch_size in epoch')

if __name__ == '__main__':
    args, unknown = parser.parse_known_args()
    #log output
    print('cuda is available:{}'.format(torch.cuda.is_available()))
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    batch_size = args.batch_size
    train_dataset = mnist.MNIST(root=args.traindata, train=True, transform=ToTensor(),download=False)
    test_dataset = mnist.MNIST(root=args.testdata, train=False, transform=ToTensor(),download=False)
    train_loader = DataLoader(train_dataset, batch_size=batch_size)
    test_loader = DataLoader(test_dataset, batch_size=batch_size)
    model = Model().to(device)
    sgd = SGD(model.parameters(), lr=1e-1)
    cost = CrossEntropyLoss()
    epoch = args.epoch_size
    print('epoch_size is:{}'.format(epoch))
    # Load the trained model
    # path = args.checkpoint
    # checkpoint = torch.load(path, map_location=device)
    # model.load_state_dict(checkpoint)
    for _epoch in range(epoch):
        print('the {} epoch_size begin'.format(_epoch + 1))
        model.train()
        for idx, (train_x, train_label) in enumerate(train_loader):
            train_x = train_x.to(device)
            train_label = train_label.to(device)
            label_np = np.zeros((train_label.shape[0], 10))
            sgd.zero_grad()
            predict_y = model(train_x.float())
            loss = cost(predict_y, train_label.long())
            if idx % 10 == 0:
                print('idx: {}, loss: {}'.format(idx, loss.sum().item()))
            loss.backward()
            sgd.step()

        correct = 0
        _sum = 0
        model.eval()
        for idx, (test_x, test_label) in enumerate(test_loader):
            test_x = test_x
            test_label = test_label
            predict_y = model(test_x.to(device).float()).detach()
            predict_ys = np.argmax(predict_y.cpu(), axis=-1)
            label_np = test_label.numpy()
            _ = predict_ys == test_label
            correct += np.sum(_.numpy(), axis=-1)
            _sum += _.shape[0]
        print('accuracy: {:.2f}'.format(correct / _sum))
        #The model output location is placed under /model
        torch.save(model, '/model/mnist_epoch{}_{:.2f}.pkl'.format(_epoch+1, correct / _sum))
