#4 master

Merged
junan merged 47 commits from OpenIOSSG/MNIST_PytorchExample_GPU:master into master 1 year ago
  1. README.md (+69, −49)
  2. inference.py (+5, −1)
  3. pretrain.py (+125, −0)
  4. pretrain_for_c2net.py (+141, −0)
  5. test.py (+0, −1)
  6. train.py (+29, −1)
  7. train_for_c2net.py (+40, −2)

README.md (+69, −49)

@@ -1,92 +1,112 @@




# How to Train a Model on the Qizhi (OpenI) Platform - GPU Version

- Single-dataset training on the Qizhi cluster, multi-dataset training on the Qizhi cluster, and single-dataset training on the C2Net intelligent computing cluster are all used differently; please keep them apart:

- For single-dataset training on the Qizhi cluster, see the code comments in [train.py](https://git.openi.org.cn/OpenIOSSG/MNIST_PytorchExample_GPU/src/branch/master/train.py)
- For single-dataset training **with a loaded model** on the Qizhi cluster, see the code comments in [pretrain.py](https://git.openi.org.cn/OpenIOSSG/MNIST_PytorchExample_GPU/src/branch/master/pretrain.py)
- For multi-dataset training on the Qizhi cluster, see the code comments in [train_for_multidataset.py](https://git.openi.org.cn/OpenIOSSG/MNIST_PytorchExample_GPU/src/branch/master/train_for_multidataset.py)
- For single-dataset training on the C2Net cluster, see the code comments in [train_for_c2net.py](https://git.openi.org.cn/OpenIOSSG/MNIST_PytorchExample_GPU/src/branch/master/train_for_c2net.py)
- For single-dataset training **with a loaded model** on the C2Net cluster, see the code comments in [pretrain_for_c2net.py](https://git.openi.org.cn/OpenIOSSG/MNIST_PytorchExample_GPU/src/branch/master/pretrain_for_c2net.py)
- On the Qizhi cluster, single-dataset and multi-dataset jobs differ only in where the data is mounted (see the sketch after the snippet below):
  as a single dataset, MNISTDataset_torch.zip is unpacked under /dataset/;
  as one of multiple datasets, MNISTDataset_torch.zip is unpacked under /dataset/MNISTDataset_torch/.
- On the C2Net intelligent computing network, if you need the training results back after every epoch, use the upload tool to push the contents of the /tmp/output folder to Qizhi for download; the call is:

```
os.system("cd /tmp/script_for_grampus/ &&./uploader_for_gpu " + "/tmp/output/")
```
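
The two mount layouts differ only in the dataset root. A minimal sketch of resolving it (the helper name and the multi_dataset flag are illustrative, not part of the example scripts):

```
import os

def resolve_data_root(multi_dataset, dataset_name="MNISTDataset_torch"):
    """Return the dataset root on the Qizhi GPU cluster.

    Single dataset: the archive is unpacked directly under /dataset.
    Multiple datasets: each archive gets its own sub-directory under /dataset.
    """
    if multi_dataset:
        return os.path.join("/dataset", dataset_name)
    return "/dataset"

# The training data would then live at:
#   single dataset:    /dataset/train
#   multiple datasets: /dataset/MNISTDataset_torch/train
train_dir = os.path.join(resolve_data_root(multi_dataset=False), "train")
```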

## 1 Overview

- This project uses LeNet5-MNIST-PyTorch as an example to briefly show how to complete a training job with PyTorch on the Qizhi AI collaboration platform, covering single-dataset training, multi-dataset training, and training on the C2Net intelligent computing network, so that AI developers have a ready-made Qizhi training example.
- You can create your own training job directly from the dataset and code files provided by this project.

## 2 Preparation

- To use the Qizhi platform, you need to create a Qizhi account, clone the code into your own account, and upload the dataset; for step-by-step instructions, see the beginner training camp course series in the [OpenI_Learning](https://git.openi.org.cn/zeizei/OpenI_Learning) project.

### 2.1 Data preparation

#### Obtaining the dataset

- If you just want to try this example, you do not need to upload the dataset again: MnistDataset_torch.zip is already a public dataset and can be referenced directly. You can also download the datasets from this project's dataset page to inspect their structure: [MNISTDataset_torch.zip download](https://git.openi.org.cn/OpenIOSSG/MNIST_PytorchExample_GPU/datasets?type=0), [mnist_epoch1_0.73.pkl.zip download](https://git.openi.org.cn/OpenIOSSG/MNIST_PytorchExample_GPU/datasets?type=0).
- Data files
- The MNIST dataset consists of 28×28 grayscale images in 10 classes; the training set contains 60000 images and the test set contains 10000 images.
- The directory structure inside the dataset archives is:

> MNISTDataset_torch.zip
> ├── test
> │   └── MNIST
> │       ├── raw
> │       │   ├── t10k-images-idx3-ubyte
> │       │   ├── t10k-labels-idx1-ubyte
> │       │   ├── train-images-idx3-ubyte
> │       │   └── train-labels-idx1-ubyte
> │       └── processed
> │           ├── test.pt
> │           └── training.pt
> └── train
>     └── MNIST
>         ├── raw
>         │   ├── t10k-images-idx3-ubyte
>         │   ├── t10k-labels-idx1-ubyte
>         │   ├── train-images-idx3-ubyte
>         │   └── train-labels-idx1-ubyte
>         └── processed
>             ├── test.pt
>             └── training.pt

> mnist_epoch1_0.73.pkl.zip
> └── mnist_epoch1_0.73.pkl
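
The example scripts point torchvision at the folder that contains the MNIST/ directory shown above, so for the single-dataset layout the roots are /dataset/train and /dataset/test. A minimal sketch of the loading call used by train.py (download=False because the data ships inside the archive):

```
from torchvision.datasets import mnist
from torchvision.transforms import ToTensor

# root is the directory that contains the MNIST/ folder shown above
train_dataset = mnist.MNIST(root='/dataset/train', train=True, transform=ToTensor(), download=False)
test_dataset = mnist.MNIST(root='/dataset/test', train=False, transform=ToTensor(), download=False)
```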

#### Uploading the dataset

GPU training runs on GPU chips, so the dataset must be uploaded under the GPU tab. (This step is not needed for this example; just select the public dataset MNISTDataset_torch.zip.)

### 2.2 Preparing the start-up script

#### Example code

- The example code can be downloaded from this repository: [code download](https://git.openi.org.cn/OpenIOSSG/MNIST_PytorchExample_GPU)
- Code files
- [train.py](https://git.openi.org.cn/OpenIOSSG/MNIST_PytorchExample_GPU/src/branch/master/train.py): the script for single-dataset training; see its code comments for details.
- [train_for_multidataset.py](https://git.openi.org.cn/OpenIOSSG/MNIST_PytorchExample_GPU/src/branch/master/train_for_multidataset.py): the script for multi-dataset training; see its code comments for details.
- [train_for_c2net.py](https://git.openi.org.cn/OpenIOSSG/MNIST_PytorchExample_GPU/src/branch/master/train_for_c2net.py): the script for training on the C2Net intelligent computing network; see its code comments for details.
- [model.py](https://git.openi.org.cn/OpenIOSSG/MNIST_PytorchExample_GPU/src/branch/master/model.py): the network being trained, used by single-dataset, multi-dataset, and C2Net training alike.

## 3 Creating a training job

Once the data and start-up script are ready, create a training job to run the PyTorch script. First-time users can start from this example code.

### Training page example

Because of A100 compatibility, A100 jobs require CUDA 11 or later. The platform already provides A100-compatible CUDA base images; just select the corresponding public image:
![avatar](Example_picture/适用A100的基础镜像.png)
Reference settings for the training page:
![avatar](Example_picture/基础镜像.png)


Table 1: parameters on the create-training-job page

| Parameter | Description |
| ----------------- | ----------- |
| Compute resource | Select CPU/GPU |
| Code branch | Select the branch of the repository code to use; master is the default |
| Image | Select an image already debugged in the debug environment; for the current version choose a base image: the platform provides A100 CUDA base images such as dockerhub.pcl.ac.cn:5000/user-images/openi:cuda111_python37_pytorch191 |
| Start file | Select the start-up script train.py in the code directory |
| Dataset | Select the public dataset MnistDataset_torch.zip already uploaded to the Qizhi platform |
| Run parameters | Add run parameters to pass values such as epoch_size into the script (see the sketch below) |
| Resource spec | Select a spec with the desired number of GPUs |
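
Run parameters are passed to the start file as command-line flags, which is why the example scripts read them with argparse. A minimal sketch mirroring train.py (parse_known_args is used so that extra platform-injected flags do not crash the script):

```
import argparse

parser = argparse.ArgumentParser(description='PyTorch MNIST Example')
parser.add_argument('--epoch_size', type=int, default=1, help='how much epoch to train')
parser.add_argument('--batch_size', type=int, default=256, help='how much batch_size in epoch')

# parse_known_args tolerates unknown flags that the platform may add
args, unknown = parser.parse_known_args()
print(args.epoch_size, args.batch_size)
```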

## 4 Viewing the results

### 4.1 Run logs can be viewed on the training job page

At present the job log only contains what the code prints; see the print calls in the example train.py.

### 4.2 Model files can be downloaded after training finishes

![avatar](Example_picture/结果下载.png)

## For any questions about the example code, feel free to open an issue in this project.

inference.py (+5, −1)

@@ -32,6 +32,7 @@ from torch.utils.data import DataLoader
from torchvision.transforms import ToTensor
import os
import argparse
from model import Model



@@ -53,7 +54,10 @@ if __name__ == '__main__':
    #If the file name is fixed, model_path can be hard-coded
    model_path = '/model/'+args.modelname

-    model = torch.load(model_path).to(device)
+    model = Model().to(device)
+    checkpoint = torch.load(model_path)
+    model.load_state_dict(checkpoint['model'])
+    model.eval()

    correct = 0

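Note: this change works because the training scripts in this PR now save a dict of state_dicts instead of a pickled Module, so inference must rebuild the network and load the weights into it. A minimal sketch of the round trip (the path is illustrative; it assumes model.py from this repository is importable):

```
import torch
from model import Model

# Save side, matching the format written by the training scripts
model = Model()
state = {'model': model.state_dict(), 'epoch': 0}  # optimizer state omitted for brevity
torch.save(state, '/tmp/mnist_ckpt.pkl')

# Load side, as inference.py now does
model = Model()
checkpoint = torch.load('/tmp/mnist_ckpt.pkl')
model.load_state_dict(checkpoint['model'])
model.eval()
```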

pretrain.py (+125, −0)

@@ -0,0 +1,125 @@
#!/usr/bin/python
#coding=utf-8
'''
If there are Chinese comments in the code, please add at the beginning:
#!/usr/bin/python
#coding=utf-8

1. The dataset structure of the single dataset in this example:
MnistDataset_torch.zip
├── test
└── train

2. Because of A100 compatibility, use the platform's recommended CUDA 11 image for the training
environment, then adjust the code and submit the image.
The image of this example is: dockerhub.pcl.ac.cn:5000/user-images/openi:cuda111_python37_pytorch191
In the training environment, the uploaded dataset will be automatically placed in the /dataset directory.
Note: the paths differ between selecting a single dataset and multiple datasets.
(1)If a single dataset is selected, e.g. MnistDataset_torch.zip,
the dataset directories are /dataset/train and /dataset/test;
(2)If multiple datasets are selected and MnistDataset_torch.zip is among them,
the dataset directories are /dataset/MnistDataset_torch/train and /dataset/MnistDataset_torch/test;
(3)If a pre-trained model file is selected, its path is passed to the script as the parameter ckpt_url;

The model download path is /model by default. Write the model output to /model,
and the Qizhi platform will provide file downloads under the /model directory.
'''


from model import Model
import numpy as np
import torch
from torchvision.datasets import mnist
from torch.nn import CrossEntropyLoss
from torch.optim import SGD
from torch.utils.data import DataLoader
from torchvision.transforms import ToTensor
import argparse
import os

# Training settings
parser = argparse.ArgumentParser(description='PyTorch MNIST Example')
#The dataset location is placed under /dataset
parser.add_argument('--traindata', default="/dataset/train", help='path to train dataset')
parser.add_argument('--testdata', default="/dataset/test", help='path to test dataset')
parser.add_argument('--epoch_size', type=int, default=10, help='how much epoch to train')
parser.add_argument('--batch_size', type=int, default=256, help='how much batch_size in epoch')
# Path of the pretrained model file
parser.add_argument('--ckpt_url', default="", help='pretrain model path')

# Parameter declarations
WORKERS = 0  # number of DataLoader worker threads
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = Model().to(device)
optimizer = SGD(model.parameters(), lr=1e-1)
cost = CrossEntropyLoss()

# Model training
def train(model, train_loader, epoch):
    model.train()
    train_loss = 0
    for i, data in enumerate(train_loader, 0):
        x, y = data
        x = x.to(device)
        y = y.to(device)
        optimizer.zero_grad()
        y_hat = model(x)
        loss = cost(y_hat, y)
        loss.backward()
        optimizer.step()
        train_loss += loss
    loss_mean = train_loss / (i+1)
    print('Train Epoch: {}\t Loss: {:.6f}'.format(epoch, loss_mean.item()))

# Model testing
def test(model, test_loader, test_data):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for i, data in enumerate(test_loader, 0):
            x, y = data
            x = x.to(device)
            y = y.to(device)
            optimizer.zero_grad()
            y_hat = model(x)
            test_loss += cost(y_hat, y).item()
            pred = y_hat.max(1, keepdim=True)[1]
            correct += pred.eq(y.view_as(pred)).sum().item()
    test_loss /= (i+1)
    print('Test set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss, correct, len(test_data), 100. * correct / len(test_data)))

def main():
    # If a saved model exists, load it and continue training from it
    if os.path.exists(args.ckpt_url):
        checkpoint = torch.load(args.ckpt_url)
        model.load_state_dict(checkpoint['model'])
        optimizer.load_state_dict(checkpoint['optimizer'])
        start_epoch = checkpoint['epoch']
        print('Successfully loaded weights from epoch {}!'.format(start_epoch))
    else:
        start_epoch = 0
        print('No saved model found; training from scratch!')
    for epoch in range(start_epoch+1, epochs):
        train(model, train_loader, epoch)
        test(model, test_loader, test_dataset)
        # Save the model
        state = {'model':model.state_dict(), 'optimizer':optimizer.state_dict(), 'epoch':epoch}
        torch.save(state, '/model/mnist_epoch{}.pkl'.format(epoch))

if __name__ == '__main__':
    args, unknown = parser.parse_known_args()
    #log output
    print('cuda is available:{}'.format(torch.cuda.is_available()))
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    batch_size = args.batch_size
    epochs = args.epoch_size
    train_dataset = mnist.MNIST(root=args.traindata, train=True, transform=ToTensor(), download=False)
    test_dataset = mnist.MNIST(root=args.testdata, train=False, transform=ToTensor(), download=False)
    train_loader = DataLoader(train_dataset, batch_size=batch_size)
    test_loader = DataLoader(test_dataset, batch_size=batch_size)
    main()


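Note: when --ckpt_url points at an existing checkpoint, pretrain.py resumes at the epoch after the saved one, because the loop is range(start_epoch+1, epochs). A minimal sketch of that numbering (the values are illustrative):

```
# Resume numbering used by pretrain.py: a checkpoint saved at epoch 1
# makes training continue at epoch 2 and stop before epoch_size.
epochs = 10       # from --epoch_size
start_epoch = 1   # checkpoint['epoch'] from, e.g., mnist_epoch1_0.73.pkl
for epoch in range(start_epoch + 1, epochs):
    print('training epoch', epoch)  # prints 2, 3, ..., 9
```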

pretrain_for_c2net.py (+141, −0)

@@ -0,0 +1,141 @@
#!/usr/bin/python
#coding=utf-8
'''
If there are Chinese comments in the code,please add at the beginning:
#!/usr/bin/python
#coding=utf-8

In the training environment,
(1)the code will be automatically placed in the /tmp/code directory,
(2)the uploaded dataset will be automatically placed in the /tmp/dataset directory
Note: the paths differ between selecting a single dataset and multiple datasets.
If a single dataset is selected, e.g. MnistDataset_torch.zip,
the dataset directories are /tmp/dataset/train and /tmp/dataset/test;

The dataset structure of the single dataset in the training image in this example:
tmp
└── dataset
    ├── test
    └── train

If multiple datasets are selected, such as MnistDataset_torch.zip and checkpoint_epoch1_0.73.zip,
the dataset directories are /tmp/dataset/MnistDataset_torch/train, /tmp/dataset/MnistDataset_torch/test
and /tmp/dataset/checkpoint_epoch1_0.73/mnist_epoch1_0.73.pkl
The dataset structure in the training image for multiple datasets in this example:
tmp
└── dataset
    ├── MnistDataset_torch
    │   ├── test
    │   └── train
    └── checkpoint_epoch1_0.73
        └── mnist_epoch1_0.73.pkl

(3)the model download path is /tmp/output by default; write the model output to /tmp/output,
and the Qizhi platform will provide file downloads under the /tmp/output directory.
(4)If a pre-trained model file is selected, its path is passed to the script as the parameter ckpt_url;

In addition, if you want to get the model files back after each epoch of training, you can call the uploader_for_gpu tool,
which is written as:
import os
os.system("cd /tmp/script_for_grampus/ &&./uploader_for_gpu " + "/tmp/output/")
'''


from model import Model
import numpy as np
import torch
from torchvision.datasets import mnist
from torch.nn import CrossEntropyLoss
from torch.optim import SGD
from torch.utils.data import DataLoader
from torchvision.transforms import ToTensor
import argparse
import os

# Training settings
parser = argparse.ArgumentParser(description='PyTorch MNIST Example')
#The dataset location is placed under /tmp/dataset
parser.add_argument('--traindata', default="/tmp/dataset/train", help='path to train dataset')
parser.add_argument('--testdata', default="/tmp/dataset/test", help='path to test dataset')
parser.add_argument('--epoch_size', type=int, default=10, help='how much epoch to train')
parser.add_argument('--batch_size', type=int, default=256, help='how much batch_size in epoch')
# Path of the pretrained model file
parser.add_argument('--ckpt_url', default="", help='pretrain model path')

# Parameter declarations
WORKERS = 0  # number of DataLoader worker threads
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = Model().to(device)
optimizer = SGD(model.parameters(), lr=1e-1)
cost = CrossEntropyLoss()

# Model training
def train(model, train_loader, epoch):
    model.train()
    train_loss = 0
    for i, data in enumerate(train_loader, 0):
        x, y = data
        x = x.to(device)
        y = y.to(device)
        optimizer.zero_grad()
        y_hat = model(x)
        loss = cost(y_hat, y)
        loss.backward()
        optimizer.step()
        train_loss += loss
    loss_mean = train_loss / (i+1)
    print('Train Epoch: {}\t Loss: {:.6f}'.format(epoch, loss_mean.item()))

# Model testing
def test(model, test_loader, test_data):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for i, data in enumerate(test_loader, 0):
            x, y = data
            x = x.to(device)
            y = y.to(device)
            optimizer.zero_grad()
            y_hat = model(x)
            test_loss += cost(y_hat, y).item()
            pred = y_hat.max(1, keepdim=True)[1]
            correct += pred.eq(y.view_as(pred)).sum().item()
    test_loss /= (i+1)
    print('Test set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss, correct, len(test_data), 100. * correct / len(test_data)))

def main():
    # If a saved model exists, load it and continue training from it
    if os.path.exists(args.ckpt_url):
        checkpoint = torch.load(args.ckpt_url)
        model.load_state_dict(checkpoint['model'])
        optimizer.load_state_dict(checkpoint['optimizer'])
        start_epoch = checkpoint['epoch']
        print('Successfully loaded weights from epoch {}!'.format(start_epoch))
    else:
        start_epoch = 0
        print('No saved model found; training from scratch!')
    for epoch in range(start_epoch+1, epochs):
        train(model, train_loader, epoch)
        test(model, test_loader, test_dataset)
        # Save the model
        state = {'model':model.state_dict(), 'optimizer':optimizer.state_dict(), 'epoch':epoch}
        torch.save(state, '/tmp/output/mnist_epoch{}.pkl'.format(epoch))
        #After each epoch of training, uploader_for_gpu sends the result files under /tmp/output back to Qizhi
        os.system("cd /tmp/script_for_grampus/ &&./uploader_for_gpu " + "/tmp/output/")

if __name__ == '__main__':
    args, unknown = parser.parse_known_args()
    #log output
    print('cuda is available:{}'.format(torch.cuda.is_available()))
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    batch_size = args.batch_size
    epochs = args.epoch_size
    train_dataset = mnist.MNIST(root=args.traindata, train=True, transform=ToTensor(), download=False)
    test_dataset = mnist.MNIST(root=args.testdata, train=False, transform=ToTensor(), download=False)
    train_loader = DataLoader(train_dataset, batch_size=batch_size)
    test_loader = DataLoader(test_dataset, batch_size=batch_size)
    main()



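Note: both *_for_c2net.py scripts push /tmp/output back to Qizhi with the same shell call. A small wrapper, shown as a sketch (the function name is illustrative; uploader_for_gpu is the platform tool described in the README and exists only inside C2Net training containers):

```
import os

def upload_output(output_dir="/tmp/output/"):
    """Send the contents of output_dir back to Qizhi after an epoch."""
    ret = os.system("cd /tmp/script_for_grampus/ && ./uploader_for_gpu " + output_dir)
    if ret != 0:
        print('upload failed with exit status', ret)
```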
test.py (+0, −1)

@@ -1 +0,0 @@
-from einops import rearrange

train.py (+29, −1)

@@ -30,6 +30,7 @@ from torch.optim import SGD
from torch.utils.data import DataLoader
from torchvision.transforms import ToTensor
import argparse
import os

# Training settings
parser = argparse.ArgumentParser(description='PyTorch MNIST Example')
@@ -39,7 +40,33 @@ parser.add_argument('--testdata', default="/dataset/test" ,help='path to test da
parser.add_argument('--epoch_size', type=int, default=1, help='how much epoch to train')
parser.add_argument('--batch_size', type=int, default=256, help='how much batch_size in epoch')

# Parameter declarations
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = Model().to(device)
optimizer = SGD(model.parameters(), lr=1e-1)

if __name__ == '__main__':
    print(os.path.abspath(__file__))  # absolute path of this file

    currentPath = os.getcwd()  # current working directory
    print(currentPath)
    print(os.path.dirname(os.path.abspath(__file__)))

    print(os.path.abspath(os.path.join(os.path.dirname(__file__), os.path.pardir)))  # parent directory

    print("bbbbbbbbbbbbbbbbbbbbbbbbb")
    print(os.listdir('/'))  # files in the root directory
    print("ccccccccccccccccccccccccc")
    print(os.listdir())  # files in the current directory
    print("ddddddddddddddddddddddddd")
    print(os.system('ls -l'))
    print("*************************")
    result1 = os.system('ls')
    print(result1)  # prints 0 (the exit status)
    print(f'result1 = {result1}')

    l = os.popen('ls')
    print(l.readlines())
    args, unknown = parser.parse_known_args()
    #log output
    print('cuda is available:{}'.format(torch.cuda.is_available()))
@@ -83,4 +110,5 @@ if __name__ == '__main__':
_sum += _.shape[0]
print('accuracy: {:.2f}'.format(correct / _sum))
#The model output location is placed under /model
-torch.save(model, '/model/mnist_epoch{}_{:.2f}.pkl'.format(_epoch+1, correct / _sum))
+state = {'model':model.state_dict(), 'optimizer':sgd.state_dict(), 'epoch':_epoch+1}
+torch.save(state, '/model/mnist_epoch{}_{:.2f}.pkl'.format(_epoch+1, correct / _sum))

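Note: the debug block above mixes os.system, which streams the command's output to stdout and returns its exit status (hence the printed 0), with os.popen, which captures stdout instead. A minimal sketch of the difference:

```
import os

status = os.system('ls -l')  # output goes straight to stdout; returns exit status
print(status)                # 0 on success

pipe = os.popen('ls')        # captures the command's stdout instead
print(pipe.readlines())      # list of output lines
pipe.close()
```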
train_for_c2net.py (+40, −2)

@@ -7,9 +7,38 @@ If there are Chinese comments in the code,please add at the beginning:

In the training environment,
the code will be automatically placed in the /tmp/code directory,
the uploaded dataset will be automatically placed in the /tmp/dataset directory

Note: the paths differ between selecting a single dataset and multiple datasets.
(1)If a single dataset is selected, e.g. MnistDataset_torch.zip,
the dataset directories are /tmp/dataset/train and /tmp/dataset/test;

The dataset structure of the single dataset in the training image in this example:
tmp
└── dataset
    ├── test
    └── train

(2)If multiple datasets are selected, such as MnistDataset_torch.zip and checkpoint_epoch1_0.73.zip,
the dataset directories are /tmp/dataset/MnistDataset_torch/train, /tmp/dataset/MnistDataset_torch/test
and /tmp/dataset/checkpoint_epoch1_0.73/mnist_epoch1_0.73.pkl
The dataset structure in the training image for multiple datasets in this example:
tmp
└── dataset
    ├── MnistDataset_torch
    │   ├── test
    │   └── train
    └── checkpoint_epoch1_0.73
        └── mnist_epoch1_0.73.pkl

(3)The model download path is /tmp/output by default; write the model output to /tmp/output,
and the Qizhi platform will provide file downloads under the /tmp/output directory.

In addition, if you want to get the model files back after each epoch of training, you can call the uploader_for_gpu tool,
which is written as:
import os
os.system("cd /tmp/script_for_grampus/ &&./uploader_for_gpu " + "/tmp/output/")
'''


@@ -22,6 +51,7 @@ from torch.optim import SGD
from torch.utils.data import DataLoader
from torchvision.transforms import ToTensor
import argparse
import os

# Training settings
parser = argparse.ArgumentParser(description='PyTorch MNIST Example')
@@ -31,6 +61,11 @@ parser.add_argument('--testdata', default="/tmp/dataset/test" ,help='path to tes
parser.add_argument('--epoch_size', type=int, default=1, help='how much epoch to train')
parser.add_argument('--batch_size', type=int, default=256, help='how much batch_size in epoch')

# Parameter declarations
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = Model().to(device)
optimizer = SGD(model.parameters(), lr=1e-1)

if __name__ == '__main__':
    args, unknown = parser.parse_known_args()
    #log output
@@ -75,4 +110,7 @@ if __name__ == '__main__':
_sum += _.shape[0]
print('accuracy: {:.2f}'.format(correct / _sum))
#The model output location is placed under /tmp/output
-torch.save(model, '/tmp/output/mnist_epoch{}_{:.2f}.pkl'.format(_epoch+1, correct / _sum))
+state = {'model':model.state_dict(), 'optimizer':optimizer.state_dict(), 'epoch':_epoch+1}
+torch.save(state, '/tmp/output/mnist_epoch{}_{:.2f}.pkl'.format(_epoch+1, correct / _sum))
+#After each epoch of training, uploader_for_gpu sends the result files under /tmp/output back to Qizhi
+os.system("cd /tmp/script_for_grampus/ &&./uploader_for_gpu " + "/tmp/output/")
