Environment Preparation
- All nodes run Ubuntu 18.04.
- All nodes share the same root password and allow SSH access.
- All nodes are configured with short-domain-name resolution.
/etc/hosts is configured as follows (example: single master, two workers):
127.0.0.1 localhost
your_master_local_ip master
your_master_local_ip master.sigsus.cn
your_worker_local_ip worker01
your_worker_local_ip worker01.sigsus.cn
your_second_worker_local_ip worker02
your_second_worker_local_ip worker02.sigsus.cn
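The hosts entries above can be generated instead of typed by hand. This is a minimal sketch: the `192.168.3.x` addresses are sample values (replace them with your real LAN IPs), and it writes to a scratch file so you can review before copying into /etc/hosts.

```shell
# Sample node IPs -- placeholders, substitute your real addresses.
MASTER_IP=192.168.3.2
WORKER01_IP=192.168.3.3
WORKER02_IP=192.168.3.4
HOSTS_FILE=./hosts.generated   # review this file, then merge into /etc/hosts

{
  echo "127.0.0.1 localhost"
  # Emit both the short name and the full sigsus.cn name for each node.
  for entry in "master:${MASTER_IP}" "worker01:${WORKER01_IP}" "worker02:${WORKER02_IP}"; do
    name=${entry%%:*}
    ip=${entry##*:}
    echo "${ip} ${name}"
    echo "${ip} ${name}.sigsus.cn"
  done
} > "${HOSTS_FILE}"

cat "${HOSTS_FILE}"
```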
-
Configure harbor (master node)
- Install harbor
- Configure harbor
- Configure the certificate
Refer to these configuration steps.
Notes:
1) Generating the certificate for the domain harbor.sigsus.cn is recommended.
2) This domain must match the registry address given in the private_docker_registry field of config.yaml.
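If a self-signed certificate is acceptable for your environment (the linked guide may instead use a CA-signed chain), one way to generate a certificate for harbor.sigsus.cn is sketched below. The `-addext` flag needs OpenSSL 1.1.1 or newer; the `./cert` output directory is an assumption, matching the paths used in the next step.

```shell
mkdir -p ./cert
# Self-signed cert for harbor.sigsus.cn with a SAN entry
# (modern docker clients reject certificates without a SAN).
openssl req -x509 -nodes -newkey rsa:2048 -days 365 \
  -keyout ./cert/harbor.sigsus.cn.key \
  -out ./cert/harbor.sigsus.cn.crt \
  -subj "/CN=harbor.sigsus.cn" \
  -addext "subjectAltName=DNS:harbor.sigsus.cn"

# Inspect the generated subject.
openssl x509 -in ./cert/harbor.sigsus.cn.crt -noout -subject
```

Copy the two files to /opt/harbor/cert/ (or point harbor's configuration at wherever you keep them).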
- Change the port and certificate paths
HTTPS port: 8443
certificate: /opt/harbor/cert/harbor.sigsus.cn.crt
private_key: /opt/harbor/cert/harbor.sigsus.cn.key
- Start harbor
cd /opt/harbor/
docker-compose up -d
Building the Components
Component information
Building the components
1. Set up the build environment
- Create a virtual environment
virtualenv -p python2.7 pythonenv2.7
. pythonenv2.7/bin/activate
- Install the Python packages
cd DLWorkspace/src/ClusterBootstrap/
pip install -r scripts/requirements.txt
- Install golang (optional; for building Atlas-specific components; only needed when the cluster has NPU devices)
See: https://golang.org/doc/install
2. Build the DLTS main project components
Build restfulapi2
cd DLWorkspace/src/ClusterBootstrap/
./deploy.py docker build restfulapi2
Build init-container
./deploy.py docker push init-container
Build job-exporter
./deploy.py docker push job-exporter
Build gpu-reporter
./deploy.py docker push gpu-reporter
Build watchdog
./deploy.py docker push watchdog
Build repairmanager2
./deploy.py docker push repairmanager2
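The per-component pushes above all follow one pattern, so they can be driven by a single loop. This is a sketch: `DRY_RUN=echo` only prints the commands (clear it to invoke deploy.py for real), and restfulapi2 is excluded because it uses `docker build` rather than `docker push` above.

```shell
COMPONENTS="init-container job-exporter gpu-reporter watchdog repairmanager2"
DRY_RUN="echo"   # set DRY_RUN= to actually run deploy.py

# Print (or run) one push per component, keeping a plan file for review.
for c in $COMPONENTS; do
  $DRY_RUN ./deploy.py docker push "$c"
done > push_plan.txt
cat push_plan.txt
```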
3. Build AIArts-Frontend
cd AIArts-Frontend/
docker build -t dlworkspace_aiarts-frontend:1.0.0 .
4. Build AIArts-Backend
cd AIArtsBackend/deployment/
bash build.sh
5. Build user-dashboard-frontend
cd user-dashboard-frontend/
docker build -t dlworkspace_custom-user-dashboard-frontend:latest .
6. Build user-dashboard-backend
cd user-dashboard-backend/
docker build -t dlworkspace_custom-user-dashboard-backend:latest .
7. Build image-label-frontend
cd NewObjectLabel/
docker build -t dlworkspace_image-label:latest .
8. Build image-label-backend
cd DLWorkspace/src/ClusterBootstrap/
./deploy.py docker push data-platform-backend
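Unlike the `deploy.py docker push` components, the images built manually in steps 3-7 only exist locally. One hedged way to get them into the private registry is to retag and push them; the `harbor.sigsus.cn:8443/dlts` prefix is an assumption matching the private_docker_registry field in config.yaml.

```shell
REGISTRY=harbor.sigsus.cn:8443/dlts
DOCKER="echo docker"   # dry run; set DOCKER=docker to execute

# Retag each locally built image under the private registry, then push it.
for img in \
    dlworkspace_aiarts-frontend:1.0.0 \
    dlworkspace_custom-user-dashboard-frontend:latest \
    dlworkspace_custom-user-dashboard-backend:latest \
    dlworkspace_image-label:latest; do
  $DOCKER tag "$img" "$REGISTRY/$img"
  $DOCKER push "$REGISTRY/$img"
done > retag_plan.txt
cat retag_plan.txt
```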
9. Build ascend-for-volcano
- Create the directories
mkdir -p ${GOPATH}/{src/github.com/google,src/k8s.io,src/volcano.sh}
- Upload the ascend-for-volcano folder from the software package to the "${GOPATH}/src/volcano.sh/" directory and rename the folder to volcano
- Create the build folder
cd ${GOPATH}/src/volcano.sh/volcano/
mkdir -p build
- Create and edit build.sh
Run cd ${GOPATH}/src/volcano.sh/volcano/build
Run vim build.sh and enter:
#!/bin/sh
cd ${GOPATH}/src/volcano.sh/volcano/
make clean
export PATH=$GOPATH/bin:$PATH
export GO111MODULE=off
export GOMOD=""
export GIT_SSL_NO_VERIFY=1
make image_bins
make images
make generate-yaml
mkdir _output/DockFile/
docker save -o _output/DockFile/vc-webhook-manager-base.tar.gz volcanosh/vc-webhook-manager-base
docker save -o _output/DockFile/vc-webhook-manager.tar.gz volcanosh/vc-webhook-manager
docker save -o _output/DockFile/vc-controller-manager.tar.gz volcanosh/vc-controller-manager
docker save -o _output/DockFile/vc-scheduler.tar.gz volcanosh/vc-scheduler
- Build the images
chmod +x build.sh
./build.sh
- Check the images
docker images | grep volcanosh
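Before copying the saved tarballs anywhere, it is worth verifying that build.sh produced all four. The helper below is a sketch (it is not part of build.sh), checking for the exact filenames written above.

```shell
# Report any of the four expected image tarballs missing from a directory;
# returns non-zero if at least one is absent.
check_tarballs() {
  dir=$1
  missing=0
  for t in vc-webhook-manager-base vc-webhook-manager vc-controller-manager vc-scheduler; do
    if [ ! -f "$dir/$t.tar.gz" ]; then
      echo "missing: $t.tar.gz"
      missing=1
    fi
  done
  return $missing
}

# Usage against the real output directory:
# check_tarballs "${GOPATH}/src/volcano.sh/volcano/_output/DockFile" && echo "all tarballs present"
```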
10. Build ascend-device-plugin
The following steps are performed on the Atlas server.
- Log in to the Atlas server and install golang
- Configure the golang build environment
Run vim ~/.bashrc, add the following, and save:
export GO111MODULE=on
export GOPROXY=https://gocenter.io
export GONOSUMDB=*
- Upload the ascend-device-plugin folder to any directory (e.g. "/home")
- Create prepare_build.sh in the ascend-device-plugin build directory
cd /home/ascend-device-plugin/build
vim prepare_build.sh
Fill in the following, adjusted to your setup:
#!/bin/bash
ASCEND_TYPE=910 # choose 310 or 910 according to the chip type
ASCEND_INSTALL_PATH=/usr/local/Ascend # driver install path; adjust as needed
USE_ASCEND_DOCKER=false # whether to use Ascend Docker; set to false here
CUR_DIR=$(dirname $(readlink -f $0))
TOP_DIR=$(realpath ${CUR_DIR}/..)
LD_LIBRARY_PATH_PARA1=${ASCEND_INSTALL_PATH}/driver/lib64/driver
LD_LIBRARY_PATH_PARA2=${ASCEND_INSTALL_PATH}/driver/lib64
apt-get install -y pkg-config
apt-get install -y dos2unix
TYPE=Ascend910
PKG_PATH=${TOP_DIR}/src/plugin/config/config_910
PKG_PATH_STRING=\$\{TOP_DIR\}/src/plugin/config/config_910
LIBDRIVER="driver/lib64/driver"
if [ ${ASCEND_TYPE} == "310" ]; then
TYPE=Ascend310
LD_LIBRARY_PATH_PARA1=${ASCEND_INSTALL_PATH}/driver/lib64
PKG_PATH=${TOP_DIR}/src/plugin/config/config_310
PKG_PATH_STRING=\$\{TOP_DIR\}/src/plugin/config/config_310
LIBDRIVER="/driver/lib64"
fi
sed -i "s/Ascend[0-9]\\{3\\}/${TYPE}/g" ${TOP_DIR}/ascendplugin.yaml
sed -i "s#ath: /usr/local/Ascend/driver#ath: ${ASCEND_INSTALL_PATH}/driver#g" ${TOP_DIR}/ascendplugin.yaml
sed -i "/^ENV LD_LIBRARY_PATH /c ENV LD_LIBRARY_PATH ${LD_LIBRARY_PATH_PARA1}:${LD_LIBRARY_PATH_PARA2}/common" ${TOP_DIR}/Dockerfile
sed -i "/^ENV USE_ASCEND_DOCKER /c ENV USE_ASCEND_DOCKER ${USE_ASCEND_DOCKER}" ${TOP_DIR}/Dockerfile
sed -i "/^libdriver=/c libdriver=$\\{prefix\\}/${LIBDRIVER}" ${PKG_PATH}/ascend_device_plugin.pc
sed -i "/^prefix=/c prefix=${ASCEND_INSTALL_PATH}" ${PKG_PATH}/ascend_device_plugin.pc
sed -i "/^CONFIGDIR=/c CONFIGDIR=${PKG_PATH_STRING}" ${CUR_DIR}/build_in_docker.sh
- Build the images
chmod +x prepare_build.sh
./prepare_build.sh
chmod +x build_910.sh
dos2unix build_910.sh
./build_910.sh dockerimages
- Check the images
docker images | grep deviceplugin
11. Build kfserving
IMAGE_PUSH_HUB_URL=harbor.sigsus.cn/sz_gongdianju/apulistech
or
IMAGE_PUSH_HUB_URL=apulistech
./scripts/kfserving.sh push istio
./scripts/kfserving.sh push knative
./scripts/kfserving.sh push kfserving
Running the Deployment
1. Set up the configuration file
cd DLWorkspace/src/ClusterBootstrap/
vim config.yaml
cluster_name: atlas
network:
  domain: sigsus.cn
  container-network-iprange: "10.0.0.0/8"
UserGroups:
  DLWSAdmins:
    Allowed:
      - jinlmsft@hotmail.com
    gid: "20001"
    uid: "20000"
  DLWSRegister:
    Allowed:
      - '@gmail.com'
      - '@live.com'
      - '@outlook.com'
      - '@hotmail.com'
      - '@apulis.com'
    gid: "20001"
    uid: 20001-29999
WebUIadminGroups:
  - DLWSAdmins
WebUIauthorizedGroups:
  - DLWSAdmins
WebUIregisterGroups:
  - DLWSRegister
datasource: MySQL
mysql_password: apulis#2019#wednesday
webuiport: 3081
useclusterfile: true
machines:
  master:
    role: infrastructure
    private-ip: 192.168.3.2
    archtype: arm64
    type: npu
    vendor: huawei
  worker01:
    archtype: amd64
    role: worker
    type: gpu
    vendor: nvidia
    os: ubuntu
  worker02:
    archtype: amd64
    role: worker
    type: gpu
    vendor: nvidia
    os: ubuntu
# settings for docker
private_docker_registry: harbor.sigsus.cn:8443/dlts/
dockerregistry: apulistech/
dockers:
  hub: apulistech/
  tag: "1.9"
dataFolderAccessPoint: ''
Authentications:
  Microsoft:
    TenantId:
    ClientId:
    ClientSecret:
  Wechat:
    AppId:
    AppSecret:
mountpoints:
  nfsshare1:
    type: nfs
    server: master
    filesharename: /mnt/local
    curphysicalmountpoint: /mntdlws
    mountpoints: ""
repair-manager:
  cluster_name: "atlas"
  ecc_rule:
    cordon_dry_run: True
  alert:
    smtp_url: smtp.qq.com
    login:
    password:
    sender:
    receiver: ["XXX@XXX.com"]
enable_custom_registry_secrets: True
platform_name: Apulis Platform
kube-vip: XXX.XXX.XXX.XXX
Configuration notes
- Fields that must be adjusted to your environment:
1) machines: set the machine IPs and short domain names
2) dockerregistry: set the organization name on hub.docker.com
3) kube-vip: with a single master, fill in the master node's internal IP
- Fields that depend on other steps:
1) private_docker_registry: harbor.sigsus.cn:8443 in this field must match the domain used when configuring harbor
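A quick pre-flight check can catch the dependencies called out above before deploying. This helper is a sketch (not part of deploy.py): it only greps for the fields listed in the notes, and in particular that private_docker_registry still uses the harbor domain and port.

```shell
# Return non-zero if any of the checked config.yaml fields is missing
# or inconsistent with the harbor setup.
check_config() {
  cfg=$1
  ok=0
  grep -q "^cluster_name:" "$cfg" || { echo "cluster_name is missing"; ok=1; }
  grep -q "^private_docker_registry: *harbor\.sigsus\.cn:8443" "$cfg" \
    || { echo "private_docker_registry must use the harbor domain and port"; ok=1; }
  grep -q "^kube-vip:" "$cfg" || { echo "kube-vip is missing"; ok=1; }
  return $ok
}

# Usage: check_config config.yaml && echo "config looks consistent"
```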
2. Run the deployment
- Switch to the directory: cd DLWorkspace/src/ClusterBootstrap/
- Set up the deployment node environment
./scripts/prepare_ubuntu_dev.sh
- Set up the cluster node environments
./deploy.py --verbose sshkey install
./deploy.py --verbose runscriptonall ./scripts/prepare_ubuntu.sh
./deploy.py --verbose runscriptonall ./scripts/prepare_ubuntu.sh continue
./deploy.py --verbose execonall sudo usermod -aG docker dlwsadmin
- Install the K8S cluster
./deploy.py --verbose execonall sudo swapoff -a
./deploy.py runscriptonroles infra worker ./scripts/install_kubeadm.sh
./deploy.py --verbose kubeadm init
./deploy.py --verbose copytoall ./deploy/sshkey/admin.conf /root/.kube/config
./deploy.py --verbose kubeadm join
./deploy.py --verbose -y kubernetes labelservice
./deploy.py --verbose -y labelworker
- Render the cluster configuration
./deploy.py renderservice
./deploy.py renderimage
./deploy.py webui
./deploy.py nginx webui3
./deploy.py nginx fqdn
./deploy.py nginx config
- Mount the shared storage
./deploy.py runscriptonroles infra worker ./scripts/install_nfs.sh
./deploy.py --force mount
./deploy.py execonall "df -h"
- Start the cluster services
./deploy.py kubernetes start nvidia-device-plugin
./deploy.py kubernetes start a910-device-plugin
./deploy.py kubernetes start mysql
./deploy.py kubernetes start jobmanager2 restfulapi2 nginx custommetrics repairmanager2 openresty
./deploy.py --sudo --background runscriptonall scripts/npu/npu_info_gen.py
./deploy.py kubernetes start monitor
./deploy.py kubernetes start istio
./deploy.py kubernetes start knative kfserving
./deploy.py kubernetes start webui3 custom-user-dashboard image-label aiarts-frontend aiarts-backend data-platform
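After starting the services, every pod should eventually reach the Running (or Completed) state. The small filter below is a sketch, not part of deploy.py: it flags pods in any other state from `kubectl get pods -A` output, which is easier to scan than the full listing on a busy cluster.

```shell
# Print "name (status)" for pods whose STATUS column (field 4 of
# `kubectl get pods -A` output) is neither Running nor Completed.
flag_unhealthy() {
  awk 'NR > 1 && $4 != "Running" && $4 != "Completed" { print $2 " (" $4 ")" }'
}

# Usage on a live cluster:
# kubectl get pods -A | flag_unhealthy
```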