This post is outdated; please use the updated documentation.

Prerequisites

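Every host in the cluster runs Ubuntu 16.04 LTS and has an administrator account with the same username and password.
Every machine has proxy software listening at http://127.0.0.1:8118. Deploying the proxy is up to you and is outside the scope of this post.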

Initialization

All machines

Switch to the Tsinghua mirror

https://mirrors.tuna.tsinghua.edu.cn/help/ubuntu/

Update packages

sudo apt update
sudo apt upgrade

Install openssh-server

sudo apt install openssh-server

Install Docker

https://docs.docker.com/engine/install/ubuntu/

sudo apt remove docker docker-engine docker.io containerd runc
sudo apt update
sudo apt install \
    apt-transport-https \
    ca-certificates \
    curl \
    gnupg-agent \
    software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo apt-key fingerprint 0EBFCD88
sudo nano /etc/apt/sources.list.d/download_docker_com_linux_ubuntu.list

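The source entry is written by hand here, rather than with the method from the official guide above, because this is how PAI does it; otherwise the source ends up duplicated and apt update reports an error. The file contains this single line:

deb https://download.docker.com/linux/ubuntu xenial stable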

sudo dpkg --remove-architecture i386
sudo apt update
sudo apt install docker-ce docker-ce-cli containerd.io
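
Editing the source list in nano works, but the step can also be scripted so that rerunning it never duplicates the entry. The sketch below is one possible way to do that, not what PAI itself runs. `ensure_line` is a hypothetical helper name; it appends the line only when an identical line is not already in the file.

```shell
#!/bin/sh
# ensure_line FILE LINE
# Append LINE to FILE unless an identical line already exists.
# (Hypothetical helper for illustration, not part of Docker or PAI.)
ensure_line() {
    file=$1
    line=$2
    touch "$file" || return 1
    # -x: match the whole line, -F: fixed string, -q: no output
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}
```

Run as root, it could be called as ensure_line /etc/apt/sources.list.d/download_docker_com_linux_ubuntu.list "deb https://download.docker.com/linux/ubuntu xenial stable". Calling it a second time leaves the file unchanged.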

Install Python

In principle Ubuntu ships with python by default, but in practice I found that some machines did not have it, so confirm it manually to avoid errors later.

sudo apt install python
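
Since some machines turned out to lack python, it may be worth checking the required commands on every host before going further. This is a minimal sketch under that assumption. `missing_cmds` is a hypothetical helper, and the list of commands to check is your choice.

```shell
#!/bin/sh
# missing_cmds NAME...
# Print each NAME that cannot be found on PATH, one per line.
# (Hypothetical helper for illustration.)
missing_cmds() {
    for name in "$@"; do
        command -v "$name" >/dev/null 2>&1 || printf '%s\n' "$name"
    done
}
```

For example, missing_cmds python docker ssh prints nothing on a machine where all three are installed, and prints the name of each missing command otherwise.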

master

Install ntp

sudo apt install ntp

worker

Install the GPU driver

https://launchpad.net/~graphics-drivers/+archive/ubuntu/ppa

https://openpai.readthedocs.io/zh_CN/latest/manual/cluster-admin/installation-faqs-and-troubleshooting.html#how-to-check-whether-the-gpu-driver-is-installed

https://howtoinstall.co/en/ubuntu/xenial/xserver-xorg?action=remove

sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt update
sudo apt install nvidia-430
sudo apt autoremove xserver-xorg
sudo apt autoremove --purge xserver-xorg
sudo reboot

Install nvidia-container-runtime

https://github.com/NVIDIA/nvidia-container-runtime#installation

https://openpai.readthedocs.io/zh_CN/latest/manual/cluster-admin/installation-faqs-and-troubleshooting.html#how-to-install-nvidia-container-runtime

curl -s -L https://nvidia.github.io/nvidia-container-runtime/gpgkey | \
  sudo apt-key add -
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-container-runtime/$distribution/nvidia-container-runtime.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-runtime.list
sudo apt update
sudo apt install nvidia-container-runtime
sudo nano /etc/docker/daemon.json

{
  "default-runtime": "nvidia",
  "runtimes": {
      "nvidia": {
          "path": "/usr/bin/nvidia-container-runtime",
          "runtimeArgs": []
      }
  }
}

sudo systemctl restart docker
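
A typo in daemon.json is easy to miss, so a quick look at what the file actually declares can help. The sketch below is a rough check rather than a JSON parser. `default_runtime` is a hypothetical helper that assumes the "default-runtime" key and its value sit on one line, as in the file above.

```shell
#!/bin/sh
# default_runtime FILE
# Print the value of "default-runtime" found in FILE, if any.
# (Hypothetical helper; a line-based match, not a real JSON parser.)
default_runtime() {
    sed -n 's/.*"default-runtime"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$1"
}
```

With the configuration above, default_runtime /etc/docker/daemon.json should print nvidia. After the restart, the output of sudo docker info also lists the default runtime.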

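Test whether the driver works

sudo docker run nvidia/cuda:10.0-base nvidia-smi

Enable GPU persistence mode

Running this command enables GPU persistence mode, which can resolve problems such as slow GPU startup, utilization staying high when no job is running, and GPUs occasionally dropping off.

sudo nvidia-smi -pm 1

The setting is lost when the system reboots. To make it permanent, add nvidia-smi -pm 1 to /etc/rc.local, making sure it comes before the exit 0 line.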
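
The rc.local edit can be scripted as well. The following is a sketch, assuming rc.local ends with a line that is exactly exit 0. `add_before_exit0` is a hypothetical helper: it inserts the command before the first such line, appends it if there is none, and does nothing if the command is already present.

```shell
#!/bin/sh
# add_before_exit0 FILE CMD
# Insert CMD on its own line before the first "exit 0" line of FILE.
# If FILE has no "exit 0" line, CMD is appended at the end.
# (Hypothetical helper for illustration.)
add_before_exit0() {
    file=$1
    cmd=$2
    # Already present: nothing to do.
    if grep -qxF -- "$cmd" "$file"; then
        return 0
    fi
    tmp=$(mktemp) || return 1
    awk -v cmd="$cmd" '
        !done && $0 == "exit 0" { print cmd; done = 1 }
        { print }
        END { if (!done) print cmd }
    ' "$file" > "$tmp" && cat "$tmp" > "$file"   # cat keeps the file mode
    rm -f "$tmp"
}
```

Run as root, the call would be add_before_exit0 /etc/rc.local "nvidia-smi -pm 1".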

devbox

Configure passwordless login

Generate a key pair

ssh-keygen

Register the key with the remote host

ssh-copy-id username@remote_host

Test that passwordless login works

ssh username@remote_host
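
With more than a few machines, typing ssh-copy-id for each host gets tedious. The sketch below assumes a plain text file with one host per line (a hypothetical format, not the master.csv or worker.csv used later) and prints the command for each host so it can be reviewed before running. `copy_key_cmds` is a hypothetical helper name.

```shell
#!/bin/sh
# copy_key_cmds USER HOSTS_FILE
# Print one "ssh-copy-id USER@host" command per host listed in HOSTS_FILE.
# Blank lines and lines starting with '#' are skipped.
# (Hypothetical helper; it only prints commands, it does not run them.)
copy_key_cmds() {
    user=$1
    hosts_file=$2
    # The "|| [ -n ... ]" part handles a last line without a trailing newline.
    while IFS= read -r host || [ -n "$host" ]; do
        case $host in
            ''|'#'*) continue ;;
        esac
        printf 'ssh-copy-id %s@%s\n' "$user" "$host"
    done < "$hosts_file"
}
```

Afterwards, ssh -o BatchMode=yes username@remote_host true is a convenient check: with BatchMode enabled, ssh fails instead of prompting for a password, so a zero exit status means key-based login works.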

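Enable the proxies

Machines in the cluster need to pull images from gcr.io, which is not reachable from mainland China, so Docker has to be configured to use a proxy (the mirror given on the OpenPAI site no longer works). The devbox machine needs to download the PAI source code; if access to GitHub is slow, a proxy can be configured for git as well.

Enable the Docker proxy

https://docs.docker.com/config/daemon/systemd/#httphttps-proxy

This is the proxy the Docker daemon uses, for example when running docker pull.

sudo mkdir -p /etc/systemd/system/docker.service.d
sudo nano /etc/systemd/system/docker.service.d/http-proxy.conf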

[Service]
Environment="HTTP_PROXY=http://127.0.0.1:8118"
Environment="HTTPS_PROXY=http://127.0.0.1:8118"

sudo systemctl daemon-reload
sudo systemctl restart docker
sudo systemctl show --property=Environment docker

sudo docker pull gcr.io/google-containers/kube-apiserver:v1.15.11
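
Because the same drop-in file is needed on every machine, generating it from a function may be easier than editing it by hand each time. This is a sketch only. `proxy_dropin` is a hypothetical helper, and its default address is the proxy from the prerequisites.

```shell
#!/bin/sh
# proxy_dropin [PROXY_URL]
# Print a systemd drop-in that sets HTTP_PROXY and HTTPS_PROXY for Docker.
# PROXY_URL defaults to the address assumed throughout this post.
# (Hypothetical helper for illustration.)
proxy_dropin() {
    proxy=${1:-http://127.0.0.1:8118}
    printf '[Service]\n'
    printf 'Environment="HTTP_PROXY=%s"\n' "$proxy"
    printf 'Environment="HTTPS_PROXY=%s"\n' "$proxy"
}
```

The output can be written in place with proxy_dropin | sudo tee /etc/systemd/system/docker.service.d/http-proxy.conf, followed by the same daemon-reload and restart as above.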

Enable the git proxy

To enable it:

git config --global http.proxy http://127.0.0.1:8118
git config --global https.proxy http://127.0.0.1:8118

To disable it:

git config --global --unset https.proxy
git config --global --unset http.proxy

Start the installation

Everything below is done on devbox.

https://openpai.readthedocs.io/zh_CN/latest/manual/cluster-admin/installation-guide.html#_4

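Write the parameter files

Following the format in the guide above, create three files directly in the home directory: master.csv, worker.csv, and config.

Because a proxy is already in place, the gcr.io mirror settings from the official documentation are no longer needed, so the simple config file below is enough.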

user: <your-ssh-username>
password: <your-ssh-password>
branch_name: pai-1.3.y
docker_image_tag: v1.3.0

kubeadm_download_url: "https://shaiictestblob01.blob.core.chinacloudapi.cn/share-all/kubeadm"
hyperkube_download_url: "https://shaiictestblob01.blob.core.chinacloudapi.cn/share-all/hyperkube"
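
The post notes later that most remaining problems come from mistakes in config files, so a small check before starting may save a failed run. The sketch below assumes each setting is a top-level key: value line, as in the file above. `missing_keys` is a hypothetical helper that reports keys that are absent or have no value.

```shell
#!/bin/sh
# missing_keys FILE KEY...
# Print each KEY that has no "KEY: value" line in FILE, one per line.
# (Hypothetical helper; a line-based check, not a YAML parser.)
missing_keys() {
    file=$1
    shift
    for key in "$@"; do
        grep -q "^${key}:[[:space:]]*[^[:space:]]" "$file" || printf '%s\n' "$key"
    done
}
```

For the file above, missing_keys ~/config user password branch_name docker_image_tag should print nothing.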

Run the steps in order

git clone https://github.com/microsoft/pai.git
cd pai
git checkout pai-1.3.y
cd contrib/kubespray
/bin/bash quick-start-kubespray.sh -m ~/master.csv -w ~/worker.csv -c ~/config
/bin/bash quick-start-service.sh -m ~/master.csv -w ~/worker.csv -c ~/config
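
When these commands are pasted as a batch, a failure in the kubespray script does not stop the service script from starting. The sketch below is one way to guard against that, assuming the second script should run only after the first succeeds. `run_in_order` is a hypothetical helper, not something shipped with PAI.

```shell
#!/bin/sh
# run_in_order STEP...
# Run each STEP (a shell command string) in turn and stop at the first failure.
# (Hypothetical helper for illustration.)
run_in_order() {
    for step in "$@"; do
        printf '>>> %s\n' "$step"
        sh -c "$step" || {
            printf 'step failed: %s\n' "$step" >&2
            return 1
        }
    done
}
```

From pai/contrib/kubespray, it could be called with the two quick-start commands above as its two arguments, each quoted as a single string.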

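Follow-up

Once the services have started successfully, the remaining customization steps, apart from installing the MarketPlace plugin, generally go smoothly unless a config file contains a mistake.

MarketPlace configuration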

https://github.com/siaimes/openpaimarketplace

OpenPAI v1.3.0 Deployment Summary