Installing the NVIDIA GPU driver on GPU servers

1. Confirm that the server OS version is Ubuntu 16.04.2 (do this on every machine)
Pre-installation steps, from the official guide: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#pre-installation-actions

for i in xsgpu81 xsgpu82  xsgpu83 xsgpu84 xsgpu85; do qssh root@$i 'cat /etc/issue;uname -r';done
Ubuntu 16.04.2 LTS \n \l
4.4.0-62-generic
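
The check above can also be scripted so that a host on the wrong release stands out. This is only a minimal sketch: the function name is_expected_release is hypothetical, the default release string is an assumption taken from the output above, and the function inspects only the text it is given.

```shell
#!/bin/bash
# Return 0 if the given /etc/issue text starts with the expected release.
# is_expected_release is a hypothetical helper, not part of any tool.
is_expected_release() {
    local issue_text="$1"
    local expected="${2:-Ubuntu 16.04.2}"
    case "$issue_text" in
        "$expected "*|"$expected") return 0 ;;
        *) return 1 ;;
    esac
}

# Possible usage with the qssh loop from step 1:
# for i in xsgpu81 xsgpu82; do
#     is_expected_release "$(qssh root@$i 'cat /etc/issue')" || echo "$i: unexpected release"
# done
```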

2. Download and install the NVIDIA driver
If the machine is not clean (GPU-related software was installed on it before), you may first need to run service lightdm stop.

wget http://us.download.nvidia.com/XFree86/Linux-x86_64/375.26/NVIDIA-Linux-x86_64-375.26.run
root@xsgpu81:~# sudo sh NVIDIA-Linux-x86_64-375.26.run
Accept
OK
OK
OK

3. Install CUDA

wget http://ogo0b6qe6.bkt.clouddn.com/cuda_8.0.61_375.26_linux.run
chmod +x cuda_8.0.61_375.26_linux.run
sudo sh cuda_8.0.61_375.26_linux.run --silent
echo 'export PATH=/usr/local/cuda-8.0/bin:$PATH' >> /root/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64:$LD_LIBRARY_PATH' >> /root/.bashrc
source /root/.bashrc
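
Re-running step 3 appends the same lines to /root/.bashrc again. One way to avoid that is to add a line only when it is missing. This is a sketch, and append_once is a hypothetical helper name.

```shell
#!/bin/bash
# Append a line to a file only if that exact line is not already present.
# append_once is a hypothetical helper name.
append_once() {
    local line="$1" file="$2"
    touch "$file"
    # -F: fixed string, -x: match the whole line, -q: quiet
    grep -Fxq -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Possible usage (single quotes keep $PATH unexpanded until .bashrc is sourced):
# append_once 'export PATH=/usr/local/cuda-8.0/bin:$PATH' /root/.bashrc
# append_once 'export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64:$LD_LIBRARY_PATH' /root/.bashrc
```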

4. Copy the test binary to the server and run it

qscp NVIDIA_CUDA-8.0_Samples/0_Simple/vectorAdd/vectorAdd root@xsgpu81:/root/
root@xsgpu81:~# ./vectorAdd
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

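Manually deploying a mesos-agent node with GPU devices
Deploy mesos-agent and the other base services (boots-docker, consul, logbeat) on the GPU machine following the standard procedure.
Manual procedure:
Stop the mesos-agent service on the GPU machine: supervisorctl stop mesos-agent
Clean up the mesos-agent work_dir:
rm -rf $(cat /home/qboxserver/mesos-agent/current/conf/mesos-agent/work_dir)
Go to the mesos-agent configuration directory /home/qboxserver/mesos-agent/current/conf/mesos-agent and update the configuration:
Get the number and model of the GPUs on the machine with nvidia-smi -L; the number of GPUs listed is the total device count
Write the device model to the attributes file: echo "NETWORK:BRIDGE;GPU_MODEL:$MODEL" > attributes
Add the isolation setting: echo "cgroups/devices,gpu/nvidia" > isolation
List the usable GPU device indices: echo "0, 1, …, <device count - 1>" > nvidia_gpu_devices
Add the gpu resource to resources: {"name":"gpus","type":"SCALAR","scalar":{"value":<device count>}}
Go to /home/qboxserver/mesos-agent/current/libexec/mesos and replace the executor:
Keep the original executor: mv mesos-docker-executor mesos-docker-executor.cpp
Download the GPU executor: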

wget http://ogo0b6qe6.bkt.clouddn.com/mesos-docker-executor-2017-11-18
mv mesos-docker-executor-2017-11-18 mesos-docker-executor; chown qboxserver.qboxserver mesos-docker-executor
cp mesos-docker-executor.go mesos-docker-executor

Install nvidia-docker-plugin
cd /home/qboxserver && mkdir nvidia-docker
cd /home/qboxserver/nvidia-docker
wget http://ogo0b6qe6.bkt.clouddn.com/nvidia-docker-plugin.2016-11-22-20-45-30.tar.gz
tar zxf nvidia-docker-plugin.2016-11-22-20-45-30.tar.gz
ln -s 2016-11-22-20-45-30 current
./current/bin/start.sh
Check the GPU device info: curl -s http://localhost:3476/v1.0/gpu/info
Start the mesos-agent service
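
The device count and the index list for nvidia_gpu_devices in the procedure above can be derived from the nvidia-smi -L output instead of being typed by hand. This is a minimal sketch; gpu_count and gpu_device_list are hypothetical helper names, and the list is written without spaces, as in the gpu.sh script later in these notes.

```shell
#!/bin/bash
# Hypothetical helpers for the manual procedure above.

# Reads `nvidia-smi -L` output on stdin and prints the number of GPUs
# (one "GPU <n>: ..." line per device).
gpu_count() {
    grep -c '^GPU [0-9][0-9]*:'
}

# Prints the comma-separated device indices 0..N-1 for nvidia_gpu_devices.
gpu_device_list() {
    local n="$1"
    [ "$n" -ge 1 ] || return 1
    seq -s, 0 $((n - 1))
}

# Possible usage inside the mesos-agent configuration directory:
# n=$(nvidia-smi -L | gpu_count)
# gpu_device_list "$n" > nvidia_gpu_devices
```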
Upgrading the GPU driver (an attempt at installing the driver with apt-get)

apt-get purge 'nvidia*'
add-apt-repository ppa:graphics-drivers
apt-get update
apt-get install nvidia-<version>
reboot

Install the matching cadvisor

cd /home/qboxserver/boots-cadvisor/current/bin && \
mv cadvisor cadvisor.bak && \
wget http://ogo0b6qe6.bkt.clouddn.com/cadvisor && \
chmod +x cadvisor && \
chown qboxserver:qboxserver cadvisor && \
./start.sh

Background reading:
http://www.linuxandubuntu.com/home/how-to-install-latest-nvidia-drivers-in-linux
http://mesos.apache.org/documentation/latest/gpu-support/
https://github.com/NVIDIA/nvidia-docker/wiki

Seven new GPU compute nodes brought online in the xs region
Driver upgrade steps:
Some services hold the GPU and must be stopped before the upgrade:

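  1. service lightdm stop (enabled on some machines, not on others)
    Stop dockerd, nvidia-docker-plugin and boots-cadvisor
  2. Unload the old kernel modules
    modprobe -r nvidia_drm nvidia_uvm nvidia
    If unloading fails, run lsof | grep nvidia to see which process is still using the device, kill that process, and retry.
    When lsmod | grep nvidia prints nothing, the old driver has been fully unloaded and installation can begin.
  3. wget http://us.download.nvidia.com/tesla/396.44/NVIDIA-Linux-x86_64-396.44.run
    sh NVIDIA-Linux-x86_64-396.44.run --silent

    After it finishes:
    Run nvidia-smi to check that the installation succeeded
    Reboot the machine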
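
The "lsmod shows nothing" condition in step 2 can be turned into a guard that runs before the installer. This is a sketch under the assumption that all driver modules have names starting with nvidia; loaded_nvidia_modules and driver_unloaded are hypothetical helper names.

```shell
#!/bin/bash
# Hypothetical helpers; both read `lsmod` output on stdin.

# Prints the names of loaded modules whose name starts with "nvidia".
loaded_nvidia_modules() {
    awk '$1 ~ /^nvidia/ { print $1 }'
}

# Returns 0 when no nvidia module remains loaded.
driver_unloaded() {
    [ -z "$(loaded_nvidia_modules)" ]
}

# Possible usage:
# lsmod | driver_unloaded && sh NVIDIA-Linux-x86_64-396.44.run --silent
```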

Upgrade example:
1. Check the current version

root@xsgpu9:~# nvidia-smi
 NVIDIA-SMI 375.26

2. Check which modules are loaded

root@xsgpu9:~#  lsmod | grep -i nvidia
nvidia_drm             53248  0
nvidia_modeset        790528  1 nvidia_drm
nvidia              11943936  1 nvidia_modeset
drm_kms_helper        143360  2 ast,nvidia_drm
drm                   360448  5 ast,ttm,drm_kms_helper,nvidia_drm

3. Unload the related modules
modprobe -r nvidia_drm nvidia_modeset nvidia

4. Download the new version
root@xsgpu9:~# wget http://us.download.nvidia.com/tesla/396.44/NVIDIA-Linux-x86_64-396.44.run

5. Install the new version
sh NVIDIA-Linux-x86_64-396.44.run --silent

6. Check the new version

 nvidia-smi
| NVIDIA-SMI 396.44                 Driver Version: 396.44                    |
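
To compare the version before and after the upgrade in a script, the value can be pulled out of the banner line. This is a minimal sketch that assumes the banner contains "Driver Version: <number>" as in the output above; driver_version is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: reads nvidia-smi banner text on stdin and prints the
# first value found after "Driver Version:" (nothing if there is none).
driver_version() {
    sed -n 's/.*Driver Version: *\([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Possible usage:
# [ "$(nvidia-smi | driver_version)" = "396.44" ] && echo "upgrade OK"
```

nvidia-smi can also print the value directly with --query-gpu=driver_version --format=csv,noheader, which avoids parsing the banner.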

On xs311 the NVIDIA driver had been installed with apt-get. The command to remove it:
apt-get --purge remove 'nvidia-*'

dora internal compute --> dora internal compute GPU: issue log
root@jjh1569:/var/log# cat /home/qboxserver/mesos-agent/current/conf/mesos-agent/attributes
NETWORK:HOST
Changed to:
NETWORK:HOST;GPU_MODEL:QSV
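Then dockerd and mesos-agent were restarted,
and mesos-agent failed to start.

The mesos-agent failure was caused by a configuration mismatch (on restart mesos-agent tries to recover its previous state, and recovery fails if the agent configuration has changed).
The fix is to delete the work directory, /disk1/mesos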

root@jjh1569:/var/log# cd /home/qboxserver/mesos-agent/current/conf/mesos-agent/
root@jjh1569:/home/qboxserver/mesos-agent/current/conf/mesos-agent# cat work_dir
/disk1/mesos
Then run:
rm -rf /disk1/mesos
root@jjh1569:/var/log# less syslog
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: I1024 18:55:11.627574  9662 slave.cpp:519] Agent resources: cpus(*):7; mem(*):12288; disk(*):445440; ports(*):[10000-20000]
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: I1024 18:55:11.627622  9662 slave.cpp:527] Agent attributes: [ NETWORK=HOST, GPU_MODEL=QSV ]
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: I1024 18:55:11.627645  9662 slave.cpp:532] Agent hostname: 10.20.78.29
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: I1024 18:55:11.630751  9660 state.cpp:57] Recovering state from '/disk1/mesos/meta'
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: Failed to perform recovery: Incompatible agent info detected.

Oct 24 18:55:11 jjh1569 mesos-agent[9615]: Old agent info:
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: attributes {
Oct 24 18:55:11 jjh1569 mesos-agent[9615]:   name: "NETWORK"
Oct 24 18:55:11 jjh1569 mesos-agent[9615]:   type: TEXT
Oct 24 18:55:11 jjh1569 mesos-agent[9615]:   text {
Oct 24 18:55:11 jjh1569 mesos-agent[9615]:     value: "HOST"
Oct 24 18:55:11 jjh1569 mesos-agent[9615]:   }
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: }

Oct 24 18:55:11 jjh1569 mesos-agent[9615]: New agent info:
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: attributes {
Oct 24 18:55:11 jjh1569 mesos-agent[9615]:   name: "NETWORK"
Oct 24 18:55:11 jjh1569 mesos-agent[9615]:   type: TEXT
Oct 24 18:55:11 jjh1569 mesos-agent[9615]:   text {
Oct 24 18:55:11 jjh1569 mesos-agent[9615]:     value: "HOST"
Oct 24 18:55:11 jjh1569 mesos-agent[9615]:   }
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: }
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: attributes {
Oct 24 18:55:11 jjh1569 mesos-agent[9615]:   name: "GPU_MODEL"
Oct 24 18:55:11 jjh1569 mesos-agent[9615]:   type: TEXT
Oct 24 18:55:11 jjh1569 mesos-agent[9615]:   text {
Oct 24 18:55:11 jjh1569 mesos-agent[9615]:     value: "QSV" #the extra part
Oct 24 18:55:11 jjh1569 mesos-agent[9615]:   }
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: }
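
The failure above can be recognised by the "Incompatible agent info detected" line. This is a sketch of a check that could gate the work_dir cleanup. The helper name agent_info_incompatible is hypothetical, and looking only at the last 200 syslog lines is an arbitrary assumption.

```shell
#!/bin/bash
# Hypothetical helper: reads log text on stdin and returns 0 if mesos-agent
# reported that its saved state does not match the new configuration.
agent_info_incompatible() {
    grep -q 'Incompatible agent info detected'
}

# Possible usage (paths taken from these notes):
# if tail -n 200 /var/log/syslog | agent_info_incompatible; then
#     supervisorctl stop mesos-agent
#     rm -rf "$(cat /home/qboxserver/mesos-agent/current/conf/mesos-agent/work_dir)"
#     supervisorctl start mesos-agent
# fi
```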

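Then modify attributes and resources (QSV is a custom GPU type and gpus is the number of GPUs; adjust both to match the machine).
Then restart dockerd and mesos-agent (if startup fails, delete the work dir /disk1/mesos and restart mesos-agent).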
#!/bin/bash
if grep -q QSV /home/qboxserver/mesos-agent/current/conf/mesos-agent/attributes
then echo "QSV already exists"
else
sed -i "s/NETWORK:HOST/NETWORK:HOST;GPU_MODEL:QSV/g" /home/qboxserver/mesos-agent/current/conf/mesos-agent/attributes
fi

# Write /home/qboxserver/mesos-agent/current/conf/mesos-agent/resources (overwrites the existing file)
cat << EOF > /home/qboxserver/mesos-agent/current/conf/mesos-agent/resources
[
{
"name": "cpus",
"type": "SCALAR",
"scalar": {
"value": 7
}
},
{
"name": "mem",
"type": "SCALAR",
"scalar": {
"value": 14336
}
},
{
"name" : "disk",
"type" : "SCALAR",
"scalar" : { "value" : 20480 }
},
{
"name": "ports",
"type": "RANGES",
"ranges": {
"range": [
{
"begin": 10000,
"end": 20000
}
]
}
},
{
"name": "gpus",
"type": "SCALAR",
"scalar": {
"value": 1
}
},
{
"name": "gpuset",
"type": "SET",
"set": {
"item": ["0"]
}
}
]
EOF

GPU plugin deployment script:

root@xs313:~# cat /tmp/gpu.sh
#!/bin/bash
#usage: deploys the GPU-related configuration on dora GPU machines

supervisorctl stop mesos-agent
supervisorctl stop boots-cadvisor
supervisorctl stop dockerd

#install the custom cadvisor

cd /home/qboxserver/boots-cadvisor/current/bin
mv cadvisor cadvisor.bak
wget http://ogo0b6qe6.bkt.clouddn.com/cadvisor
chmod +x cadvisor
chown qboxserver:qboxserver cadvisor

#install the custom mesos-docker-executor

cd /home/qboxserver/mesos-agent/current/libexec/mesos
wget http://ogo0b6qe6.bkt.clouddn.com/mesos-docker-executor-2018-09-10-15-05-00
mv mesos-docker-executor mesos-docker-executor.bak
mv mesos-docker-executor-2018-09-10-15-05-00 mesos-docker-executor
chown qboxserver:qboxserver mesos-docker-executor
chmod +x mesos-docker-executor

#mesos-agent parameters

#Part 1: modify attributes

MODEL=$(nvidia-smi -L | cut -d" " -f4 | xargs | cut -d" " -f1)
sed -i "s/NETWORK:HOST/NETWORK:HOST;GPU_MODEL:${MODEL}/g" /home/qboxserver/mesos-agent/current/conf/mesos-agent/attributes
nvidia-smi -L

#Part 2: add isolation
echo "cgroups/devices,gpu/nvidia" > /home/qboxserver/mesos-agent/current/conf/mesos-agent/isolation

#Part 3: add nvidia_gpu_devices
echo "0,1,2,3,4,5,6,7" > /home/qboxserver/mesos-agent/current/conf/mesos-agent/nvidia_gpu_devices

#Part 4: add resources

for i in `seq 2`; do sed -i '$d' /home/qboxserver/mesos-agent/current/conf/mesos-agent/resources ; done
cat << EOF >> /home/qboxserver/mesos-agent/current/conf/mesos-agent/resources
},
{
"name": "gpus",
"type": "SCALAR",
"scalar": {
"value": 8
}
},
{
"name": "gpuset",
"type": "SET",
"set": {
"item": ["0", "1", "2", "3", "4", "5", "6", "7"]
}
}
]
EOF

#install nvidia-docker-plugin

cd /home/qboxserver && mkdir nvidia-docker && cd nvidia-docker
wget http://ogo0b6qe6.bkt.clouddn.com/nvidia-docker-plugin.2016-11-22-20-45-30.tar.gz
tar zxf nvidia-docker-plugin.2016-11-22-20-45-30.tar.gz
ln -s 2016-11-22-20-45-30 current
./current/bin/start.sh

#finally, bring the node online

rm -rf $(cat /home/qboxserver/mesos-agent/current/conf/mesos-agent/work_dir)
supervisorctl start dockerd
supervisorctl start mesos-agent
supervisorctl start boots-cadvisor
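
The script above hard-codes eight GPUs in both the gpus value and the gpuset item list. This is a sketch of how the two entries could be generated for any device count. gpu_resource_entries is a hypothetical helper name, and the output layout simply mirrors the heredoc in Part 4.

```shell
#!/bin/bash
# Hypothetical helper: prints the gpus and gpuset resource entries for N GPUs.
gpu_resource_entries() {
    local n="$1" items="" i
    [ "$n" -ge 1 ] || return 1
    # Build the quoted index list: "0", "1", ..., "N-1"
    for i in $(seq 0 $((n - 1))); do
        items="${items:+$items, }\"$i\""
    done
    cat << EOF
{
"name": "gpus",
"type": "SCALAR",
"scalar": {
"value": $n
}
},
{
"name": "gpuset",
"type": "SET",
"set": {
"item": [$items]
}
}
EOF
}

# Possible usage in place of the Part 4 heredoc, after the two sed deletions:
# n=$(nvidia-smi -L | grep -c '^GPU ')
# { echo '},'; gpu_resource_entries "$n"; echo ']'; } >> /home/qboxserver/mesos-agent/current/conf/mesos-agent/resources
```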

Checking the NVIDIA GPU model
dora currently uses two GPU types, K80 and P4. To check which one a machine has:

nvidia-smi -L
root@xs991:~#  nvidia-smi -L
GPU 0: Tesla P4 (UUID: GPU-50850be7-c49e-4693-e20e-a677d2adeb82)
GPU 1: Tesla P4 (UUID: GPU-22e9fbe2-9170-4548-c301-579b786858b6)
GPU 2: Tesla P4 (UUID: GPU-c8132e0e-c8a4-defc-fea3-01b5c930667e)
GPU 3: Tesla P4 (UUID: GPU-762546f1-0b48-c963-954e-fa74b4f7e76f)
GPU 4: Tesla P4 (UUID: GPU-2fdb3d5e-dd66-1f6d-a814-5265df4fa1f4)
GPU 5: Tesla P4 (UUID: GPU-a4011f72-78c2-ab13-c6b8-3e58e9093773)
GPU 6: Tesla P4 (UUID: GPU-84d2bbd4-c3e0-d7ed-6628-5528878de6ea)
GPU 7: Tesla P4 (UUID: GPU-fa3933c0-3cb3-4e8c-a84a-75342a15cc24)

root@xs313:~# nvidia-smi -L
GPU 0: Tesla K80 (UUID: GPU-a457c419-bcfd-538b-d993-e443d28dcd24)
GPU 1: Tesla K80 (UUID: GPU-07f9795d-3917-b804-a6c5-621e27c239f8)
GPU 2: Tesla K80 (UUID: GPU-78197899-b007-1e74-29a8-3f27958e7d28)
GPU 3: Tesla K80 (UUID: GPU-d594f478-261b-e139-b87f-cf1d7b076f42)
GPU 4: Tesla K80 (UUID: GPU-8df7cf81-e51a-3a88-a4b8-6075d18a9365)
GPU 5: Tesla K80 (UUID: GPU-c9931f33-32c0-da73-aa8f-6109989b129c)
GPU 6: Tesla K80 (UUID: GPU-0830ceaa-f860-b717-67ac-e4e7fec25a26)
GPU 7: Tesla K80 (UUID: GPU-9b509b1c-a186-cf05-8aa3-4ba73aed1eb1)
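
When checking many machines, a per-model count is easier to read than the full listing. This is a minimal sketch that assumes the "GPU <n>: <model> (UUID: ...)" line format shown above; gpu_model_summary is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: reads `nvidia-smi -L` output on stdin and prints
# "<count> <model>" for each distinct model.
gpu_model_summary() {
    sed -n 's/^GPU [0-9][0-9]*: \(.*\) (UUID: .*$/\1/p' \
        | sort | uniq -c \
        | awk '{ n = $1; $1 = ""; sub(/^ /, ""); print n, $0 }'
}

# Possible usage:
# nvidia-smi -L | gpu_model_summary
```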

There are two kinds of graphics hardware: NVIDIA and Intel integrated

root@xsgpu81:~# lspci | grep -i nvidia
04:00.0 3D controller: NVIDIA Corporation Device 1bb3 (rev a1)
05:00.0 3D controller: NVIDIA Corporation Device 1bb3 (rev a1)
08:00.0 3D controller: NVIDIA Corporation Device 1bb3 (rev a1)
09:00.0 3D controller: NVIDIA Corporation Device 1bb3 (rev a1)
84:00.0 3D controller: NVIDIA Corporation Device 1bb3 (rev a1)
85:00.0 3D controller: NVIDIA Corporation Device 1bb3 (rev a1)
88:00.0 3D controller: NVIDIA Corporation Device 1bb3 (rev a1)
89:00.0 3D controller: NVIDIA Corporation Device 1bb3 (rev a1)

qboxserver@jjh1569:~$ lspci | grep -i vga
00:13.0 Non-VGA unclassified device: Intel Corporation Sunrise Point-H Integrated Sensor Hub (rev 31)
07:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 30)