Deploying Docker on a Server to Set Up a Deep Learning Environment

Step 0 Install Nvidia Driver on Server

  • First, make sure the Nvidia GPU driver is installed on the server; the nvidia-smi command must work;
  • Check the server's GPU model: lspci | grep -i nvidia
  • Nvidia driver downloads: https://www.nvidia.com/Download/index.aspx
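
If no driver is installed yet, here is a minimal sketch for an Ubuntu server (the driver version below is only an example; let ubuntu-drivers tell you what your GPU actually needs):

    # List the drivers Ubuntu recommends for the detected GPU
    sudo ubuntu-drivers devices

    # Either let Ubuntu install the recommended driver ...
    sudo ubuntu-drivers autoinstall
    # ... or pin a specific version explicitly (535 is just an example)
    sudo apt install -y nvidia-driver-535

    # Reboot, then verify the driver is loaded
    sudo reboot
    nvidia-smi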


Step 1 Setting up Docker

  • Install Docker: sudo apt install docker.io
  • Start Docker: sudo systemctl start docker
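
Optionally, enable Docker at boot and run a quick smoke test (a minimal sketch; hello-world is Docker's standard test image):

    # Start Docker now and at every boot
    sudo systemctl enable --now docker

    # Smoke test: pulls and runs the hello-world image
    sudo docker run --rm hello-world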


Step 2 Setting up NVIDIA Container Toolkit

  • Run the following commands in a terminal; they can be copied and pasted as a single block:

    distribution=$(. /etc/os-release; echo $ID$VERSION_ID) \
        && curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
        && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
            sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
            sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
    
  • Update the package index and install the Nvidia Container Toolkit: sudo apt update && sudo apt install -y nvidia-container-toolkit

  • Configure the Docker daemon to recognize the Nvidia Container Runtime: sudo nvidia-ctk runtime configure --runtime=docker (a way to verify the result is sketched after this list);

  • Restart Docker to complete the installation: sudo systemctl restart docker

  • Finally, test that CUDA is reachable from a container: sudo docker run --rm --runtime=nvidia --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi

    • On success, the output looks like this:

        +-----------------------------------------------------------------------------+
        | NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |
        |-------------------------------+----------------------+----------------------+
        | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
        | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
        |                               |                      |               MIG M. |
        |===============================+======================+======================|
        |   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
        | N/A   34C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
        |                               |                      |                  N/A |
        +-------------------------------+----------------------+----------------------+
      
        +-----------------------------------------------------------------------------+
        | Processes:                                                                  |
        |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
        |        ID   ID                                                   Usage      |
        |=============================================================================|
        |  No running processes found                                                 |
        +-----------------------------------------------------------------------------+
      
  • For more details, see the official Nvidia Container Toolkit installation guide: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
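
To double-check what nvidia-ctk actually changed, you can inspect the Docker daemon config; a minimal sketch (the JSON shown is the typical result, the exact contents may differ on your machine):

    # nvidia-ctk writes the nvidia runtime into Docker's daemon config
    cat /etc/docker/daemon.json

    # Typical result:
    # {
    #     "runtimes": {
    #         "nvidia": {
    #             "args": [],
    #             "path": "nvidia-container-runtime"
    #         }
    #     }
    # }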


Step 3 Pull Image

  • Pull an image that ships with PyTorch, CUDA, and cuDNN: docker pull hambaobao/dlb:pytorch2.0.0-cuda11.7-cudnn8-v1.0
  • Alternatively, pull one of the more basic images provided by PyTorch or Nvidia, as sketched below:
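
For example (the tags below are illustrative; check Docker Hub and NGC for the ones you actually need):

    # Official PyTorch image with CUDA 11.7 and cuDNN 8
    docker pull pytorch/pytorch:2.0.0-cuda11.7-cudnn8-devel

    # Nvidia's PyTorch container from NGC
    docker pull nvcr.io/nvidia/pytorch:23.04-py3

    # Or a bare CUDA base image to build on yourself
    docker pull nvidia/cuda:11.7.1-cudnn8-devel-ubuntu20.04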


Step 4 Create a Container

  • Example: docker run -itd --name s87d --hostname s87d --gpus all -v /data0:/data0 -v /data1:/data1 -v /data2:/data2 -v /home/:/home/ -p 2094:2094 -p 4382:4382 hambaobao/dlb:pytorch2.0.0-cuda11.7-cudnn8-v1.0
    • --name: the container's name;
    • --hostname: the hostname inside the container;
    • --gpus: !!!absolutely critical!!! Without this option the container cannot access the GPUs;
    • -v: directory mounts; mounting all the data directories and the home directory is recommended;
    • -p: port mappings, used for ssh and proxy control;
    • the final argument is the image to use.
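
Once the container is up, a typical workflow looks like this (a minimal sketch, reusing the container name s87d from the example above):

    # Open a shell inside the running container
    docker exec -it s87d bash

    # Inside the container, verify the GPUs are visible
    nvidia-smi

    # Stop and restart the container without losing its state
    docker stop s87d
    docker start s87d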