The model was built with TensorFlow. After installing the required versions of TensorFlow and Keras, I found that training was extremely slow: only a sliver of GPU memory was in use, GPU utilization stayed at zero, and TensorFlow reported that it could not find several libraries, as shown below.
2022-06-10 13:06:14.299058: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-10.0/lib64:/usr/local/cuda-10.0/extras/CUPTI/lib64
2022-06-10 13:06:14.299110: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-10.0/lib64:/usr/local/cuda-10.0/extras/CUPTI/lib64
2022-06-10 13:06:14.299155: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-10.0/lib64:/usr/local/cuda-10.0/extras/CUPTI/lib64
2022-06-10 13:06:14.299198: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-10.0/lib64:/usr/local/cuda-10.0/extras/CUPTI/lib64
2022-06-10 13:06:14.299239: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-10.0/lib64:/usr/local/cuda-10.0/extras/CUPTI/lib64
2022-06-10 13:06:14.299281: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-10.0/lib64:/usr/local/cuda-10.0/extras/CUPTI/lib64
2022-06-10 13:06:14.299326: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcudnn.so.7'; dlerror: libcudnn.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-10.0/lib64:/usr/local/cuda-10.0/extras/CUPTI/lib64
2022-06-10 13:06:14.299336: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1663] Cannot dlopen some GPU libraries. Skipping registering GPU devices...
2022-06-10 13:06:14.299421: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
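To confirm which of the libraries named in the log are really missing, the search path can be walked by hand. The following is a minimal sketch, not something TensorFlow ships: `find_lib` is a hypothetical helper name, and the library list is copied from the messages above.

```shell
# Sketch: report where (if anywhere) a shared library can be found on a
# colon-separated search path such as LD_LIBRARY_PATH.
# find_lib is a hypothetical helper, not part of TensorFlow or CUDA.
find_lib() {
    local lib="$1" search_path="$2" dir
    local IFS=':'
    for dir in $search_path; do
        if [ -n "$dir" ] && [ -e "$dir/$lib" ]; then
            printf '%s\n' "$dir/$lib"
            return 0
        fi
    done
    return 1
}

# Library names copied from the log messages above.
for lib in libcudart.so.10.0 libcublas.so.10.0 libcufft.so.10.0 \
           libcurand.so.10.0 libcusolver.so.10.0 libcusparse.so.10.0 \
           libcudnn.so.7; do
    if path=$(find_lib "$lib" "${LD_LIBRARY_PATH:-}"); then
        echo "found:   $path"
    else
        echo "missing: $lib"
    fi
done
```

This only checks the directories listed in LD_LIBRARY_PATH; libraries registered through the system linker cache are not covered by this sketch.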
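Clearly the CUDA and cuDNN versions are wrong: the program looks for its libraries under /usr/local/cuda-10.0/lib64, but CUDA 10.0 was never installed on this machine. After some searching online, I confirmed that it was indeed a CUDA/cuDNN version problem. Each TensorFlow version is tied to specific CUDA versions, which made me like TensorFlow even less.

The machine is shared, though, and it already has CUDA 11.2 and cuDNN 8.4.0 installed, which is a frustrating situation. After browsing the vast Internet, I found that multiple CUDA versions can coexist and be switched freely. The technique is written up below.

Task: on a machine that already has CUDA 11.2 and cuDNN 8.4.0, install CUDA 10.0 and cuDNN 7.4.1 so that the two setups do not interfere with each other and can be switched at will. For choosing matching CUDA and cuDNN versions, refer to this blog post.

Check the existing CUDA environment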
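A minimal sketch of such a check, assuming the toolkits live under /usr/local as they do in this article. `list_cuda_installs` is a hypothetical helper name introduced only for illustration.

```shell
# Sketch: list every cuda-X.Y directory under a prefix (default /usr/local).
# list_cuda_installs is a hypothetical helper, not an NVIDIA tool.
list_cuda_installs() {
    local prefix="${1:-/usr/local}" d
    for d in "$prefix"/cuda-*; do
        [ -d "$d" ] && basename "$d"
    done
    return 0
}

list_cuda_installs /usr/local

# Show which installation the default symlink points at, if it exists.
if [ -L /usr/local/cuda ]; then
    echo "default: $(readlink -f /usr/local/cuda)"
fi

# Show the version of whichever nvcc is first on PATH, if any.
if command -v nvcc >/dev/null 2>&1; then
    nvcc --version | grep release || true
fi
```

On the machine described here, one would expect this to list cuda-11.2 only, with nvcc reporting release 11.2.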

Download the CUDA 10.0 runfile from the official NVIDIA site to the server.

Install CUDA 10.0
Run the following command:
sudo sh cuda_10.0.130_410.48_linux.run
The license agreement is displayed; press q to skip it.

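- Prompt: `Do you accept the previously read EULA?`
- Type `accept` and press Enter to continue.
- Unsupported-configuration warning: `You are attempting to install on an unsupported configuration. Do you wish to continue?`
- Type `y` to continue.
- Prompt to install the graphics driver, which is already installed: `Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 410.48?`
- Type `n` to continue.
- Prompt to install the CUDA toolkit: `Install the CUDA 10.0 Toolkit?`
- Type `y` to start the installation.
- Prompt for the toolkit install location: `Enter Toolkit Location`
- Press Enter to accept the default.
- Prompt to create a symbolic link. One already exists, so answer no to leave the existing CUDA environment untouched: `Do you want to install a symbolic link at /usr/local/cuda?`
- Type `n` to continue.
- Prompt to install the samples; answer yes: `Install the CUDA 10.0 Samples?`
- Type `y` to continue.
- Prompt for the samples location; the default is fine: `Enter CUDA Samples Location`
- Press Enter.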
At this point the installation should be complete. If you instead see the error Error: unsupported compiler: 9.4.0. Use --override to override this check., add the --override option as the message suggests to skip the check.

Run the new command, answering the prompts the same way as above:
sudo sh cuda_10.0.130_410.48_linux.run --override
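The compiler check can be anticipated before launching the installer. The sketch below assumes that GCC 7 is the newest host compiler CUDA 10.0 accepts, which is my reading of NVIDIA's installation guide for that release; `gcc_major` and `check_host_compiler` are hypothetical helper names.

```shell
# Sketch: compare the host gcc major version with the newest one that
# CUDA 10.0 is assumed to accept (GCC 7, per NVIDIA's installation guide).
MAX_GCC_FOR_CUDA_10_0=7   # assumption, adjust for other CUDA releases

# Print the major version of a dotted version string (9.4.0 -> 9).
gcc_major() {
    printf '%s\n' "${1%%.*}"
}

# Return 0 if the given gcc version passes the assumed limit, 1 otherwise.
check_host_compiler() {
    local version="$1"
    if [ "$(gcc_major "$version")" -le "$MAX_GCC_FOR_CUDA_10_0" ]; then
        echo "gcc $version: within the range the installer accepts"
    else
        echo "gcc $version: newer than gcc $MAX_GCC_FOR_CUDA_10_0, expect the unsupported compiler error unless --override is passed"
        return 1
    fi
}

if command -v gcc >/dev/null 2>&1; then
    check_host_compiler "$(gcc -dumpversion)" || true
fi
```

Note that --override only skips the installer's check. Compiling CUDA code with nvcc 10.0 against a newer gcc may still fail, although running a prebuilt TensorFlow does not need nvcc.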
On success, the installer prints a summary.

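On the official site, choose the cuDNN version that matches the installed CUDA toolkit. This article installed CUDA 10.0, so pick cuDNN 7.4.1, which corresponds to TensorFlow 1.14.0, and select Local Installer for Linux x86_64 (Tar).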

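Copy the link to the cuDNN archive and download it with wget, or download it to your own computer and then transfer it to the server.
The downloaded file is named cudnn-10.0-linux-x64-v7.4.1.5.solitairetheme8 and needs to be renamed to cudnn-10.0-linux-x64-v7.4.1.5.tgz: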
mv cudnn-10.0-linux-x64-v7.4.1.5.solitairetheme8 cudnn-10.0-linux-x64-v7.4.1.5.tgz
Extract the cuDNN archive, enter the extracted folder, and copy the files into /usr/local/cuda-10.0.
tar -xvf cudnn-10.0-linux-x64-v7.4.1.5.tgz
cd cuda
sudo cp lib64/* /usr/local/cuda-10.0/lib64/
sudo cp include/* /usr/local/cuda-10.0/include/
sudo chmod a+r /usr/local/cuda-10.0/lib64/*
sudo chmod a+r /usr/local/cuda-10.0/include/*
Check the cuDNN version with the command cat /usr/local/cuda-10.0/include/cudnn.h | grep CUDNN_MAJOR -A2.
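The grep output still has to be read by eye. As a minimal sketch, the three version macros can be combined into a single version string; `cudnn_version` is a hypothetical helper name, and the header path is the one used in this article.

```shell
# Sketch: print the cuDNN version as MAJOR.MINOR.PATCH by reading the
# CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL macros from a header.
# cudnn_version is a hypothetical helper, not part of cuDNN.
cudnn_version() {
    local header="$1"
    awk '
        $1 == "#define" && $2 == "CUDNN_MAJOR"      { major = $3 }
        $1 == "#define" && $2 == "CUDNN_MINOR"      { minor = $3 }
        $1 == "#define" && $2 == "CUDNN_PATCHLEVEL" { patch = $3 }
        END { if (major != "") print major "." minor "." patch }
    ' "$header"
}

header=/usr/local/cuda-10.0/include/cudnn.h
if [ -r "$header" ]; then
    cudnn_version "$header"
fi
```

For cuDNN 8 and later the macros live in cudnn_version.h rather than cudnn.h, so the header path would need to change for the existing 8.4.0 installation.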

Update the symbolic links. If you installed a version other than 7.4.1, remember to adjust the numbers in the commands below.
cd /usr/local/cuda-10.0/lib64/
sudo rm -rf libcudnn.so libcudnn.so.7
sudo ln -s libcudnn.so.7.4.1 libcudnn.so.7
sudo ln -s libcudnn.so.7 libcudnn.so
sudo ldconfig -v
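A quick way to confirm that the link chain ends at the real library file is sketched below. `link_resolves_to` is a hypothetical helper name, and the directory and version number are the ones assumed throughout this article.

```shell
# Sketch: check that a symlink ultimately resolves to the expected file name.
# link_resolves_to is a hypothetical helper, not part of CUDA or cuDNN.
link_resolves_to() {
    local link="$1" expected="$2"
    [ "$(basename "$(readlink -f "$link")")" = "$expected" ]
}

lib_dir=/usr/local/cuda-10.0/lib64
for name in libcudnn.so libcudnn.so.7; do
    if link_resolves_to "$lib_dir/$name" libcudnn.so.7.4.1; then
        echo "ok:    $name -> libcudnn.so.7.4.1"
    else
        echo "check: $name does not resolve to libcudnn.so.7.4.1"
    fi
done
```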
Finally, to avoid affecting the original CUDA environment, run:
source /etc/profile
At this point a second version of CUDA and cuDNN has been "quietly" installed.
However, nvcc -V still reports version 11.2. The next section shows how to actually switch CUDA versions.
#!/usr/bin/env bash
# Copyright (c) 2018 Patrick Hohenecker
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.
# author: Patrick Hohenecker <mail@paho.at>
# version: 2018.1
# date: May 15, 2018
set -e

# ensure that the script has been sourced rather than just executed
if [[ "${BASH_SOURCE[0]}" = "${0}" ]]; then
    echo "Please use 'source' to execute switch-cuda.sh!"
    exit 1
fi

INSTALL_FOLDER="/usr/local"  # the location to look for CUDA installations at
TARGET_VERSION=${1}          # the target CUDA version to switch to (if provided)

# if no version to switch to has been provided, then just print all available CUDA installations
if [[ -z ${TARGET_VERSION} ]]; then
    echo "The following CUDA installations have been found (in '${INSTALL_FOLDER}'):"
    ls -l "${INSTALL_FOLDER}" | egrep -o "cuda-[0-9]+\\.[0-9]+$" | while read -r line; do
        echo "* ${line}"
    done
    set +e
    return
# otherwise, check whether there is an installation of the requested CUDA version
elif [[ ! -d "${INSTALL_FOLDER}/cuda-${TARGET_VERSION}" ]]; then
    echo "No installation of CUDA ${TARGET_VERSION} has been found!"
    set +e
    return
fi

# the path of the installation to use
cuda_path="${INSTALL_FOLDER}/cuda-${TARGET_VERSION}"

# filter out those CUDA entries from the PATH that are not needed anymore
path_elements=(${PATH//:/ })
new_path="${cuda_path}/bin"
for p in "${path_elements[@]}"; do
    if [[ ! ${p} =~ ^${INSTALL_FOLDER}/cuda ]]; then
        new_path="${new_path}:${p}"
    fi
done

# filter out those CUDA entries from the LD_LIBRARY_PATH that are not needed anymore
ld_path_elements=(${LD_LIBRARY_PATH//:/ })
new_ld_path="${cuda_path}/lib64:${cuda_path}/extras/CUPTI/lib64"
for p in "${ld_path_elements[@]}"; do
    if [[ ! ${p} =~ ^${INSTALL_FOLDER}/cuda ]]; then
        new_ld_path="${new_ld_path}:${p}"
    fi
done

# update environment variables
export CUDA_HOME="${cuda_path}"
export CUDA_ROOT="${cuda_path}"
export LD_LIBRARY_PATH="${new_ld_path}"
export PATH="${new_path}"

echo "Switched to CUDA ${TARGET_VERSION}."
set +e
return
Create a file named switch-cuda.sh and paste the code above into it:
vi switch-cuda.sh
source switch-cuda.sh
source switch-cuda.sh 10.0

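Running source switch-cuda.sh without arguments makes the script scan for all installed CUDA versions and list them. To switch, pass the version number you want, for example source switch-cuda.sh 10.0; nvcc -V then reports the switched version as well. Because the script changes the environment only through export statements, which affect the current shell, the CUDA environment reverts to the default 11.2 once the terminal is restarted. Later sessions are unaffected, and there is no need to switch back by hand.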
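The heart of the script is rebuilding PATH and LD_LIBRARY_PATH without the entries of any other CUDA installation. The sketch below isolates that filtering step for illustration; `strip_cuda_entries` is a hypothetical name and a simplified stand-in, not a function from switch-cuda.sh.

```shell
# Sketch: remove every entry that starts with <prefix>/cuda from a
# colon-separated path list, mirroring the filtering loops in the script.
# strip_cuda_entries is a hypothetical helper written for this example.
strip_cuda_entries() {
    local path_value="$1" prefix="${2:-/usr/local}" entry result=""
    local IFS=':'
    for entry in $path_value; do
        case "$entry" in
            "$prefix"/cuda*) ;;                          # drop old CUDA entries
            *) result="${result:+$result:}$entry" ;;     # keep everything else
        esac
    done
    printf '%s\n' "$result"
}

# Example: the CUDA 11.2 entry is dropped, the rest is kept in order.
strip_cuda_entries "/usr/local/cuda-11.2/bin:/usr/bin:/bin"
```

The script then prepends the bin and lib64 directories of the requested version to the filtered lists, so exactly one CUDA installation is visible in the current shell.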
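That is all for today. This article showed how to install multiple versions of CUDA on one machine and introduced a simple way to switch between them.
If this article helped you, please give it a like to encourage the author!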