Environment: Windows 10, Visual Studio 2019, PyCUDA 2019.02
Writing a piece of CUDA C code and passing it to a constructor
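import pycuda.driver as cuda
import pycuda.autoinit  # initializes the CUDA driver and creates a context on the first GPU
from pycuda.compiler import SourceModule
import numpy as np
# 4x4 array of random numbers, converted to 32-bit floats to match the kernel's float type
a = np.random.randn(4,4)
a = a.astype(np.float32)
# allocate device memory of the same size and copy the array to the GPU
a_gpu = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(a_gpu, a)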
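The kernel below takes a `float *`, i.e. 32-bit floats, which is why the array is converted with astype(np.float32) before copying: np.random.randn returns float64 by default. A minimal NumPy-only check (no GPU needed) of how many bytes mem_alloc(a.nbytes) would reserve in each case:

```python
import numpy as np

# np.random.randn produces float64 (8 bytes per element) by default
a64 = np.random.randn(4, 4)
# the CUDA kernel reads float (4 bytes per element), so convert before copying
a32 = a64.astype(np.float32)

# mem_alloc(a.nbytes) reserves exactly this many bytes of device memory
print(a64.dtype, a64.nbytes)  # float64 128
print(a32.dtype, a32.nbytes)  # float32 64
```

If the float64 array were copied instead, the kernel would reinterpret its raw 8-byte values as 4-byte floats and produce meaningless numbers.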
Running a kernel function
Next, we write code that multiplies every value of the array stored in the device memory a_gpu by 2. To do this, we write a piece of CUDA C code and pass it to a constructor, here pycuda.compiler.SourceModule:
mod = SourceModule("""
__global__ void doublify(float *a)
{
    // one thread per element: row-major index into the 4x4 array
    int idx = threadIdx.x + threadIdx.y*4;
    a[idx] *= 2;
}
""")
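With block=(4,4,1) the launch creates 16 threads, and each one computes idx = threadIdx.x + threadIdx.y*4. Because NumPy stores the 4x4 array in row-major (C) order, idx is the flat offset of element a[threadIdx.y, threadIdx.x]. The CPU-only sketch below emulates that mapping with plain loops (emulate_doublify is a hypothetical name, not a PyCUDA API) to show that every element is touched exactly once:

```python
import numpy as np

def emulate_doublify(a, block=(4, 4, 1)):
    """Hypothetical CPU emulation of the doublify kernel for a single block."""
    flat = a.ravel().copy()          # flat row-major data, like the float* on the GPU
    visited = []
    for ty in range(block[1]):       # plays the role of threadIdx.y
        for tx in range(block[0]):   # plays the role of threadIdx.x
            idx = tx + ty * 4        # same index formula as the kernel
            flat[idx] *= 2
            visited.append(idx)
    return flat.reshape(a.shape), visited

a = np.arange(16, dtype=np.float32).reshape(4, 4)
doubled, visited = emulate_doublify(a)
print(sorted(visited) == list(range(16)))  # True: each index is hit once
print(np.array_equal(doubled, a * 2))      # True
```

On the GPU the 16 threads run in parallel rather than in this loop order, which is safe here because each thread writes a different element. The row width 4 is hard-coded, so the kernel only covers a 4x4 array launched as a single 4x4 block.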
pycuda CompileError: nvcc compilation failed
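If this step fails with CompileError: nvcc compilation of C:\Users\KANGNI~1\AppData\Local\Temp\tmpbregy28e\kernel.cu failed,
the cause is probably the environment configuration: on Windows, nvcc needs the MSVC host compiler cl.exe, which is often not on PATH. Try adding the following snippet to your code: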
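import os
# MSVC host compiler directory; adjust to your own Visual Studio install and MSVC version
_path = r"D:\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.25.28610\bin\Hostx64\x64"
# os.system returns a nonzero exit status if cl.exe cannot be run
if os.system("cl.exe"):
    os.environ['PATH'] += ';' + _path
if os.system("cl.exe"):
    raise RuntimeError("cl.exe still not found, path probably incorrect")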
mod = SourceModule("""
__global__ void doublify(float *a)
{
    int idx = threadIdx.x + threadIdx.y*4;
    a[idx] *= 2;
}
""")
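An alternative to probing with os.system("cl.exe"), which actually launches the compiler, is to look the executable up with the standard-library function shutil.which. The sketch below follows the same idea as the snippet above; the helper name ensure_on_path is hypothetical, and the MSVC directory must be adjusted to your own Visual Studio installation:

```python
import os
import shutil

def ensure_on_path(exe, candidate_dir):
    """Hypothetical helper: append candidate_dir to PATH if exe is not already found."""
    if shutil.which(exe) is None:
        # os.pathsep is ';' on Windows, matching the snippet above
        os.environ["PATH"] = os.environ.get("PATH", "") + os.pathsep + candidate_dir
    found = shutil.which(exe)
    if found is None:
        raise RuntimeError(f"{exe} still not found, path probably incorrect")
    return found

# Example on Windows (path from the snippet above; adjust to your install):
# ensure_on_path("cl.exe", r"D:\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.25.28610\bin\Hostx64\x64")
```

shutil.which reads PATH at call time, so the second lookup sees the directory that was just appended.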
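If the SourceModule call raises no error, the code has been compiled successfully and loaded onto the GPU. We can then use mod.get_function to obtain a reference to the kernel as a pycuda.driver.Function and call it, passing the device array a_gpu as the argument and setting the block size to 4x4: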
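func = mod.get_function("doublify")
# launch one block of 4x4x1 = 16 threads, one per array element
func(a_gpu, block=(4,4,1))
# copy the result back from the GPU into a new host array
a_doubled = np.empty_like(a)
cuda.memcpy_dtoh(a_doubled, a_gpu)
print(a_doubled)
print(a)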
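Output (first a_doubled, then a). In a successful run every element of a_doubled is exactly twice the corresponding element of a; the two arrays below are identical, so they do not reflect the expected doubling: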
[[ 1.4142118 1.48865 3.0958736 -0.64879215]
[ 0.5829473 1.684329 0.21461935 -2.3383026 ]
[ 1.2626396 -0.7566854 1.8427325 -0.52471924]
[-0.80164 1.3886924 -3.5787368 0.72956717]]
[[ 1.4142118 1.48865 3.0958736 -0.64879215]
[ 0.5829473 1.684329 0.21461935 -2.3383026 ]
[ 1.2626396 -0.7566854 1.8427325 -0.52471924]
[-0.80164 1.3886924 -3.5787368 0.72956717]]
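Rather than comparing the printed arrays by eye, the doubling can be checked programmatically on the CPU. A minimal NumPy-only sketch (check_doubled is a hypothetical helper name; in the script above it would be called as check_doubled(a, a_doubled)):

```python
import numpy as np

def check_doubled(original, result):
    """Hypothetical helper: True if every element of result is twice original."""
    # Multiplying a float32 by 2 is exact (barring overflow), so GPU and CPU
    # results should match; allclose still gives a tolerant comparison.
    return bool(np.allclose(result, original * 2))

a = np.arange(1, 17, dtype=np.float32).reshape(4, 4)
print(check_doubled(a, a * 2))  # True
print(check_doubled(a, a))      # False
```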
ref: http://www.voidcn.com/article/p-zoelqetb-bvv.html