GPU Configuration in a Cloud Environment

One of the scientific-computing focus areas for the Ocata cycle is to further advance the state of GPU support in OpenStack. The first question is what we actually mean when we talk about GPU support, because with existing OpenStack features (for example, Nova's PCI passthrough support) there are already several possibilities and combinations that let deployers put together a GPU-capable cloud. The idea here is to understand as many of these possibilities as we can, while digging into the details of the community's experience supporting them.

GPU compute nodes are just like regular compute nodes, except that they contain one or more GPU cards. These cards are configured in such a way that they can be passed through to an instance, which can then use the GPU for compute or accelerated graphics work.

step 1: ENABLE MEMORY MANAGEMENT

To ensure the devices perform well in a virtualised/passthrough environment, we need to enable the IOMMU on the GPU server. The IOMMU (I/O Memory Management Unit) is a feature supported by motherboard chipsets that provides enhanced virtual-to-physical memory mapping, including the ability to map large portions of non-contiguous memory. The IOMMU can be enabled in the motherboard's BIOS; refer to your server vendor for instructions on how to confirm this is set. Once that is done, pass a boot parameter (intel_iommu=on) to the kernel at boot time to ensure it is enabled and turned on within the OS.

Ensure Grub is configured

[root@gpu ~]# grep intel_iommu=on /boot/grub2/grub.cfg | head -n 1
linux16 /vmlinuz-3.10.0-327.18.2.el7.x86_64 root=UUID=33011dab-c75a-45d0-b7a2-ae23545c850f ro quiet rdblacklist=nouveau intel_iommu=on
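
If the parameter is not already present, one common approach (assuming CentOS/RHEL 7 with GRUB2 booting from BIOS, as in this example) is to append intel_iommu=on to the GRUB_CMDLINE_LINUX line in /etc/default/grub, regenerate the config and reboot:

[root@gpu ~]# vi /etc/default/grub     # append intel_iommu=on to GRUB_CMDLINE_LINUX
[root@gpu ~]# grub2-mkconfig -o /boot/grub2/grub.cfg
[root@gpu ~]# reboot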

Then, once the system has booted, verify that the IOMMU is enabled in dmesg

[root@gpu ~]# dmesg | grep -iE "dmar|iommu" | grep -i enabled 
[ 0.000000] Intel-IOMMU: enabled 
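
As an additional sanity check, you can confirm that IOMMU groups have actually been populated (an empty result would suggest the IOMMU is not active):

[root@gpu ~]# find /sys/kernel/iommu_groups/ -type l | head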

step 2: GET THE GPU IDS FOR THE NOVA CONFIGURATION

The first thing we will need to capture is the vendor ID and device ID, from the host system, of the GPUs we want to pass through. In the example below we are using 1x NVIDIA Tesla K80 and 1x NVIDIA GRID K2 (each of these is a dual-GPU card, so each appears as two PCI devices).

Check for the NVIDIA GPUs

[root@gpu ~]# lspci | grep NVIDIA 
04:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1) 
05:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1) 
83:00.0 VGA compatible controller: NVIDIA Corporation GK104GL [GRID K2] (rev a1) 
84:00.0 VGA compatible controller: NVIDIA Corporation GK104GL [GRID K2] (rev a1) 

Get their vendor and device IDs with the -nn flag

[root@gpu ~]# lspci -nn | grep NVIDIA 
04:00.0 3D controller [0302]: NVIDIA Corporation GK210GL [Tesla K80] [10de:102d] (rev a1) 
05:00.0 3D controller [0302]: NVIDIA Corporation GK210GL [Tesla K80] [10de:102d] (rev a1) 
83:00.0 VGA compatible controller [0300]: NVIDIA Corporation GK104GL [GRID K2] [10de:11bf] (rev a1) 
84:00.0 VGA compatible controller [0300]: NVIDIA Corporation GK104GL [GRID K2] [10de:11bf] (rev a1) 

Based on the above we now have:
1) The vendor ID: 10de (which corresponds to NVIDIA)
2) The product IDs: 102d and 11bf (where 102d is the Tesla K80 and 11bf is the GRID K2)

step 3: CONFIGURE THE GPU NOVA SERVICE

step 3.1: CONFIGURE THE COMPUTE NODE(S) TO BE GPU AWARE

The next step is to configure the Nova service on the GPU server with the appropriate passthrough flags:

File: /etc/nova/nova.conf (on the GPU server)

[DEFAULT] 
... 
pci_passthrough_whitelist={"vendor_id":"10de","product_id":"102d"}
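
If you also want to make the GRID K2 cards available for passthrough, my understanding is that the whitelist option can be given multiple times, one entry per device type, so an additional line such as the following (using the K2 IDs captured in step 2) can be added:

pci_passthrough_whitelist={"vendor_id":"10de","product_id":"11bf"}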

Restart the nova service

[root@gpu ~]# systemctl restart openstack-nova-compute 
[root@gpu ~]#

step 3.2: CONFIGURE THE CONTROLLER(S) TO BE GPU AWARE

We then need to add the PCI configuration and resource scheduling parameters to the nova.conf on the controller(s).

In the file: /etc/nova/nova.conf on the controller

[DEFAULT]
... 
pci_alias={"name":"K80_Tesla","vendor_id":"10de","product_id":"102d"} 
scheduler_available_filters=nova.scheduler.filters.all_filters 
scheduler_available_filters=nova.scheduler.filters.pci_passthrough_filter.PciPassthroughFilter 
scheduler_default_filters=RetryFilter,AvailabilityZoneFilter,RamFilter,ComputeFilter,ComputeCapabilitiesFilter,ImagePropertiesFilter,ServerGroupAntiAffinityFilter,ServerGroupAffinityFilter,PciPassthroughFilter 
scheduler_driver=nova.scheduler.filter_scheduler.FilterScheduler 
scheduler_max_attempts=5 
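
Similarly, if the GRID K2 cards are whitelisted on the compute node, a second alias can be defined alongside the first (the alias name below is purely illustrative):

pci_alias={"name":"K2_GRID","vendor_id":"10de","product_id":"11bf"}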

The PCI passthrough filter is configured so that once all of a node's PCI devices have been allocated, that node is no longer chosen for VMs that request a PCI device, which would otherwise cause the VM creation to fail.

Restart all the nova services on the controller(s) to enable the configuration.
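
On an RDO/CentOS-style deployment (assumed here), that typically means something along these lines; adjust the host and service names to match your installation:

[root@controller ~]# systemctl restart openstack-nova-api openstack-nova-scheduler openstack-nova-conductor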

step 4: SETUP THE GPU FLAVOR IN OPENSTACK

Create the flavor, replacing the CPU/RAM values with whatever is appropriate for your environment:

openstack flavor create --public --ram 2048 --disk 20 --vcpus 2 m1.large.2xK80 

Add a passthrough property for the flavor

openstack flavor set m1.large.2xK80 --property pci_passthrough:alias='K80_Tesla:2' 

Then once created you should see something like this:

openstack flavor show m1.large.2xK80   

+----------------------------+--------------------------------------+
| Field                      | Value                                |
+----------------------------+--------------------------------------+
| OS-FLV-DISABLED:disabled   | False                                |
| OS-FLV-EXT-DATA:ephemeral  | 0                                    |
| disk                       | 20                                   |
| id                         | 481b0dd4-1148-4714-8f0b-fceb8aed884f |
| name                       | m1.large.2xK80                       |
| os-flavor-access:is_public | True                                 |
| properties                 | pci_passthrough:alias='K80_Tesla:2'  |
| ram                        | 2048                                 |
| rxtx_factor                | 1.0                                  |
| swap                       |                                      |
| vcpus                      | 2                                    |
+----------------------------+--------------------------------------+

step 5: SPIN UP AN INSTANCE WITH YOUR NEW GPU FLAVOR

Spin up an instance using the new flavor via the CLI or GUI and SSH in.
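
For example, via the CLI (the image, network and key names below are placeholders for your environment):

openstack server create --flavor m1.large.2xK80 --image centos7 --network private --key-name mykey gpu-test

Once logged in, you should hopefully see the passed-through GPU in lspci: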

# yum install pciutils first if this is a bare-bones CentOS 7 image 
[root@gpupass1 ~]# lspci 
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02) 
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] 
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II] 
00:01.2 USB controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] (rev 01) 
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03) 
00:02.0 VGA compatible controller: Cirrus Logic GD 5446 
00:03.0 Ethernet controller: Red Hat, Inc Virtio network device 
00:04.0 SCSI storage controller: Red Hat, Inc Virtio block device 
00:05.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1) # <----- BOOM! 
00:06.0 Unclassified device [00ff]: Red Hat, Inc Virtio memory balloon 
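
At this point the guest still needs the NVIDIA driver installed before the card is usable (the exact procedure depends on your image); once that is done, the device should also be visible to the driver tools:

[root@gpupass1 ~]# nvidia-smi -L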

NOTES ON POSSIBLE ISSUES

Migrating a VM that uses GPU passthrough may run into problems: if the destination host does not support GPU passthrough, the migration will fail.

NEXT STEPS

  • Performance analysis (evaluation of the GPUs in a virtualised environment compared to a bare-metal environment)
  • GPU to GPU performance within a VM
  • GPU to GPU performance across nodes (SR-IOV on Mellanox Fabric)
  • P100 cards to be added to lab environment shortly.

The GPU exclusive hypervisor instance scheduling problem

Regarding scheduling, the problem is that if you allow both GPU and non-GPU flavor instances to use a GPU-enabled compute node, i.e., you don't explicitly set up aggregates that prevent this, then you can end up with your nice, expensive GPU node(s) packed full of regular instances and no way to get any workload onto the 15-30k worth of GPU! You can always use aggregates to stop that from occurring, but then you end up with the reverse problem: you might have 4-8 GPUs you can assign/pass through to VMs per compute node, and the likelihood is that you have different CPU+memory configurations for these flavors, so you'll end up with all GPUs assigned but a reasonable amount of free CPU and memory that could be used for other, non-GPU instances.
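
A minimal sketch of the aggregate-based workaround, assuming the AggregateInstanceExtraSpecsFilter is enabled in scheduler_default_filters; the aggregate, host, flavor and property names below are purely illustrative:

# Aggregate for the GPU nodes
openstack aggregate create --property node_type=gpu gpu-nodes
openstack aggregate add host gpu-nodes gpu-compute-01

# Aggregate for everything else
openstack aggregate create --property node_type=general general-nodes
openstack aggregate add host general-nodes compute-01

# Tag the flavors so each kind of instance only lands on matching hosts
openstack flavor set m1.large.2xK80 --property aggregate_instance_extra_specs:node_type=gpu
openstack flavor set m1.large --property aggregate_instance_extra_specs:node_type=general

Note that every flavor has to carry one of these extra specs for the isolation to work in both directions.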

This is the problem I would like to find a better solution to, and it is why I have previously discussed the concept of scheduler "consumables", that is, an arbitrary way of accounting for things on a compute host. I can imagine a host config or aggregate metadata setting such as consumable_<name>:<value>, with matching flavor settings, and a scheduler filter that maintains a running count of each host's consumables. For example, for the GPU node case, one could simply set a host to consumable_regularflavor=4 and give all non-GPU flavors the metadata regularflavor=1; the scheduler would then allow up to 4 non-GPU instances per GPU node, but no more. This is a rough and rather naive example that ignores edge cases.
