Understanding vSAN in One Slide Deck

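How vSAN relates to ESXi

Why build storage into the hypervisor:

More than 70% of x86 server workloads are virtualized, and the hypervisor is inherently application-aware

It sits directly in the I/O path and has a global view of the underlying storage resources

It is hardware-agnostic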

VM and ESXi

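Virtual SAN is embedded in the vSphere kernel and uses less than 10% of CPU

Provides the shortest I/O path

Integrates seamlessly with vSphere and the VMware product stack

Software-defined storage optimized for virtual machines

Hypervisor-converged architecture

Runs on any standard x86 server

Pools HDD/SSD into a shared datastore

Delivers enterprise-class scalability and performance

Managed through per-VM storage policies

Deeply integrated with the VMware product stack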

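How vSAN places physical data - DiskGroup

Each host contributes cache and capacity to the vSAN distributed datastore, either through flash devices only (all-flash configuration) or through a combination of magnetic disks and flash devices (hybrid configuration). Each host has one to five disk groups. Each disk group contains one cache device and one to seven capacity devices.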
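
The disk group limits above can be written down as a small validation routine. This is a minimal sketch in Python; `DiskGroup`, `validate_host` and the field names are hypothetical and are not part of any vSAN API.

```python
from dataclasses import dataclass
from typing import List

MAX_DISK_GROUPS_PER_HOST = 5   # from the text: one to five disk groups per host
MAX_CAPACITY_DEVICES = 7       # from the text: one to seven capacity devices


@dataclass
class DiskGroup:
    cache_devices: int      # must be exactly one
    capacity_devices: int   # must be between one and seven


def validate_host(disk_groups: List[DiskGroup]) -> List[str]:
    """Return the rule violations; an empty list means the layout is valid."""
    errors = []
    if not 1 <= len(disk_groups) <= MAX_DISK_GROUPS_PER_HOST:
        errors.append(f"host has {len(disk_groups)} disk groups, expected 1-5")
    for i, dg in enumerate(disk_groups):
        if dg.cache_devices != 1:
            errors.append(f"disk group {i}: expected exactly 1 cache device")
        if not 1 <= dg.capacity_devices <= MAX_CAPACITY_DEVICES:
            errors.append(f"disk group {i}: expected 1-7 capacity devices")
    return errors
```

For example, `validate_host([DiskGroup(1, 7)])` returns an empty list, while a disk group with eight capacity devices produces one violation.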

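vSAN supports not only online scale-out of distributed storage but also scale-up. As hosts are added, the vsanDatastore that provides the storage capacity grows online, and overall performance grows linearly as well.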

SSD endurance classes

SSD performance classes

Magnetic disk classes

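In an all-flash configuration, the flash devices in the cache tier are used to buffer writes. No read cache is needed, because the capacity flash devices are already more than fast enough. All-flash vSAN configurations typically use two grades of flash device: lower-capacity, higher-endurance devices for the cache tier, and more cost-effective, higher-capacity, lower-endurance devices for the capacity tier. Writes are performed in the cache tier and then destaged to the capacity tier as needed. This helps maintain performance while extending the life of the lower-endurance flash devices in the capacity tier.

In a hybrid configuration, one flash device and one or more magnetic disks are configured as a disk group. A disk group can have up to seven capacity drives. A vSphere host uses one or more disk groups, depending on the number of flash devices and magnetic disks it contains. The flash device serves as the read cache and write buffer for the vSAN datastore, while the magnetic disks provide its capacity. vSAN uses 70% of the flash capacity as read cache and 30% as write buffer.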
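
As a quick worked example of the 70/30 split in a hybrid disk group, the sketch below computes both portions for a given cache device. The function name and the dictionary keys are made up for illustration.

```python
READ_CACHE_PERCENT = 70    # from the text: 70% of flash is read cache
WRITE_BUFFER_PERCENT = 30  # from the text: 30% of flash is write buffer


def hybrid_cache_split(flash_capacity_gb):
    """Split a hybrid disk group's cache device into read cache and write buffer."""
    read_cache = flash_capacity_gb * READ_CACHE_PERCENT / 100
    write_buffer = flash_capacity_gb - read_cache
    return {"read_cache_gb": read_cache, "write_buffer_gb": write_buffer}


# A 400 GB cache device yields 280 GB of read cache and 120 GB of write buffer.
split = hybrid_cache_split(400)
```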

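What data vSAN stores - object

The most typical storage objects in vSAN are individual VMDKs, the VM home namespace and the VM swap file. If a snapshot has been taken of the virtual machine, a delta disk object is created as well. If the snapshot includes the virtual machine's memory, that is also instantiated as an object.

vSAN is an object datastore: essentially a flat hierarchy made up of objects and containers (folders). The items that make up a virtual machine are represented as objects. These are the most common object types you will see on a vSAN datastore: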

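VM home, which contains the virtual machine configuration files and logs, such as the VMX file

VM swap file

Virtual disk (VMDK)

Delta disk (snapshot)

Performance database

Some other objects are also common on a vSAN datastore, such as the vSAN performance service database, memory snapshot deltas, and VMDKs that belong to iSCSI targets.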

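How vSAN splits an object - RAID tree

Each object consists of one or more components. The number of components that make up an object depends mainly on two factors: the size of the object and the storage policy assigned to it.

The component on host3 is the witness component. vSAN creates it so that, when a network partition separates the two hosts, it can act as the tiebreaker and establish quorum. The witness is placed on a third host.

The vSAN RAID tree

Components are the leaves of an object's RAID tree and are distributed across the hosts in the vSAN cluster. Components are laid out using two main techniques: striping, i.e. RAID 0, and mirroring, i.e. RAID 1. In short, each stripe is a component.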
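
To make the tree shape concrete, here is a minimal Python sketch of a RAID-1 root whose children are RAID-0 stripes, with components as the leaves. The class names, the `build_mirrored_stripes` helper and the round-robin host assignment are assumptions made for illustration; real placement is decided by vSAN and is not modelled here.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Component:
    name: str
    host: str


@dataclass
class RaidNode:
    level: str  # "RAID-0" (stripe) or "RAID-1" (mirror)
    children: List[Union["RaidNode", Component]] = field(default_factory=list)


def leaves(node):
    """Collect the components (leaves) of a RAID tree, left to right."""
    if isinstance(node, Component):
        return [node]
    result = []
    for child in node.children:
        result.extend(leaves(child))
    return result


def build_mirrored_stripes(replicas, stripe_width, hosts):
    """Build a RAID-1 root over RAID-0 stripes; hosts are assigned round-robin."""
    root = RaidNode("RAID-1")
    n = 0
    for r in range(replicas):
        stripe = RaidNode("RAID-0")
        for s in range(stripe_width):
            host = hosts[n % len(hosts)]
            stripe.children.append(Component(f"replica{r}-stripe{s}", host))
            n += 1
        root.children.append(stripe)
    return root


# Two replicas, each striped across two components: four leaves in total.
tree = build_mirrored_stripes(2, 2, ["h1", "h2", "h3", "h4"])
```

The witness is left out of this sketch on purpose, since it holds no data and only matters for quorum.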

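Characteristics of vSAN storage policies

VMware's Storage Policy-Based Management (SPBM) enables precise control of storage services. Like other storage solutions, vSAN provides services such as availability level, capacity consumption and stripe width for performance. A storage policy contains one or more rules that define service levels.
Storage policies can be created and managed with the new vSphere Client, the legacy ("Flex") vSphere Web Client, or through PowerCLI/API. Policies can be assigned to virtual machines and to individual objects such as virtual disks. When application requirements change, storage policies can easily be changed or reassigned. These modifications require no downtime and no migration of virtual machines between datastores. SPBM allows service levels to be assigned and modified precisely on a per-VM basis.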

vSAN software components

Local Log Structured Object Management - LSOM
LSOM works at the physical disk level, both flash devices and magnetic disks. It handles the physical storage for Virtual SAN components on the local disks and the read caching and write buffering for the components.

Distributed Object Manager - DOM

DOM is responsible for the creation of virtual machine storage objects from local components across multiple ESXi hosts in the Virtual SAN cluster by implementing distributed RAID. It is also responsible for providing distributed data access paths to these objects. There are 3 roles within DOM; client, owner and component manager.

Client: Provides access to an object. There may be multiple clients per object depending on access mode.

Owner: Coordinates access to the object, including locking and object configuration and reconfiguration. There is a single DOM owner per object. All object changes and writes go through the owner. Typically the client and owner will reside on the same host, but this is not guaranteed and they may reside on different hosts.

Component Manager: Interface for LSOM and the physical disks. A node’s DOM may play any of the three roles for a single object.

Cluster Level Object Manager - CLOM
CLOM ensures that an object has a configuration that matches its policy, i.e. stripe width or failures to tolerate, to meet the requirements of the virtual machine. Each ESXi host in a Virtual SAN cluster runs an instance of clomd, which is responsible for the policy compliance of the objects. CLOM can be thought of as being responsible for the placement of objects and their components.

Cluster Monitoring, Membership and Directory Services - CMMDS

CMMDS discovers, establishes and maintains a cluster of networked node members. It manages the inventory of items such as nodes, devices and networks, and stores metadata such as policies and the distributed RAID configuration.

Reliable Datagram Transport - RDT

RDT, the reliable datagram transport, is the communication mechanism within Virtual SAN. It uses TCP at the transport layer and is responsible for creating and destroying TCP connections (sockets) on demand.

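The witness mechanism

With RAID-1 and two replicas, if the hosts lose contact with each other there is no way to tell a host failure from a network partition. A third party therefore has to be introduced into the configuration: the witness. For an object in vSAN to be considered available, both of the following conditions must hold:

1. The RAID tree must allow access to the data (RAID-1 needs at least one intact replica; RAID-0 needs every stripe intact). For RAID-5 and RAID-6 configurations, RAID-5 requires 3 of its 4 components to be available, and RAID-6 requires 4 of its 6.

2. In early versions of vSAN, the rule was that more than 50% of the components had to be available. Starting with vSAN 6.0, votes associated with components were introduced, and the rule became that more than 50% of the votes must be available.

In the earlier example, the object can be accessed only when one replica and the witness are reachable together, or when both replicas are reachable (without the witness). This way, when a network partition occurs, at least part of the cluster can still access the object.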
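
The two availability conditions can be sketched as a pair of small functions. This is an illustration only: the tuple-based tree encoding, the function names and the one-vote-per-component assignment in the example are assumptions, and RAID-5/6 is not modelled.

```python
def data_available(node, healthy):
    """Condition 1: does the RAID tree still allow access to the data?

    A node is either a component name (str) or a tuple (level, [children]).
    """
    if isinstance(node, str):
        return node in healthy
    level, children = node
    states = [data_available(child, healthy) for child in children]
    if level == "RAID-1":
        return any(states)   # one intact replica is enough
    if level == "RAID-0":
        return all(states)   # every stripe must be intact
    raise ValueError(f"unsupported RAID level: {level}")


def object_accessible(tree, votes, healthy):
    """Both conditions: data reachable AND more than 50% of the votes reachable."""
    total = sum(votes.values())
    reachable = sum(v for name, v in votes.items() if name in healthy)
    return data_available(tree, healthy) and reachable * 2 > total


# Two replicas plus a witness, one vote each (assumed for this example).
tree = ("RAID-1", ["replica-a", "replica-b"])
votes = {"replica-a": 1, "replica-b": 1, "witness": 1}
```

With this setup, a partition holding one replica and the witness can access the object, while a partition holding only one replica, or only the witness, cannot.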

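A witness is typically about 2 MB in size and holds the object's metadata; when any single node fails, the remaining nodes can continue to provide service. However, after setting up vSAN we often find that there is more than one witness. To see why, start from how witness components are defined. By that definition, witnesses fall into three types:

1. Primary witness: appears only when the number of host nodes does not satisfy the storage policy. For example, FTT=2 requires at least 5 hosts. If the environment has only 4 hosts, a primary witness appears; once the environment has 5 hosts, the primary witness disappears.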

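2. Secondary witness: after a failure, the remaining nodes hold an election to decide which node takes over the active objects of the failed node. Because hosts do not all carry the same number of objects, such an election would be unfair. Secondary witnesses exist to avoid this situation by making the number of objects on each host equal (only the count matters, not the size of the objects). Note that secondary witnesses equalize the component count only among the hosts that already carry components of the object, not across every ESXi host in the cluster, which is why esxi-01 gets no witness component.

3. Tiebreaker witness: after the two steps above, a tiebreaker witness is added if needed so that the total number of objects is odd.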

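In vSAN 6.0 the maximum size of a component is 255 GB, so in this case the VMDK is forcibly split into 2 components, with the extra 1 GB merged in with the metadata, and the whole RAID-1 tree therefore contains 4 components. This layout requires at least 3 nodes and the environment has 4 hosts, so no primary witness appears. Each host carries exactly one component, so no secondary witness appears either. As a result only 1 tiebreaker witness is visible.

Here the VMDK has been split into 3 components. The RAID-0 layout shows 2 components each on esxi60 and esxi80 and only one each on esxi50 and esxi70, so one secondary witness is created on esxi50 and one on esxi70 to equalize the count per host. The total is then 8, so a tiebreaker witness is also created to make the total odd. The number of witnesses visible in this case is 3.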
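
The secondary and tiebreaker rules described above can be checked against both examples with a short calculation. This sketch follows the description in this post only; primary witnesses are not modelled, and `count_witnesses` and its return format are hypothetical.

```python
def count_witnesses(components_per_host):
    """Count secondary and tiebreaker witnesses for one object.

    components_per_host maps host name -> number of data components it holds.
    Only hosts that already hold a component of the object should be listed.
    """
    target = max(components_per_host.values())
    # Secondary witnesses bring every listed host up to the same count.
    secondary = {
        host: target - n for host, n in components_per_host.items() if n < target
    }
    total = sum(components_per_host.values()) + sum(secondary.values())
    # A tiebreaker witness is added when the total is even.
    tiebreaker = 1 if total % 2 == 0 else 0
    return {
        "secondary": secondary,
        "tiebreaker": tiebreaker,
        "witnesses": sum(secondary.values()) + tiebreaker,
    }


first = count_witnesses({"h1": 1, "h2": 1, "h3": 1, "h4": 1})
second = count_witnesses({"esxi50": 1, "esxi60": 2, "esxi70": 1, "esxi80": 2})
```

`first` yields a single tiebreaker witness and `second` yields three witnesses in total, matching the counts given in the two examples. The host names `h1` to `h4` are placeholders, since the first example does not name its hosts.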

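How vSAN improves space utilization

When the free space on a disk drops below 20%, vSAN automatically tries to balance capacity utilization by moving data from that disk to other disks in the vSAN cluster. Balancing disk capacity across the cluster can be harder when there are many large components. vSAN 6.6 splits large components into smaller ones to achieve a better balance and improve efficiency.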
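
The 20% free-space trigger is simple enough to sketch directly. The function below only identifies which disks cross the threshold; choosing what to move and where is not modelled. The function name and the input format are assumptions.

```python
FREE_SPACE_THRESHOLD_PERCENT = 20  # from the text: rebalance below 20% free


def disks_needing_rebalance(disks):
    """Return the names of disks whose free space is below the threshold.

    disks maps disk name -> (used_gb, capacity_gb).
    """
    flagged = []
    for name, (used, capacity) in disks.items():
        free = capacity - used
        # Integer comparison avoids floating-point rounding at the boundary.
        if free * 100 < FREE_SPACE_THRESHOLD_PERCENT * capacity:
            flagged.append(name)
    return sorted(flagged)


flagged = disks_needing_rebalance(
    {"disk1": (850, 1000), "disk2": (500, 1000), "disk3": (800, 1000)}
)
```

Only `disk1` is flagged here: `disk3` has exactly 20% free, which is not below the threshold.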

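Means of improving space utilization: unmap, deduplication and compression

How replicas are rebuilt and consolidated

Smart rebuild: vSAN determines whether it is more efficient to continue building a brand-new replica or to update the existing replica that has come back online, and chooses the more efficient method.

Replica consolidation: if a fault domain already contains a replica of vSAN components and there is no additional capacity to place the replica that needs to be evacuated, vSAN can now consolidate them into a single replica. The smallest replicas are moved first, which reduces both the amount of data to rebuild and the temporary capacity used.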

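Adaptive resynchronization of replicas

vSAN 6.7 introduced adaptive resync, which ensures that virtual machine I/O and resync I/O each get a fair share of resources as I/O changes.

When I/O activity exceeds the sustainable disk group bandwidth, adaptive resync guarantees bandwidth levels for both virtual machine I/O and resync I/O. During periods without contention, either virtual machine I/O or resync I/O may use the additional bandwidth. If no resync operations are running, virtual machine I/O can use 100% of the available disk group bandwidth. During periods of contention, resync I/O is guaranteed 20% of the total bandwidth the disk group can deliver. This further optimizes resource usage.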
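
The bandwidth sharing rule can be sketched as a small allocation function. This assumes the 20% figure is a floor for resync I/O that applies when combined demand exceeds the disk group's bandwidth, and that the remainder goes to virtual machine I/O. The function name and units are made up for illustration, and the real scheduler is certainly more involved.

```python
RESYNC_GUARANTEE_PERCENT = 20  # from the text


def allocate_bandwidth(total, vm_demand, resync_demand):
    """Split disk-group bandwidth between VM I/O and resync I/O.

    All arguments share one unit (for example MB/s).
    """
    resync_floor = total * RESYNC_GUARANTEE_PERCENT / 100
    # Resync gets what it asks for, limited to the larger of its guaranteed
    # floor and whatever VM I/O leaves unused.
    resync = min(resync_demand, max(resync_floor, total - vm_demand))
    # VM I/O gets what it asks for, limited to what resync did not take.
    vm = min(vm_demand, total - resync)
    return {"vm": vm, "resync": resync}
```

With `total=1000`, a VM demand of 900 and a resync demand of 500 give resync its 200 floor and leave 800 for VM I/O. With no resync running, VM I/O can take the full 1000.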

How vSAN detects a degraded device

A degraded drive is determined by measuring the average latency of the drive and detecting excessive latency for an extended period of time. A degraded drive is one where the average write IO round trip latency for four or more latency intervals distributed randomly within approximately a six hour period exceeds predetermined latency thresholds for a drive. The magnetic drive (HDD) latency threshold is 500 milliseconds for write IO. The flash device (SSD) latency threshold for read IO is 50 milliseconds while the IO latency for write IO is 200 milliseconds.

1. Preventative evacuation in progress. A yellow health alert is raised so that administrators know there is an issue. vSAN is proactively compensating for the degraded device by migrating all active components from the degraded drive. No administrator action required.

2. Preventative evacuation is incomplete due to lack of resources, i.e., a partial evacuation of active components. A red health alert is raised to signify a more serious issue. An administrator will need to either free up existing resources, e.g., deleting unused VMs, or add resources so that vSAN can complete the evacuation. This scenario might occur when there is relatively little free capacity remaining in the cluster – yet another reason we strongly recommend keeping 25-30% free “slack space” capacity in the cluster.

3. Preventative evacuation is incomplete due to inaccessible objects. The remaining components on the drive belong to inaccessible objects. An administrator should make more resources available in an attempt to make the object accessible. The other option is to remove the drive from the vSAN configuration by choosing “no data migration” when the drive is decommissioned.

4. Evacuation complete. As you can imagine, this is the most desirable state for a drive that is in a degraded condition. All components have been migrated from the drive and all objects are accessible. It is safe to remove the drive from the vSAN configuration and replace it when convenient to do so.

Source: https://blogs.vmware.com/virtualblocks/2018/05/25/vsan-degraded-device-handling
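
The latency rule quoted at the start of this section can be sketched as a sliding-window check. The thresholds come from the quoted text; the sample format, the per-operation evaluation and the exact six-hour window are simplifying assumptions, and `is_degraded` is a hypothetical name.

```python
# Thresholds in milliseconds, as quoted above. No HDD read threshold is given.
THRESHOLDS_MS = {("hdd", "write"): 500, ("ssd", "read"): 50, ("ssd", "write"): 200}
WINDOW_SECONDS = 6 * 3600  # "approximately a six hour period"
MIN_INTERVALS = 4          # "four or more latency intervals"


def is_degraded(device_type, samples):
    """Check whether a device matches the degraded rule.

    samples is a list of (timestamp_seconds, op, avg_latency_ms) tuples,
    one per latency interval, where op is "read" or "write".
    """
    for op in ("read", "write"):
        limit = THRESHOLDS_MS.get((device_type, op))
        if limit is None:
            continue
        # Timestamps of the intervals that exceeded the threshold.
        times = sorted(t for t, o, lat in samples if o == op and lat > limit)
        # Look for MIN_INTERVALS of them that fit inside one window.
        for i in range(len(times) - MIN_INTERVALS + 1):
            if times[i + MIN_INTERVALS - 1] - times[i] <= WINDOW_SECONDS:
                return True
    return False
```

An HDD with four slow write intervals inside six hours is flagged. The same four intervals spread over twelve hours are not.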

I/O flow - write operation

Guest OS issues write operation to virtual disk

Owner clones write operation

In parallel: sends “prepare” operation to H1 (locally) and H2

H1, H2 persist write operation to Flash (log)

H1, H2 Acknowledge prepare operation to owner

Owner waits for ACK from both ‘prepares’ and completes I/O.

Later, the owner commits a batch of writes to hard disk or flash used as capacity.
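
The write steps above can be sketched as a toy model in Python. `Host`, `Owner` and their methods are hypothetical names for illustration. The thread pool stands in for the parallel "prepare" messages, and nothing here reflects the actual vSAN implementation.

```python
from concurrent.futures import ThreadPoolExecutor


class Host:
    def __init__(self, name):
        self.name = name
        self.flash_log = []  # write buffer (log) on the cache device
        self.capacity = {}   # capacity tier: offset -> data

    def prepare(self, offset, data):
        """Persist the write to the flash log and acknowledge it."""
        self.flash_log.append((offset, data))
        return "ACK"

    def destage(self):
        """Later: move the logged writes to the capacity tier as a batch."""
        for offset, data in self.flash_log:
            self.capacity[offset] = data
        self.flash_log.clear()


class Owner:
    def __init__(self, replicas):
        self.replicas = replicas

    def write(self, offset, data):
        """Clone the write, send "prepare" to every replica, wait for all ACKs."""
        with ThreadPoolExecutor(max_workers=len(self.replicas)) as pool:
            futures = [pool.submit(h.prepare, offset, data) for h in self.replicas]
            acks = [f.result() for f in futures]
        return all(ack == "ACK" for ack in acks)

    def commit_batch(self):
        for host in self.replicas:
            host.destage()


h1, h2 = Host("H1"), Host("H2")
owner = Owner([h1, h2])
completed = owner.write(0, b"data")
```

The point of the sketch is the ordering: the guest's write completes once both replicas have it in their flash log, and the capacity tier is touched only later, by `commit_batch`.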

I/O flow - read operation

The Guest OS issues a read request from disk

Owner chooses which mirror copy to read from. The owner of the storage object will try to load-balance reads across replicas and may not necessarily read from the local replica (if one exists). On Virtual SAN, a block of data is always read from the same mirror, which means that the data block is cached on at most one flash device (SSD); this maximizes the effectiveness of Virtual SAN’s caching

At chosen replica (H2): read data from read cache, if it exists.

Otherwise, we incur a read cache miss, so the data must be read from magnetic disk and placed in the read cache

Return data to owner

Owner completes read operation and returns data to VM
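
The read steps above can be sketched the same way. The key property is that a given block always maps to the same replica, so it is cached on at most one flash device. The modulo mapping in `choose_replica` is an assumption picked for simplicity, not vSAN's actual selection rule, and all names are hypothetical.

```python
class Replica:
    def __init__(self, name, disk):
        self.name = name
        self.disk = disk       # capacity tier: block number -> data
        self.read_cache = {}   # read cache on the flash device
        self.cache_misses = 0

    def read(self, block):
        if block not in self.read_cache:
            # Read cache miss: fetch from disk and place it in the read cache.
            self.cache_misses += 1
            self.read_cache[block] = self.disk[block]
        return self.read_cache[block]


def choose_replica(replicas, block):
    """Deterministic choice, so a block is always read from the same mirror."""
    return replicas[block % len(replicas)]


def owner_read(replicas, block):
    """The owner picks a replica, reads from it, and returns the data."""
    return choose_replica(replicas, block).read(block)


data = {0: "a", 1: "b"}
h1, h2 = Replica("H1", dict(data)), Replica("H2", dict(data))
first = owner_read([h1, h2], 1)
second = owner_read([h1, h2], 1)
```

Reading block 1 twice hits H2 both times: the first read is a cache miss, the second is served from H2's read cache, and H1's cache stays empty.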

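Cache policy and destage cadence: an adaptive algorithm that takes into account the rate of incoming I/O, queues, disk utilization, and optimal batching

Writes destined for the same disk are accumulated into a batch before being destaged

Nothing is destaged while the flash device still has plenty of free space (this avoids repeatedly rewriting the same location on disk)

vSAN writes to the SSD cache with a block size of 4K and destages data to the capacity layer with a block size of 1M; 1M is also the stripe size of the capacity tier.

70% read cache, 30% write buffer;

Data destaged from the write buffer is also kept in the read cache for a while, until it is no longer used (an improvement in newer versions)

HDD: scattered small writes are aggregated and written in bulk to an HDD. SSD: caches hot data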
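
A toy model of the batching idea: 4K writes are buffered per destination disk and destaged once a full 1M batch has accumulated. The real algorithm is adaptive and weighs several signals, so the purely size-based trigger and the `WriteBuffer` class are simplifications assumed for illustration.

```python
from collections import defaultdict

CACHE_BLOCK = 4 * 1024        # 4K writes into the SSD cache
DESTAGE_BLOCK = 1024 * 1024   # 1M destage to the capacity layer


class WriteBuffer:
    def __init__(self):
        self.pending = defaultdict(list)  # capacity disk -> buffered 4K blocks
        self.destaged = []                # (disk, bytes) per destage operation

    def write(self, disk, block_id):
        """Buffer one 4K write; destage once a full 1M batch is waiting."""
        self.pending[disk].append(block_id)
        if len(self.pending[disk]) * CACHE_BLOCK >= DESTAGE_BLOCK:
            self.flush(disk)

    def flush(self, disk):
        blocks = self.pending.pop(disk, [])
        if blocks:
            self.destaged.append((disk, len(blocks) * CACHE_BLOCK))


buf = WriteBuffer()
for i in range(256):   # 256 x 4K = 1M
    buf.write("hdd1", i)
```

After 255 writes nothing has been destaged. The 256th write completes the 1M batch and triggers a single destage to `hdd1`.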


References

vsan-troubleshooting-reference-manual.pdf

vsan-671-administration-guide.pdf

vmware-virtual-san-6.2-performance-with-online-transaction-processing-workloads.pdf

https://blogs.vmware.com/vsphere/2014/04/vmware-virtual-san-witness-component-deployment-logic.html

Storage Workload Characterization and Consolidation in Virtualized Environments http://www.mamicode.com/info-detail-1181990.html

https://storagehub.vmware.com/t/vmware-vsan/vmware-vsan-6-7-technical-overview/object-rebuilding-resynchronization-consolidation-1/
