Understanding Virtualization Technology
Created Wednesday 05 March 2014
- Virtual Machine Monitor, abbreviated VMM
- The biggest advantages of virtualization:
- Running different applications or servers in separate virtual machines avoids interference between programs
- Easier maintenance, lower cost
- Migration keeps services running
- Convenient for research
- Xen use cases (from Xen詳解.pdf)
- Server consolidation: install multiple servers on one physical host, each inside a virtual machine, for demonstrations and fault isolation
- Hardware independence: lets applications and operating systems be tested when ported to new hardware
- Multiple OS configurations: run several operating systems at once for development and testing
- Kernel development: test and debug kernels in a VM sandbox, with no need to set up a separate physical machine just for testing
- Cluster computing: compared with managing each physical host separately, management at the VM level is more flexible and gives better control and isolation for load balancing
- Hardware support for guest operating systems: a new operating system can benefit from the broad hardware support of an existing one, such as Linux
- ISA: Instruction Set Architecture
- When the virtual ISA is identical to the physical ISA, an unmodified operating system can run; when the two differ, the guest OS must be modified at the source or binary level (the identical case assumes all sensitive instructions are privileged instructions)
- Depending on whether small modifications to the guest OS source are required, virtualization divides into:
- Paravirtualization (半虛擬化)
- Full-virtualization (完全虛擬化)
- Event channels are Xen's asynchronous event notification mechanism between a domain and the VMM, and between domains. They are organized as 8 groups of 128 channels each.
- Hypervisor services (hypercalls): a hypercall is to the hypervisor what a system call is to an operating system
- Hardware virtualization: Intel VT (vmx), AMD-V (svm); pass-through: VT-d, IOMMU
- Intel VT introduces a new processor operating mode called VMX (Virtual Machine Extensions)
- Every x86 guest's memory addresses start from 0, so the monitor must remap guest virtual addresses to guest physical addresses. Xen's paravirtualized implementation performs this remapping by modifying the guest's page tables.
- Understanding CPU privilege levels; they matter in two main ways:
- The CPU decides which instructions code may execute based on the privilege level of the current code segment
- A long jump from privilege level 3 into a level-0 segment, or an access to a level-0 data segment, is blocked by the CPU; only the int instruction is an exception
- Sensitive instructions
- Once virtualization is introduced, the Guest OS no longer runs in Ring 0, so instructions that originally had to execute at the highest privilege level cannot run directly and are instead handled by the VMM. These are the sensitive instructions; in theory, executing any of them should produce a trap that the VMM catches and emulates.
- Sensitive instructions include:
- Instructions that attempt to access or modify the virtual machine mode or machine state
- Instructions that attempt to access or modify sensitive registers or memory locations, such as the clock register or interrupt registers
- Instructions that attempt to access the memory-protection system or the memory/address-allocation system
- All I/O instructions
- Virtualization examples:
- Full virtualization
- VMware
- VirtualBox
- Virtual PC
- KVM-x86
- Paravirtualization: initially a way around x86's obstacles to full virtualization, later mainly a way to improve virtualization efficiency
- Xen, KVM-PowerPC
- Full virtualization
- The interface a processor presents to software is a set of instructions (the ISA) and a set of registers; the interface an I/O device presents to software is likewise a set of status and control registers (some devices also have internal storage). Registers that affect the state and behavior of the processor or a device are called critical or privileged resources.
- Instructions that can read or write critical system resources are called sensitive instructions. The great majority of sensitive instructions are privileged instructions, which can execute only at the processor's highest privilege level (kernel mode). On typical RISC processors every sensitive instruction is privileged; x86 is the lone exception.
- The classic virtualization method
The classic method relies on "privilege deprivileging" and "trap-and-emulate": the Guest OS runs at a non-privileged level (deprivileged) while the VMM runs at the highest privilege level (in full control of system resources). Once deprivileged, most Guest OS instructions still run directly on the hardware; only when a privileged instruction is reached does execution trap into the VMM, which emulates it (trap-and-emulate).
From this follow the ISA requirements for virtualizability:
- Multiple privilege levels must be supported
- The results of non-sensitive instructions must not depend on the CPU privilege level
- The CPU must support a protection mechanism, such as an MMU, that can isolate the physical system and other VMs from the currently active VM
- All sensitive instructions must be privileged instructions
- The x86 ISA has more than ten sensitive instructions that are not privileged, so x86 cannot be fully virtualized with the classic technique.
- x86 virtualization approaches
- The full-virtualization camp
- Full virtualization based on binary translation (BT)
- Full virtualization based on scanning and patching (Sun's VirtualBox)
(1) the VMM scans the guest instruction stream and locates the sensitive instructions that need patching
(2) patch instruction blocks are generated dynamically in the VMM; normally each instruction to be patched corresponds to one patch block
(3) each sensitive instruction is replaced by an out-jump from the VM into the VMM, where the dynamically generated patch block is executed
(4) when the patch block finishes, execution jumps back into the VM at the instruction following the patched one
- The OS-assisted paravirtualization camp
- Operating-system-level virtualization
every guest shares the same operating system as the host. This takes away one of the great benefits of virtualization
PV: paravirtualized machine
A VMM is also called a hypervisor
- Intel's and AMD's virtualization technologies
Intel VT-i: Virtualization Technology for Itanium
Intel VT-d: Virtualization Technology for Directed I/O
AMD-V: AMD Virtualization
The basic idea is to introduce a new processor mode and new instructions so that the VMM and the Guest OS run in different modes. The Guest OS runs in a controlled mode in which the formerly troublesome sensitive instructions all trap into the VMM, solving the trap-and-emulate problem for the non-privileged sensitive instructions; moreover, the hardware saves and restores context on mode switches, greatly improving the efficiency of trap-and-emulate context switches.
- MMU
- Maps page numbers to physical frame numbers.
- VMM structures
- Hosted model: OS-hosted VMMs
- Hypervisor model: hypervisor VMMs
- Hybrid model: hybrid VMMs
- Essentials of processor virtualization
- TLB
The TLB is a small, virtually addressed cache in which each line holds a block consisting of a single PTE. Without a TLB, every data fetch would require two memory accesses: one page-table lookup to get the physical address, and one to fetch the data.
TLB: Translation Lookaside Buffer, a cache of page-table entries (virtual-to-physical translation entries).
Also called the "fast table". Because page tables are stored in main memory, walking them is expensive; the TLB exists to avoid that cost.
Addressing in x86 protected mode: segmented logical address -> linear address -> paged address;
paged address = page base address + offset within the page;
the unit of virtual address space is the page; the unit of physical address space is the frame.
x86 systems keep a two-level page table in memory: the first level is the page directory, the second level is the page table.
There is no essential difference between the TLB and the CPU's L1/L2 caches, except that the TLB caches page-table data while the others cache actual data.
- The xend server process manages the system through domain0: xend manages the many virtual hosts and provides consoles into them. Commands are sent to xend by a command-line tool over an HTTP interface.
- xen-3.0.3-142.el5_9.3
- Xen's network architecture
- Xen supports 3 networking modes
- Bridge: the default mode when installing a virtual machine
- Route
- NAT
- Xend startup flow in Bridge mode
- Create the virtual bridge
- Bring down the physical NIC eth0
- Copy the MAC and IP address of the physical eth0 to the virtual NIC veth0
- Rename the physical eth0 to peth0
- Rename veth0 to eth0
- Change peth0's MAC address and disable its ARP
- Attach peth0 and vif0.0 to the bridge xenbr0
- Bring up peth0, vif0.0, and xenbr0
- Xen blktap
- The blktap workflow
- (XEN) In general, when xend starts, the network-bridge script first copies eth0's IP and MAC address to the virtual interface veth0, then renames the real eth0 to peth0 and the virtual veth0 to eth0
- Character devices
- Xend is responsible for managing virtual machines and providing access to their consoles
- /etc/xen/xmexample1 is a simple template configuration file for describing a single VM
- /etc/xen/xmexample2 is a template description intended to be reused for multiple virtual machines.
- For xen
- service xendomains start starts the guests whose configuration files are located in the /etc/xen/auto directory
- the hypervisor automatically calls xendomains to start the guests located in /etc/xen/auto
- in fact, xendomains calls xm to start|stop a Xen guest
- For xen
- network-bridge: This script is called whenever xend is started or stopped to respectively initialize or tear down the xen virtual network.
- When you use file-backed virtual storage you get lower I/O performance
- Migration
- not live: moves a virtual machine from one host to another by pausing it, copying its memory contents, and then resuming it on the destination
- ...
- The continued success of the Xen hypervisor depends substantially on the development of a highly skilled community of developers who can both contribute to the project and use the technology within their own products.
- Xen provides an abstract interface to devices, built on some core communication systems provided by the hypervisor.
- Early versions of Xen only supported paravirtualized guests.
- NAT: Network Address Translation converts private addresses into routable IP addresses. It is widely used across all kinds of Internet access and network types, for a simple reason: NAT not only neatly works around the shortage of IP addresses, it also effectively shields the computers inside the network from outside attack, hiding and protecting them.
- Bridge: like an intelligent repeater. A repeater receives a signal from one network cable, amplifies it, and feeds it into another cable. A bridge can be dedicated hardware, or a computer running bridging software with multiple network adapters (NICs) installed. In network interconnection a bridge receives data, filters by address, and forwards data, implementing data exchange between multiple network systems.
- NUMA:
- Understanding
- Hardware by itself cannot deliver high performance; NUMA-specific issues must be handled to leverage the underlying NUMA hardware. For example, Linux has a NUMA-aware scheduler, and like Linux, Xen must also support NUMA to perform well. Unfortunately, Xen in RHEL supports NUMA in a very limited way: at the hypervisor level it only "knows" that the underlying hardware is NUMA, but has no "policy" for dealing with it. To enable NUMA awareness in Xen you must set the "numa=on" parameter (default off) on the kernel command line. It is better to also set the "loglvl=all" option for more log output, which makes it easier to confirm that NUMA awareness is on. The output looks like below:
nr_cpus : 4
nr_nodes : 1
If nr_nodes >= 2, the underlying machine is NUMA.
- (XEN) Note:
- A guest cannot be given more VCPUs than it was initialized with at creation. If we try, the guest simply gets the number of VCPUs it was created with.
- HVM guests are allocated a certain number of VCPUs at creation time, and that is the number they are stuck with. That is, HVM guests do not support VCPU hot plug/unplug.
- (XEN) Balloon
- Rather than changing the amount of main memory addressed by the guest kernel, the balloon driver takes pages from or gives pages to the guest OS. That is, the balloon driver consumes memory in a guest so that the hypervisor can allocate it elsewhere. The guest kernel still believes the balloon driver holds that memory, unaware that it is probably being used by another guest. If a guest needs more memory, the balloon driver requests memory from the hypervisor and gives it back to the kernel. This comes with a consequence: the guest starts with the maximum amount of memory, maxmem, that it will ever be allowed to use.
- Currently, xend only automatically balloons down Domain0 when there is not enough free memory in the hypervisor to start a new guest. Xend doesn't automatically balloon down Domain0 when there isn't enough free memory to balloon up an existing guest. The current amount of free memory is shown by xm info:
- (XEN)
- The advantage of this PV PCI passthru method is that it has been available for years in Xen, and it doesn't require any special hardware or chipset (it doesn't require hardware IOMMU (VT-d) support from the hardware).
- IOMMU: input/output memory management unit.
- (XEN) Changing the bridged NIC
The host can reach the network, but the virtual machines cannot, because the VMs go out through the host's eth1 and therefore use the xenbr1 bridge by default.
One fix is to edit each VM's configuration file and change the bridge from xenbr1 to xenbr0 in the vif line:
vif = [ "mac=00:16:3e:0f:91:9f,bridge=xenbr1" ]
After the change, shut the VM down and start it again so the configuration file is re-read.
Alternatively, detach the xenbr1 bridge from eth1 and attach it to eth0 instead. The configuration file shows how Xen ties a logical bridge interface to a physical interface.
In /etc/xen/xend-config.sxp the comments say:
#The bridge is named xenbr0, by default. To rename the bridge, use
# (network-script 'network-bridge bridge=<name>')
That is, the default bridge is xenbr0; to use a non-default bridge name, use the bracketed form.
We use eth1 as the NIC, so the configuration is:
(network-script 'network-bridge netdev=eth1')
Why does using eth1 imply xenbr1? Can we make xenbr1 use the eth0 interface instead? Change it to:
(network-script 'network-bridge netdev=eth0 bridge=xenbr1')
then restart the xend service.
(It appears a VM only has networking when pethX + xenbrX exist and the same xenbrX is named in the VM's configuration file.)
- Remember the difference between IOMMU and MMU:
- On some platforms, it is possible to make use of an Input/Output Memory Management Unit (IOMMU). Like an MMU, it maps between a physical and a virtual address space. The difference is the application: whereas an MMU performs this mapping for applications running on the CPU, the IOMMU performs it for devices.
- A block device driver may only be able to perform DMA transfers into the bottom part of physical memory; if the target of a transfer lies above that region, the data must first land in a reachable buffer and then be copied to the correct address, which is very slow.
- My understanding: today's devices have DMA, but can only DMA within the low 4GB of the address space; operations beyond 4GB must go through the CPU, whereas with an IOMMU, DMA can target memory above 4GB.
- Binary Rewriting
- The binary rewriting approach requires that the instruction stream be scanned and that unsafe instructions be rewritten to point to their emulated versions. Performance from this approach is not ideal, particularly when doing anything I/O intensive. In implementation, this is actually very similar to how a debugger works. For a debugger to be useful, it must provide the ability to set breakpoints, which will cause the running process to be interrupted and allow it to be inspected by the user. A virtualization environment that uses this technique does something similar. It inserts breakpoints on any jump and on any unsafe instruction. When it gets to a jump, the instruction stream reader needs to quickly scan the next part for unsafe instructions and mark them. When it reaches an unsafe instruction, it has to emulate it. Modern x86 processors include features that make implementing a debugger easier: particular addresses can be marked, for example, and the debugger automatically activated. These can be used when writing a virtual machine that works in this way. Consider the hypothetical instruction stream in Figure 1.1. Here, two breakpoint registers would be used, DR0 and DR1, with values set to 4 and 8, respectively. When the first breakpoint is reached, the system emulates the privileged instruction, sets the program counter to 5, and resumes. When the second is reached, it scans the jump target and sets the debug registers accordingly. Ideally, it caches these values, so the next time it jumps to the same place it can just load the debug register values.
- This is something VMware does (as far as I understand). To the best of my understanding (I have never seen the VMware source code), the method consists of runtime patching of code that needs to run differently: an existing opcode is replaced with something else, either causing a trap to the hypervisor or jumping to a replacement block of code that "does the right thing". As I understand it, the hypervisor "learns" the code by single-stepping through a block and either applies binary patches or marks the section as "clear" (needing no changes). The next time that code executes, it has already been patched or is clear, so it can run at "full speed".
- Paravirtualization:
- From the perspective of an operating system, the biggest difference is that it no longer runs at the most privileged level and so cannot perform any privileged instructions. In order to provide similar functionality, the hypervisor exposes a set of hypercalls that correspond to the instructions.
- A hypercall is conceptually similar to a system call. On UNIX systems, the convention for invoking the kernel is to raise an interrupt, or invoke a system call instruction if one exists. To issue the exit(0) system call on FreeBSD, for example, you would execute a sequence of instructions similar to that shown in Listing 1.1.
- (MMU): The memory management unit (MMU), sometimes called the paged memory management unit (PMMU), is the piece of computer hardware responsible for handling the CPU's memory access requests. Its functions include translating virtual addresses to physical addresses (i.e. virtual memory management), memory protection, and control of the CPU cache; in simpler computer architectures it also handles bus arbitration and bank switching (especially on 8-bit systems).
- ISA: Instruction Set Architecture
- IVT adds a new mode to the processor, called VMX. A hypervisor can run in VMX root mode while guests run in non-root mode; while the CPU is in VMX non-root mode, it looks normal from the perspective of an unmodified OS. All instructions do what they would be expected to, from the perspective of the guest, and there are no unexpected failures as long as the hypervisor correctly performs the emulation.
A set of extra instructions is added that can be used by a process in VMX root mode. These instructions do things like allocating a memory page on which to store a full copy of the CPU state, and starting and stopping a VM. Finally, a set of bitmaps is defined indicating whether a particular interrupt, instruction, or exception should be passed to the virtual machine's OS running in ring 0 or handled by the hypervisor running in VMX root mode.
In addition to the features of Intel's VT, AMD's Pacifica provides a few extra things linked to the x86-64 extensions and to the Opteron architecture. Current Opterons have an on-die memory controller. Because of the tight integration between the memory controller and the CPU, it is possible for the hypervisor to delegate some of the partitioning to the memory controller.
Using AMD-V, there are two ways in which the hypervisor can handle memory partitioning; in fact, two modes are provided. The first, Shadow Page Tables, allows the hypervisor to trap whenever the guest OS attempts to modify its page tables and change the mapping itself. This is done, in simple terms, by marking the page tables as read only, and catching the resulting fault to the hypervisor, instead of the guest operating system kernel. The second mode is a little more complicated. Nested Page Tables allow a lot of this to be done in hardware.
Nested page tables do exactly what their name implies; they add another layer of indirection to virtual memory. The MMU already handles virtual to physical translations as defined by the OS. Now, these "physical" addresses are translated to real physical addresses using another set of page tables defined by the hypervisor. Because the translation is done in hardware, it is almost as fast as normal virtual memory lookups.
- For an architecture to be virtualizable, Popek and Goldberg require that all sensitive instructions also be privileged instructions. Intuitively, any instruction that changes machine state in a way that can affect other processes must be blocked by the hypervisor.
- Virtualizing memory is comparatively simple: partition memory into regions, trap every privileged instruction that accesses physical memory, and replace it with one that maps into the memory region the VM is allowed to use. A modern CPU includes a memory management unit (MMU), and it is exactly the MMU that implements this translation, based on information supplied by the operating system.
- DMA (direct memory access): a well-known way to improve a system's real-time performance is to add a logic block that responds when events occur and lets the processor handle the information at a more convenient time. This DMA controller typically copies the information delivered to the block into memory (RAM) and lets processed information move automatically from memory to external peripherals, all independently of current CPU activity (see Figure 1). This certainly helps, but its benefit is limited to postponing the inevitable: the CPU still has to process the information at some point. The S12X takes a more radical approach, providing an "intelligent DMA" controller that not only moves the data but performs all the processing directly.
- Now, both Intel and AMD have added a set of instructions that makes virtualization easier on x86. AMD's extensions are known as Pacifica, whereas Intel's extensions are known simply as (Intel) Virtualization Technology (IVT or VT). The idea behind these is to extend the x86 ISA to make up for the shortcomings in the existing instruction set. Conceptually, they can be thought of as adding a "ring -1" above ring 0, allowing the OS to stay where it expects to be and catching attempts to access the hardware directly.
- (XEN)
- xen: This package contains the Xen tools and management daemons needed to run virtual machines on x86, x86_64, and ia64 systems. Information on how to use Xen can be found at the Xen project pages. The Xen system also requires the Xen hypervisor and domain-0 kernel, which can be found in the kernel-xen* package.
- kernel-xen:
- (XEN) Why does Xen separate policy from mechanism, i.e. why does Xen try so hard to minimize hypervisor code? Because a bug in Xen endangers every virtual machine running on it, with very wide impact; keeping the code base small reduces the chance of bugs and increases stability.
- (XEN) Early versions of Xen did a lot more in the hypervisor: network multiplexing, for example. Modern operating systems already include very flexible features for bridging and tunnelling virtual network interfaces, so it makes more sense to use these than implement something new.
- Another advantage of relying on Domain 0 features is ease of administration: since Domain 0 runs a familiar operating system, a BSD or Linux administrator already has a significant amount of time and effort invested in learning about it. Such an administrator can use Xen easily, since she can re-use her existing knowledge.
- (XEN)
- model=e1000: Intel E1000 (82540-family) Ethernet
- model=rtl8139: Realtek 8139 Ethernet
- model=* : Realtek 8139 Ethernet (the default)
- (XEN) xm list -> Times: The Time column is deceptive. Virtual IO (network and block devices) used by domains requires coordination with Domain0, which performs the actual IO. Using this time value to determine relative utilization by domains is thus very suspect, as a high IO workload may show as less utilized than a high CPU workload. Consider yourself warned.
- (XEN) disk
disk = [ "stanza1", "stanza2", ... ]
Each stanza has 3 terms, separated by commas, "backend-dev,frontend-dev,mode".
backend-dev
The device in the backend domain that will be exported to the guest (frontend) domain. Supported formats
include:
phy:device - export the physical device listed. The device can be in symbolic form, as in sda7, or as the
hex major/minor number, as in 0x301 (which is hda1).
file://path/to/file - export the file listed as a loopback device. This will take care of the loopback
setup before exporting the device.
frontend-dev
How the device should appear in the guest domain. The device can be in symbolic form, as in sda7, or as
the hex major/minor number, as in 0x301 (which is hda1).
mode
The access mode for the device. There are currently 2 valid options, r (read-only), w (read/write).
- (XEN) The output contains a variety of labels created by a user to define the rights and access privileges of certain domains. This is a part of sHype Xen Access Controls, which is another level of security an administrator may use. This is an optional feature of Xen, and in fact must be compiled into a Xen kernel if it is desired.
- DomU: it is normally not allowed to issue any hypercall that directly accesses hardware, although in some situations it may be granted access to one or more devices
- IOMMU/MMU
IOMMU = input/output memory management unit
MMU = memory management unit
The IOMMU is an MMU that connects a DMA-capable I/O bus to the primary storage memory.
Like the CPU's memory management unit, an IOMMU takes care of mapping virtual (device) addresses to physical addresses, and some units also guarantee memory protection against misbehaving devices.
IOMMUs (IO Memory Management Units) are hardware devices that translate device DMA addresses to machine addresses. An isolation-capable IOMMU restricts a device so that it can only access the parts of memory it has been explicitly granted access to.
Think of the DMA controller as a CPU that uses the IOMMU to access main memory, in exactly the way an ordinary CPU uses the MMU.
The IOMMU mainly solves the problem of the I/O bus being too narrow: for example, with 64G of main memory but a 32-bit PCI peripheral bus, a peripheral's DMA controller can only reach the low 0-4G of memory, which is exactly where memory is usually tight, so the IOMMU is used to remap DMA addresses.
The OS controls both the IOMMU and the MMU.
Memory protection from malicious or misbehaving devices: a device cannot read or write memory that has not been explicitly allocated (mapped) for it. The protection rests on the fact that the OS running on the CPU exclusively controls both the MMU and the IOMMU; the devices are physically unable to circumvent or corrupt the configured memory management tables.
- (XEN)
- When an operating system boots, one of the first things it typically does is query the firmware for information about its environment: things like the amount of RAM available, what peripherals are connected, and what the current time is.
A kernel booting in a Xen environment, however, does not have access to the real firmware. Instead, there must be another mechanism. Much of the required information is provided by shared memory pages. There are two of these: the first is mapped into the guest's address space by the domain builder at guest boot time; the second must be explicitly mapped by the guest.
- MMU:
First, two concepts: virtual address and physical address. If the processor has no MMU, or the MMU is not enabled, the memory address issued by the CPU's execution unit is driven directly onto the chip pins and received by the memory chip (called physical memory below, to distinguish it from virtual memory); such an address is a physical address (Physical Address, PA). (Figure 17.6 of the source illustrated the virtual-address path; the figure is not reproduced here.)
- Address ranges, mapping virtual addresses to physical addresses, and paging
On a machine without virtual memory, the virtual address is put directly on the memory bus, and the physical memory word with that address is read or written. With virtual memory, the virtual address is not put on the bus directly; instead it goes to the memory management unit (MMU), which maps it to a physical address.
Most virtual memory systems use a mechanism called paging. The virtual address space is divided into units called pages, and the physical address space is correspondingly divided into page frames; pages and frames must be the same size. In this example we have a machine that generates 32-bit addresses, giving a virtual range of 0~0xFFFFFFFF (4G), but only 256M of physical memory: it can run a 4G program, but cannot load it all into memory at once. The machine must have external storage (such as disk or flash) large enough to hold the 4G program, so that pieces can be brought in when needed. Here the page size is 4K and the frame size is the same, as it must be, because transfers between memory and external storage are always in whole pages. For the 4G virtual space and 256M of physical memory, that makes 1M pages and 64K page frames respectively.
Functions
Mapping linear addresses to physical addresses
Modern multi-user, multi-process operating systems need an MMU to give each user process its own address space. Using the MMU, the OS carves out an address region in which each process may see different contents. For example, Microsoft Windows designates the range 4M-2G as user address space: process A maps its executable at 0x400000 (4M), and process B likewise maps its own executable at 0x400000. When A reads address 0x400000 it sees the RAM contents mapped from A's executable; when B reads the same address it sees B's. That is the address translation the MMU performs in between.
Hardware-enforced memory access authorization
Microprocessors have carried on-chip memory management units (MMUs) for many years, allowing individual software threads to run in hardware-protected address spaces; yet many commercial real-time operating systems do not use the MMU even when the hardware provides it.
When all of an application's threads share a single memory space, any thread can, intentionally or not, corrupt another thread's code, data, or stack; a misbehaving thread can even corrupt kernel code or internal kernel data structures. A stray pointer in one thread can easily crash the whole system, or at least make it behave abnormally.
In terms of safety and reliability, a process-based real-time operating system (RTOS) performs better. To create processes with separate address spaces, the RTOS only needs to build some RAM-based data structures and have the MMU enforce protection on them. The basic idea is to "plug in" a new set of logical addresses at each context switch. The MMU uses the current mapping to translate the logical addresses used in instruction fetches and data accesses into physical memory addresses, and it flags accesses to illegal logical addresses that are not mapped to any physical address.
These processes add the overhead inherent in accessing memory through lookup tables, but the payoff is high: at process boundaries, accidents and errors stop. A defect in a user-interface thread cannot corrupt the code or data of a more critical thread. That complex embedded systems with high reliability and safety requirements still ship on operating systems without memory protection is rather hard to believe.
Using an MMU also makes it possible to selectively map and unmap pages into the logical address space. Physical pages are mapped into logical space to hold the current process's code; the remaining pages are mapped for data. Similarly, physical pages are mapped to hold each thread's stack. The RTOS can easily leave the page of logical addresses following each thread stack unmapped, so that if any thread overflows its allocated stack, a hardware memory-protection fault occurs and the kernel suspends the thread instead of letting it corrupt other important memory in the address space, such as another thread's stack. This adds memory protection not only between threads but within a single address space.
Memory protection (including this kind of stack-overflow detection) is also very effective during application development: with it, a programming error produces an exception that is detected immediately and traced back to the source code. Without it, the error causes subtle, hard-to-track failures. In fact, since RAM in a flat memory model usually sits at physical page zero, even NULL pointer dereferences go undetected.
MMU and CPU
The x86 MMU
Intel's 80386 and later CPUs integrate an MMU, providing a 32-bit, 4G address space.
The x86 MMU offers 4K/2M/4M page modes (depending on the CPU); described here is the 4K paging mechanism used by most operating systems today, with the access-check parts omitted.
Registers involved
a) GDT
b) LDT
c) CR0
d) CR3
e) segment registers
Steps of virtual-to-physical address translation
a) The segment register is used as an index into the GDT or LDT, fetching the corresponding GDT/LDT entry.
Note: segmentation cannot be disabled, even in FLAT mode; saying that FLAT mode does not use segment registers is wrong. Every RAM-addressing instruction carries a default segment assumption, and unless a segment override prefix changes the segment for that instruction, the default segment is used.
Entry format
ENTRY格式
typedef struct
{
UINT16 limit_0_15;
UINT16 base_0_15;
UINT8 base_16_23;
UINT8 accessed : 1;
UINT8 readable : 1;
UINT8 conforming : 1;
UINT8 code_data : 1;
UINT8 app_system : 1;
UINT8 dpl : 2;
UINT8 present : 1;
UINT8 limit_16_19 : 4;
UINT8 unused : 1;
UINT8 always_0 : 1;
UINT8 seg_16_32 : 1;
UINT8 granularity : 1;
UINT8 base_24_31;
} CODE_SEG_DESCRIPTOR,*PCODE_SEG_DESCRIPTOR;
typedef struct
{
UINT16 limit_0_15;
UINT16 base_0_15;
UINT8 base_16_23;
UINT8 accessed : 1;
UINT8 writeable : 1;
UINT8 expanddown : 1;
UINT8 code_data : 1;
UINT8 app_system : 1;
UINT8 dpl : 2;
UINT8 present : 1;
UINT8 limit_16_19 : 4;
UINT8 unused : 1;
UINT8 always_0 : 1;
UINT8 seg_16_32 : 1;
UINT8 granularity : 1;
UINT8 base_24_31;
} DATA_SEG_DESCRIPTOR,*PDATA_SEG_DESCRIPTOR;
There are 4 entry formats in all; shown above are the CODE SEGMENT and DATA SEGMENT formats. In FLAT mode an entry has 0 in base_0_15 and base_16_23, 0xfffff across limit_0_15/limit_16_19, and granularity set to 1, indicating a segment address space running from 0 to 0xFFFFFFFF, the full 4G.
b) Fetch the BASE ADDRESS and LIMIT from the segment entry. The address to be accessed is first access-checked against the segment limit.
c) Add the BASE ADDRESS to the address being accessed, forming the 32-bit virtual address, which is interpreted in the following format:
typedef struct
{
UINT32 offset :12;
UINT32 page_index :10;
UINT32 pdbr_index :10;
} VA,*LPVA;
d) pdbr_index is used as an index into the table at CR3, yielding a data structure defined as follows
typedef struct
{
UINT8 present :1;
UINT8 writable :1;
UINT8 supervisor :1;
UINT8 writethrough:1;
UINT8 cachedisable:1;
UINT8 accessed :1;
UINT8 reserved1 :1;
UINT8 pagesize :1;
UINT8 ignoreed :1;
UINT8 avl :3;
UINT8 ptadr_12_15 :4;
UINT16 ptadr_16_31;
}PDE,*LPPDE;
e) Take the PAGE TABLE address from it, and use page_index as an index to obtain the following data structure
typedef struct
{
UINT8 present :1;
UINT8 writable :1;
UINT8 supervisor :1;
UINT8 writethrough:1;
UINT8 cachedisable:1;
UINT8 accessed :1;
UINT8 dirty :1;
UINT8 pta :1;
UINT8 global :1;
UINT8 avl :3;
UINT8 ptadr_12_15 :4;
UINT16 ptadr_16_31;
}PTE,*LPPTE;
f) The PTE yields the BASE ADDRESS of the page's true physical address, i.e. the high 20 bits of the physical address; adding the offset from the virtual address gives the final physical address.
The ARM MMU
In ARM CPUs the MMU exists as a coprocessor; availability varies by series, so the datasheet must be consulted, but when present it is always coprocessor 15. It provides a 32-bit, 4G address space.
The ARM MMU offers 1K/4K/64K paging modes; described here is the 4K mode operating systems typically use.
The registers involved are all located in coprocessor 15.
ARM address translation involves three kinds of address: the virtual address (VA), the modified virtual address (MVA), and the physical address (PA). With the MMU off, the CPU core, caches, MMU, and peripherals all use physical addresses. With the MMU on, the CPU core emits VAs, which are converted to MVAs for use by the cache and MMU, then converted again to PAs that are finally used to access the actual device.
ARM has no segment registers; it is a truly FLAT-mode CPU. A given address can be interpreted as the following structure:
typedef struct
{
UINT32 offset :12;
UINT32 page_index :8;
UINT32 pdbr_index :12;
} VA,*LPVA;
Take bits 14-31 from MMU register 2; pdbr_index is the index into that table, each entry being 4 bytes, with the structure
typedef struct
{
UINT32 type :2; //always set to 01b
UINT32 writebackcacheable:1;
UINT32 writethroughcacheable:1;
UINT32 ignore :1; //set to 1b always
UINT32 domain :4;
UINT32 reserved :1; //set 0
UINT32 base_addr:22;
} PDE,*LPPDE;
The PDE yields the address of an array of the following structures; use page_index as the index to fetch the entry.
typedef struct
{
UINT32 type :2; //always set to 11b
UINT32 ignore :3; //set to 100b always
UINT32 domain :4;
UINT32 reserved :3; //set 0
UINT32 base_addr:20;
} PTE,*LPPTE;
The base address obtained from the PTE, combined with the offset, forms the physical address.
The remaining bits in the PDE/PTE are used for access control; described here is the all-normal case in which the physical address is assembled successfully.
Differences between ARM and x86 MMU usage
1. x86 always has the concept of segments, while ARM does not (no segment registers).
2. ARM has the concept of a domain, used for access authorization, which x86 lacks. A general-purpose OS that targets both CPUs will usually forgo the use of domains.
- (XEN) Shared info page: its contents live in a structure, and the guest kernel declares a pointer to that structure. During guest boot, the hypervisor first stores the structure's address in the esi register; the guest then reads that register during boot and thereby obtains the page's location.
- Red Hat Enterprise Virtualization (RHEV) is a complete enterprise virtualization management solution for server and desktop virtualization, based on Kernel-based Virtual Machine (KVM) technology.
- In computing, hardware-assisted virtualization is a platform virtualization approach that enables efficient full virtualization using help from hardware capabilities, primarily from the host processors. Full virtualization is used to simulate a complete hardware environment, or virtual machine, in which an unmodified guest operating system (using the same instruction set as the host machine) executes in complete isolation. Hardware-assisted virtualization was added to x86 processors (Intel VT-x or AMD-V) in 2006.
Hardware-assisted virtualization is also known as accelerated virtualization; Xen calls it hardware virtual machine (HVM), and Virtual Iron calls it native virtualization.
- (XEN) Attach a virtual bridge to a NIC:
#/etc/xen/scripts/network-bridge netdev=eth0 bridge=xenbr1 start
- Software virtualization is unsupported by Red Hat Enterprise Linux.
- Term of RHEV-H, RHEV-M, Hyper-V
- 影子頁表和嵌套頁表(概念)
- 硬件虛擬話+1 Intel-EPT, AMD- NPT
NPT: Nested Page Tables
- Instead, targeted modifications are introduced to make it simpler and faster to support multiple guest operating systems. For example, the guest operating system might be modified to use a special hypercall application binary interface (ABI) instead of using certain architectural features that would normally be used. This means that only small changes are typically required in the guest operating systems, but any such changes make it difficult to support closed-source operating systems that are distributed in binary form only, such as Microsoft Windows. As in full virtualization, applications are typically still run unmodified. Figure 1.4 illustrates paravirtualization.
Xen extends this model to device I/O. It exports simplified, generic device interfaces to guest operating systems. This is true of a Xen system even when it uses hardware support for virtualization allowing the guest operating systems to run unmodified. Only device drivers for the generic Xen devices need to be introduced into the system.
- QEMU— Another example of an emulator, but the ways in which QEMU is unlike Bochs are worth noting. QEMU supports two modes of operation. The first is the Full System Emulation mode, which is similar to Bochs in that it emulates a full personal computer (PC), including peripherals. This mode emulates a number of processor architectures, such as x86, x86_64, ARM, SPARC, PowerPC, and MIPS, with reasonable speed using dynamic translation. Using this mode, you can emulate an environment capable of hosting the Microsoft Windows operating systems (including XP) and Linux guests hosted on Linux, Solaris, and FreeBSD platforms.
Additional operating system combinations are also supported. The second mode of operation is known as User Mode Emulation. This mode is only available when executing on a Linux host and allows binaries for a different architecture to be executed. For instance,
binaries compiled for the MIPS architecture can be executed on Linux executing on x86 architectures. Other architectures supported in this mode include SPARC, PowerPC, and ARM, with more in development. Xen relies on the QEMU device model for HVM guests.
- Emulation: every instruction is translated in software; extremely slow
- VMware has a bare-metal product, ESX Server. With VMware Workstation, the hypervisor runs in hosted mode as an application installed on top of a base operating system such as Windows or Linux
- The method of KVM operation is rather interesting. Each guest running on KVM is actually executed in user space of the host system. This approach makes each guest instance (a given guest kernel and its associated guest user space) look like a normal process to the underlying host kernel. Thus KVM has weaker isolation than other approaches we have discussed. With KVM, the well-tuned Linux process scheduler performs the hypervisor task of multiplexing across the virtual machines just as it would multiplex across user processes under normal operation. To accomplish this, KVM has introduced a new mode of execution that is distinct from the typical modes (kernel and user) found on a Linux system. This new mode designed for virtualized guests is aptly called guest mode. Guest mode has its own user and kernel modes. Guest mode is used when performing execution of all non-I/O guest code, and KVM falls back to normal user mode to support I/O operations for virtual guests.
- The paravirtualization implementation known as User-mode Linux (UML) allows a Linux operating system to execute other Linux operating systems in user space
- The Xen hypervisor sits above the physical hardware and presents guest domains with a virtual hardware interface
- What does Intel VT provide?
- With hardware support for virtualization such as Intel's VT-x and AMD's AMD-V extensions, these additional protection rings become less critical. These extensions provide root and nonroot modes that each have rings 0 through 3. The Xen hypervisor can run in root mode while the guest OS runs in nonroot mode in the ring for which it was originally intended.
- Domain0 runs a device driver specific to each actual physical device and then communicates with other guest domains through an asynchronous shared memory transport.
- The physical device driver running in Domain0 or a driver domain is called a backend, and each guest with access to the device runs a generic device frontend driver. The backends provide each frontend with the illusion of a generic device that is dedicated to that domain. Backends also implement the sharing and multiple accesses required to give multiple guest domains the illusion of their own dedicated copy of the device.
- Intel VT
When using Intel VT technology, Xen executes in a new operational state called Virtual Machine Extensions (VMX) root operation mode. The unmodified guest domains execute in the other newly created CPU state, VMX non-root operation mode. Because the DomUs
run in non-root operation mode, they are confined to a subset of operations available to the system hardware. Failure to adhere to the restricted subset of instructions causes a VM exit to occur, along with control returning to Xen.
AMD-V
Xen 3.0 also includes support for the AMD-V processor. One of AMD-V's benefits is a tagged translation lookaside buffer (TLB). Using this tagged TLB, guests get mapped to an address space that can be altogether different from what the VMM sets. The reason it
is called a tagged TLB is that the TLB contains additional information known as address space identifiers (ASIDs). ASIDs ensure that a TLB flush does not need to occur at every context switch.
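The benefit of the tagged TLB can be sketched with a toy model in which each entry is keyed by (ASID, virtual page), so translations for two guests coexist and no flush is needed on a context switch (the TaggedTLB class is invented for illustration):

```python
class TaggedTLB:
    """Entries are keyed by (asid, virtual_page), so a context switch
    just changes the current ASID instead of flushing the whole TLB."""
    def __init__(self):
        self.entries = {}
    def insert(self, asid, vpage, ppage):
        self.entries[(asid, vpage)] = ppage
    def lookup(self, asid, vpage):
        return self.entries.get((asid, vpage))  # miss -> None

tlb = TaggedTLB()
tlb.insert(asid=1, vpage=0x10, ppage=0xAA)  # guest 1
tlb.insert(asid=2, vpage=0x10, ppage=0xBB)  # guest 2, same virtual page
# Switching between the guests needs no flush; both translations coexist:
print(hex(tlb.lookup(1, 0x10)), hex(tlb.lookup(2, 0x10)))  # 0xaa 0xbb
```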
AMD also introduced a new technology to control access to I/O, called the I/O Memory Management Unit (IOMMU), which is analogous to Intel's VT-d technology. The IOMMU manages virtual machine I/O: it limits DMA access to what is valid for the virtual machine, and it allows real devices to be assigned directly to VMs. One way to check whether your processor is AMD-V capable from a Linux system is to look for an svm flag in the output of /proc/cpuinfo. If the flag is present, you likely have an AMD-V processor.
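The /proc/cpuinfo check above can be sketched in Python; this toy parser looks for the vmx or svm flag in a cpuinfo dump (the sample text is made up, and a real check would read /proc/cpuinfo itself):

```python
def hw_virt_support(cpuinfo_text):
    """Return which hardware virtualization flag a /proc/cpuinfo dump shows."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            flags = line.split(":", 1)[1].split()
            if "vmx" in flags:
                return "Intel VT-x"
            if "svm" in flags:
                return "AMD-V"
    return None

# Fabricated sample; on a live system use open("/proc/cpuinfo").read() instead.
sample = "processor : 0\nflags : fpu pae mce svm sse2"
print(hw_virt_support(sample))  # AMD-V
```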
HVM
Intel VT and AMD's AMD-V architectures are fairly similar and conceptually have much in common, but their implementations differ slightly, so it makes sense to provide a common interface layer that abstracts their nuances away. Thus the HVM interface was born. The original code for the HVM layer was implemented by Leendert van Doorn of the IBM Watson Research Center and was contributed to the upstream Xen project. A compatibility listing is located at the Xen Wiki at
http://wiki.xensource.com/xenwiki/HVM_Compatible_Processors.
The HVM layer is implemented using a function call table (hvm_function_table), which contains the functions that are common to both hardware virtualization implementations. These methods, including initialize_guest_resources() and store_cpu_guest_regs(),
are implemented differently for each backend under this unified interface.
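The function-table idea can be sketched as a small dispatch table: one unified operation name backed by per-backend implementations. The Python names below are illustrative stand-ins, not Xen's actual hvm_function_table entries:

```python
# Toy dispatch table in the spirit of hvm_function_table: a unified
# interface with separate VT-x (vmx) and AMD-V (svm) implementations.
def vmx_store_cpu_guest_regs(vcpu):
    return f"vmx: saved regs of vcpu {vcpu}"

def svm_store_cpu_guest_regs(vcpu):
    return f"svm: saved regs of vcpu {vcpu}"

hvm_funcs = {
    "vmx": {"store_cpu_guest_regs": vmx_store_cpu_guest_regs},
    "svm": {"store_cpu_guest_regs": svm_store_cpu_guest_regs},
}

def hvm_call(backend, op, *args):
    """Callers name only the operation; the table picks the implementation."""
    return hvm_funcs[backend][op](*args)

print(hvm_call("vmx", "store_cpu_guest_regs", 0))  # vmx: saved regs of vcpu 0
```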
- Networking Devices
- Xen virtual block devices can be backed by various storage types:
filesystem image or partitioned image: raw, qcow
standard network storage protocols such as NBD, iSCSI, NFS, etc.
[host-a]#ls /sys/bus/
acpi/ i2c/ pci/ pcmcia/ pnp/ serio/ xen/
bluetooth/ ide/ pci_express/ platform/ scsi/ usb/ xen-backend/
[host-a]#ls /sys/bus/xen/drivers/
pcifront
[host-a]#ls /sys/bus/xen-backend/drivers/
tap vbd vif
Xen backend drivers:
blktap, blkbk
netloop, netbk
- loop here means using a regular file to emulate a block device
- vfb: VMs under Xen deliver their display over VNC, so vfb (virtual framebuffer device) configures the system display together with the keyboard/mouse input devices.
- VMX root operation and VMX non-root operation are collectively called the VMX operation modes
- VT-x
With VT-x, instructions involving the GDT, IDT, LDT, TSS, and so on can run normally inside the virtual machine, whereas previously these privileged instructions had to be emulated. The VMM is likewise freed from emulating privileged instructions, which solves both the Ring Aliasing problem (problems caused by software actually running in a different ring than the one it was designed for) and the Ring Compression problem, greatly improving efficiency. Solving Ring Compression also solves the problem of running 64-bit guest operating systems.
To support this two-mode architecture, VT-x defines a data structure called the Virtual-Machine Control Structure (VMCS), which contains a Guest-State Area and a Host-State Area holding the state parameters of the virtual machine and of the host, and provides two operations, VM entry and VM exit, for switching between the virtual machine and the VMM. Through the VM-execution control fields of the VMCS, the user can specify which instructions or events cause a virtual machine in VMX non-root operation to perform a VM exit and hand control back to the VMM. VT-x thus solves both the isolation problem and the performance problem of virtual machines.
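A toy model of the VMCS-driven control transfer might look like this in Python; the field names (guest_state, exec_controls, and so on) are simplified inventions, not the real VMCS encoding:

```python
# Toy VMCS: guest/host state areas plus execution-control fields that
# decide which instructions force a VM exit.
vmcs = {
    "guest_state": {"rip": 0x1000, "cr3": 0x5000},
    "host_state":  {"rip": 0xF000, "cr3": 0x9000},
    "exec_controls": {"exit_on": {"hlt", "rdmsr"}},
}

def vm_entry(vmcs, guest_insns):
    """Run the guest until an instruction listed in the exec controls
    triggers a VM exit, handing control back to the VMM at its host RIP."""
    for insn in guest_insns:
        if insn in vmcs["exec_controls"]["exit_on"]:
            return ("vm_exit", insn, vmcs["host_state"]["rip"])
    return ("guest_done", None, None)

print(vm_entry(vmcs, ["mov", "add", "hlt"]))  # ('vm_exit', 'hlt', 61440)
```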
- IOMMU, VT-d, VT-x, MMU, HVM , PV, DMA, TLB, QEMU
- AMD-V (terminology compared with VT-x)
The data structure that VT-x uses to hold virtual-machine state and control information is called the VMCS; AMD calls it the VMCB
The TLB-entry field that VT-x uses to tag a VM's address space is the VPID; AMD-V calls it the ASID
VT-x: root operation mode and non-root operation mode. AMD-V: guest operation mode and host operation mode
The VMCS/VMCB contains all the information needed to launch and control a virtual machine
The point of guest/non-root mode is that it places the guest operating system in a completely different environment without requiring any changes to its code; privileged instructions executed in this mode can be intercepted by the VMM even when running in Ring 0. In addition, through the various intercept-control fields of the VMCB, the VMM can selectively intercept instructions and events, or set conditional intercepts, so all sensitive privileged and non-privileged instructions remain under its control.
- VT-d
Traditional IOMMUs (I/O memory management units) provide a centralized way to manage all DMA: besides conventional internal DMA, this covers special forms such as AGP GART, TPT, and RDMA over TCP/IP. They distinguish devices by memory address range, which is easy to implement but makes DMA isolation difficult. VT-d therefore uses a redesigned IOMMU architecture that supports multiple DMA protection domains, finally achieving DMA virtualization. This technique is also called DMA Remapping.
The interrupt remapping implemented by VT-d supports all I/O sources, including IOAPICs, and all interrupt types, such as ordinary MSI and extended MSI-X.
VT-d makes many further changes, such as hardware buffering and address translation; through all of these measures, VT-d implements I/O device virtualization at the northbridge chipset level. In the virtualization model, VT-d ultimately shows up as two new ways of virtualizing devices:
Direct I/O device assignment: a physical I/O device is assigned directly to a virtual machine. In this model the driver inside the VM communicates with the hardware directly, with little or no mediation by the VMM. For robustness, hardware virtualization support is required to isolate and protect the hardware resource for the designated VM only, and the hardware must also provide multiple I/O container partitions to serve several VMs at the same time. This model almost entirely eliminates the need to run drivers inside the VMM. The CPU, although not an I/O device in the usual sense, is in fact assigned to virtual machines in this way, with its resources still managed by the VMM.
Using VT-d, virtual machines can replace the traditional device-emulation / extra-device-interface approach with direct I/O device assignment or I/O device sharing, greatly improving virtualized I/O performance.
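The DMA Remapping idea can be sketched as a lookup table that only permits a device to DMA into the address ranges assigned to its VM's protection domain (the IOMMU class and the device names are invented for illustration):

```python
class IOMMU:
    """Toy DMA-remapping table: each device may DMA only into the
    protection domain (address ranges) assigned to its VM."""
    def __init__(self):
        self.domains = {}  # device -> list of (start, end) allowed ranges
    def assign(self, device, start, end):
        self.domains.setdefault(device, []).append((start, end))
    def dma_allowed(self, device, addr):
        return any(s <= addr < e for s, e in self.domains.get(device, []))

iommu = IOMMU()
iommu.assign("nic0", 0x1000, 0x2000)      # nic0 passed through to one VM
print(iommu.dma_allowed("nic0", 0x1800))  # True  (inside its domain)
print(iommu.dma_allowed("nic0", 0x3000))  # False (blocked: DMA isolation)
```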
- The Sun xVM Server network drivers use a similar approach to the disk block driver for handling network packets. On DomU, the pseudo network driver xnf (xen-netfront) gets I/O requests from the network stack and sends them to xnb (xen-netback) on Dom0. The backend network driver xnb on Dom0 forwards packets sent by xnf to the native network driver. Buffer management for packet receiving has a greater impact on network performance than packet transmitting does. On the receiving end, data is transferred via DMA into the native driver's receive buffer on Dom0. The packet is then copied from the native driver buffer to the VMM buffer, and the VMM buffer is mapped into the DomU kernel address space without another copy of the data.
- Data is transferred via DMA into the native driver, bge, receive buffer ring.
- The xnb driver gets a new buffer from the VMM and copies data from the bge receive ring to the new buffer.
- The xnb driver sends DomU an event through the event channel.
- The xnf driver in DomU receives an interrupt.
- The xnf driver maps an mblk(9S) to the VMM buffer and sends the mblk(9S) to the upper stack.
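The five receive steps above can be sketched as a toy Python walk-through; names like bge_ring and the events list are illustrative stand-ins for the real driver structures and the Xen event channel:

```python
events = []  # stands in for the Xen event channel

def receive_packet(packet):
    bge_ring = [bytes(packet)]   # 1. DMA places the data in the bge receive ring
    vmm_buf = bge_ring.pop()     # 2. xnb copies it into a fresh VMM buffer
    events.append("evtchn")      # 3. xnb notifies DomU over the event channel
    mblk = memoryview(vmm_buf)   # 4./5. xnf takes the interrupt and maps an
    return bytes(mblk)           #       mblk(9S) onto the buffer (no extra copy)

print(receive_packet(b"ping"))  # b'ping'
```

Note the single copy (step 2) followed by a mapping (step 5): that matches the point above that receive-side buffer management, not transmit, dominates network performance.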