Understanding Virtualization Technology
Created Wednesday 05 March 2014
- Virtual Machine Monitor, abbreviated VMM
- The biggest advantages of virtualization:
- Running different applications or servers in separate virtual machines avoids interference between programs
- Easier maintenance, lower cost
- Migration keeps services running
- Convenient for research
- Xen use cases (from Xen詳解.pdf)
- Server consolidation: install multiple servers on one physical host, each inside a virtual machine, for demonstrations and fault isolation
- Hardware independence: lets applications and operating systems be tested when ported to new hardware
- Multiple OS configurations: run several operating systems at once for development and testing
- Kernel development: test and debug kernels in a VM sandbox, with no need to set up a separate physical machine just for testing
- Cluster computing: compared with managing each physical host separately, management at the VM level is more flexible and gives better control and isolation for load balancing
- Hardware support for guest operating systems: a new operating system can benefit from the broad hardware support of an existing one, such as Linux
- ISA: Instruction Set Architecture
- When the virtual ISA is identical to the physical ISA, an unmodified operating system can run; when the two differ, the guest OS must be modified at the source or binary level (the identical case assumes all sensitive instructions are privileged instructions)
- Depending on whether small modifications to the guest OS source are required, virtualization divides into:
- Paravirtualization (半虛擬化)
- Full-virtualization (完全虛擬化)
- Event channels are Xen's asynchronous event notification mechanism between a domain and the VMM, and between domains. They are organized as 8 groups of 128 channels each.
- Hypervisor services (hypercalls): a hypercall is to the hypervisor what a system call is to an operating system
- Hardware virtualization: Intel VT (vmx), AMD-V (svm); pass-through: VT-d, IOMMU
- Intel VT introduces a new processor operating mode called VMX (Virtual Machine Extensions)
- Every x86 guest's memory addresses start from 0, so the monitor must remap guest virtual addresses to guest physical addresses. Xen's paravirtualized implementation performs this remapping by modifying the guest's page tables.
- Understanding CPU privilege levels; they matter in two main ways:
- The CPU decides which instructions code may execute based on the privilege level of the current code segment
- A long jump from privilege level 3 into a level-0 segment, or an access to a level-0 data segment, is blocked by the CPU; only the int instruction is an exception
- Sensitive instructions
- Once virtualization is introduced, the Guest OS no longer runs in Ring 0, so instructions that originally had to execute at the highest privilege level cannot run directly and are instead handled by the VMM. These are the sensitive instructions; in theory, executing any of them should produce a trap that the VMM catches and emulates.
- Sensitive instructions include:
- Instructions that attempt to access or modify the virtual machine mode or machine state
- Instructions that attempt to access or modify sensitive registers or memory locations, such as the clock register or interrupt registers
- Instructions that attempt to access the memory-protection system or the memory/address-allocation system
- All I/O instructions
- Virtualization examples:
- Full virtualization
- VMware
- VirtualBox
- Virtual PC
- KVM-x86
- Paravirtualization: initially a way around x86's obstacles to full virtualization, later mainly a way to improve virtualization efficiency
- Xen, KVM-PowerPC
- Full virtualization
- The interface a processor presents to software is a set of instructions (the ISA) and a set of registers; the interface an I/O device presents to software is likewise a set of status and control registers (some devices also have internal storage). Registers that affect the state and behavior of the processor or a device are called critical or privileged resources.
- Instructions that can read or write critical system resources are called sensitive instructions. The great majority of sensitive instructions are privileged instructions, which can execute only at the processor's highest privilege level (kernel mode). On typical RISC processors every sensitive instruction is privileged; x86 is the lone exception.
- The classic virtualization method
The classic method relies on "privilege deprivileging" and "trap-and-emulate": the Guest OS runs at a non-privileged level (deprivileged) while the VMM runs at the highest privilege level (in full control of system resources). Once deprivileged, most Guest OS instructions still run directly on the hardware; only when a privileged instruction is reached does execution trap into the VMM, which emulates it (trap-and-emulate).
From this follow the ISA requirements for virtualizability:
- Multiple privilege levels must be supported
- The results of non-sensitive instructions must not depend on the CPU privilege level
- The CPU must support a protection mechanism, such as an MMU, that can isolate the physical system and other VMs from the currently active VM
- All sensitive instructions must be privileged instructions
- The x86 ISA has more than ten sensitive instructions that are not privileged, so x86 cannot be fully virtualized with the classic technique.
- x86 virtualization approaches
- The full-virtualization camp
- Full virtualization based on binary translation (BT)
- Full virtualization based on scanning and patching (Sun's VirtualBox)
(1) the VMM scans the guest instruction stream and locates the sensitive instructions that need patching
(2) patch instruction blocks are generated dynamically in the VMM; normally each instruction to be patched corresponds to one patch block
(3) each sensitive instruction is replaced by an out-jump from the VM into the VMM, where the dynamically generated patch block is executed
(4) when the patch block finishes, execution jumps back into the VM at the instruction following the patched one
- The OS-assisted paravirtualization camp
- Operating-system-level virtualization
every guest shares the same operating system as the host. This takes away one of the great benefits of virtualization
PV: paravirtualized machine
A VMM is also called a hypervisor
- Intel's and AMD's virtualization technologies
Intel VT-i: Virtualization Technology for Itanium
Intel VT-d: Virtualization Technology for Directed I/O
AMD-V: AMD Virtualization
The basic idea is to introduce a new processor mode and new instructions so that the VMM and the Guest OS run in different modes. The Guest OS runs in a controlled mode in which the formerly troublesome sensitive instructions all trap into the VMM, solving the trap-and-emulate problem for the non-privileged sensitive instructions; moreover, the hardware saves and restores context on mode switches, greatly improving the efficiency of trap-and-emulate context switches.
- MMU
- Maps page numbers to physical frame numbers.
- VMM structures
- Hosted model: OS-hosted VMMs
- Hypervisor model: hypervisor VMMs
- Hybrid model: hybrid VMMs
- Essentials of processor virtualization
- TLB
The TLB is a small, virtually addressed cache in which each line holds a block consisting of a single PTE. Without a TLB, every data fetch would require two memory accesses: one page-table lookup to get the physical address, and one to fetch the data.
TLB: Translation Lookaside Buffer, a cache of page-table entries (virtual-to-physical translation entries).
Also called the "fast table". Because page tables are stored in main memory, walking them is expensive; the TLB exists to avoid that cost.
Addressing in x86 protected mode: segmented logical address -> linear address -> paged address;
paged address = page base address + offset within the page;
the unit of virtual address space is the page; the unit of physical address space is the frame.
x86 systems keep a two-level page table in memory: the first level is the page directory, the second level is the page table.
There is no essential difference between the TLB and the CPU's L1/L2 caches, except that the TLB caches page-table data while the others cache actual data.
- The xend server process manages the system through domain0: xend manages the many virtual hosts and provides consoles into them. Commands are sent to xend by a command-line tool over an HTTP interface.
- xen-3.0.3-142.el5_9.3
- Xen's network architecture
- Xen supports 3 networking modes
- Bridge: the default mode when installing a virtual machine
- Route
- NAT
- Xend startup flow in Bridge mode
- Create the virtual bridge
- Bring down the physical NIC eth0
- Copy the MAC and IP address of the physical eth0 to the virtual NIC veth0
- Rename the physical eth0 to peth0
- Rename veth0 to eth0
- Change peth0's MAC address and disable its ARP
- Attach peth0 and vif0.0 to the bridge xenbr0
- Bring up peth0, vif0.0, and xenbr0
- Xen blktap
- The blktap workflow
- (XEN) In general, when xend starts, the network-bridge script first copies eth0's IP and MAC address to the virtual interface veth0, then renames the real eth0 to peth0 and the virtual veth0 to eth0
- Character devices
- Xend is responsible for managing virtual machines and providing access to their consoles
- /etc/xen/xmexample1 is a simple template configuration file for describing a single VM
- /etc/xen/xmexample2 is a template description intended to be reused for multiple virtual machines.
- For xen
- service xendomains start starts the guests whose configuration files are located in the /etc/xen/auto directory
- the hypervisor automatically calls xendomains to start the guests located in /etc/xen/auto
- in fact, xendomains calls xm to start|stop a Xen guest
- For xen
- network-bridge: This script is called whenever xend is started or stopped to respectively initialize or tear down the xen virtual network.
- When you use file-backed virtual storage you get lower I/O performance
- Migration
- not live: moves a virtual machine from one host to another by pausing it, copying its memory contents, and then resuming it on the destination
- ...
- The continued success of the Xen hypervisor depends substantially on the development of a highly skilled community of developers who can both contribute to the project and use the technology within their own products.
- Xen provides an abstract interface to devices, built on some core communication systems provided by the hypervisor.
- Early versions of Xen only supported paravirtualized guests.
- NAT: Network Address Translation converts private addresses into routable IP addresses. It is widely used across all kinds of Internet access and network types, for a simple reason: NAT not only neatly works around the shortage of IP addresses, it also effectively shields the computers inside the network from outside attack, hiding and protecting them.
- Bridge: like an intelligent repeater. A repeater receives a signal from one network cable, amplifies it, and feeds it into another cable. A bridge can be dedicated hardware, or a computer running bridging software with multiple network adapters (NICs) installed. In network interconnection a bridge receives data, filters by address, and forwards data, implementing data exchange between multiple network systems.
- NUMA:
- Understanding
- Hardware by itself cannot deliver high performance; NUMA-specific issues must be handled to leverage the underlying NUMA hardware. For example, Linux has a NUMA-aware scheduler, and like Linux, Xen must also support NUMA to perform well. Unfortunately, Xen in RHEL supports NUMA in a very limited way: at the hypervisor level it only "knows" that the underlying hardware is NUMA, but has no "policy" for dealing with it. To enable NUMA awareness in Xen you must set the "numa=on" parameter (default off) on the kernel command line. It is better to also set the "loglvl=all" option for more log output, which makes it easier to confirm that NUMA awareness is on. The output looks like below:
nr_cpus : 4
nr_nodes : 1
If nr_nodes >= 2, the underlying machine is NUMA.
- (XEN) Note:
- A guest cannot be given more VCPUs than it was initialized with at creation. If we try, the guest simply gets the number of VCPUs it was created with.
- HVM guests are allocated a certain number of VCPUs at creation time, and that is the number they are stuck with. That is, HVM guests do not support VCPU hot plug/unplug.
- (XEN) Balloon
- Rather than changing the amount of main memory addressed by the guest kernel, the balloon driver takes pages from or gives pages to the guest OS. That is, the balloon driver consumes memory in a guest so that the hypervisor can allocate it elsewhere. The guest kernel still believes the balloon driver holds that memory, unaware that it is probably being used by another guest. If a guest needs more memory, the balloon driver requests memory from the hypervisor and gives it back to the kernel. This comes with a consequence: the guest starts with the maximum amount of memory, maxmem, that it will ever be allowed to use.
- Currently, xend only automatically balloons down Domain0 when there is not enough free memory in the hypervisor to start a new guest. Xend doesn't automatically balloon down Domain0 when there isn't enough free memory to balloon up an existing guest. The current amount of free memory is shown by xm info:
- (XEN)
- The advantage of this PV PCI passthru method is that it has been available for years in Xen, and it doesn't require any special hardware or chipset (it doesn't require hardware IOMMU (VT-d) support from the hardware).
- IOMMU: input/output memory management unit.
- (XEN) Changing the bridged NIC
The host can reach the network, but the virtual machines cannot, because the VMs go out through the host's eth1 and therefore use the xenbr1 bridge by default.
One fix is to edit each VM's configuration file and change the bridge from xenbr1 to xenbr0 in the vif line:
vif = [ "mac=00:16:3e:0f:91:9f,bridge=xenbr1" ]
After the change, shut the VM down and start it again so the configuration file is re-read.
Alternatively, detach the xenbr1 bridge from eth1 and attach it to eth0 instead. The configuration file shows how Xen ties a logical bridge interface to a physical interface.
In /etc/xen/xend-config.sxp the comments say:
#The bridge is named xenbr0, by default. To rename the bridge, use
# (network-script 'network-bridge bridge=<name>')
That is, the default bridge is xenbr0; to use a non-default bridge name, use the bracketed form.
We use eth1 as the NIC, so the configuration is:
(network-script 'network-bridge netdev=eth1')
Why does using eth1 imply xenbr1? Can we make xenbr1 use the eth0 interface instead? Change it to:
(network-script 'network-bridge netdev=eth0 bridge=xenbr1')
then restart the xend service.
(It appears a VM only has networking when pethX + xenbrX exist and the same xenbrX is named in the VM's configuration file.)
- Remember the difference between IOMMU and MMU:
- On some platforms, it is possible to make use of an Input/Output Memory Management Unit (IOMMU). Like an MMU, it maps between a physical and a virtual address space. The difference is the application: whereas an MMU performs this mapping for applications running on the CPU, the IOMMU performs it for devices.
- A block device driver may only be able to perform DMA transfers into the bottom part of physical memory; if the target of a transfer lies above that region, the data must first land in a reachable buffer and then be copied to the correct address, which is very slow.
- My understanding: today's devices have DMA, but can only DMA within the low 4GB of the address space; operations beyond 4GB must go through the CPU, whereas with an IOMMU, DMA can target memory above 4GB.
- Binary Rewriting
- The binary rewriting approach requires that the instruction stream be scanned and that unsafe instructions be rewritten to point to their emulated versions. Performance from this approach is not ideal, particularly when doing anything I/O intensive. In implementation, this is actually very similar to how a debugger works. For a debugger to be useful, it must provide the ability to set breakpoints, which will cause the running process to be interrupted and allow it to be inspected by the user. A virtualization environment that uses this technique does something similar. It inserts breakpoints on any jump and on any unsafe instruction. When it gets to a jump, the instruction stream reader needs to quickly scan the next part for unsafe instructions and mark them. When it reaches an unsafe instruction, it has to emulate it. Modern x86 processors include features that make implementing a debugger easier: particular addresses can be marked, for example, and the debugger automatically activated. These can be used when writing a virtual machine that works in this way. Consider the hypothetical instruction stream in Figure 1.1. Here, two breakpoint registers would be used, DR0 and DR1, with values set to 4 and 8, respectively. When the first breakpoint is reached, the system emulates the privileged instruction, sets the program counter to 5, and resumes. When the second is reached, it scans the jump target and sets the debug registers accordingly. Ideally, it caches these values, so the next time it jumps to the same place it can just load the debug register values.
- This is something VMware does (as far as I understand). To the best of my understanding (I have never seen the VMware source code), the method consists of runtime patching of code that needs to run differently: an existing opcode is replaced with something else, either causing a trap to the hypervisor or jumping to a replacement block of code that "does the right thing". As I understand it, the hypervisor "learns" the code by single-stepping through a block and either applies binary patches or marks the section as "clear" (needing no changes). The next time that code executes, it has already been patched or is clear, so it can run at "full speed".
- Paravirtualization:
- From the perspective of an operating system, the biggest difference is that it no longer runs at the most privileged level and so cannot perform any privileged instructions. In order to provide similar functionality, the hypervisor exposes a set of hypercalls that correspond to the instructions.
- A hypercall is conceptually similar to a system call. On UNIX systems, the convention for invoking the kernel is to raise an interrupt, or invoke a system call instruction if one exists. To issue the exit(0) system call on FreeBSD, for example, you would execute a sequence of instructions similar to that shown in Listing 1.1.
- (MMU): The memory management unit (MMU), sometimes called the paged memory management unit (PMMU), is the piece of computer hardware responsible for handling the CPU's memory access requests. Its functions include translating virtual addresses to physical addresses (i.e. virtual memory management), memory protection, and control of the CPU cache; in simpler computer architectures it also handles bus arbitration and bank switching (especially on 8-bit systems).
- ISA: Instruction Set Architecture
- IVT adds a new mode to the processor, called VMX. A hypervisor can run in VMX root mode while guests run in non-root mode; while the CPU is in VMX non-root mode, it looks normal from the perspective of an unmodified OS. All instructions do what they would be expected to, from the perspective of the guest, and there are no unexpected failures as long as the hypervisor correctly performs the emulation.
A set of extra instructions is added that can be used by a process in VMX root mode. These instructions do things like allocating a memory page on which to store a full copy of the CPU state, and starting and stopping a VM. Finally, a set of bitmaps is defined indicating whether a particular interrupt, instruction, or exception should be passed to the virtual machine's OS running in ring 0 or handled by the hypervisor running in VMX root mode.
In addition to the features of Intel's VT, AMD's Pacifica provides a few extra things linked to the x86-64 extensions and to the Opteron architecture. Current Opterons have an on-die memory controller. Because of the tight integration between the memory controller and the CPU, it is possible for the hypervisor to delegate some of the partitioning to the memory controller.
Using AMD-V, there are two ways in which the hypervisor can handle memory partitioning; in fact, two modes are provided. The first, Shadow Page Tables, allows the hypervisor to trap whenever the guest OS attempts to modify its page tables and change the mapping itself. This is done, in simple terms, by marking the page tables as read only, and catching the resulting fault to the hypervisor, instead of the guest operating system kernel. The second mode is a little more complicated. Nested Page Tables allow a lot of this to be done in hardware.
Nested page tables do exactly what their name implies; they add another layer of indirection to virtual memory. The MMU already handles virtual to physical translations as defined by the OS. Now, these "physical" addresses are translated to real physical addresses using another set of page tables defined by the hypervisor. Because the translation is done in hardware, it is almost as fast as normal virtual memory lookups.
- For an architecture to be virtualizable, Popek and Goldberg require that all sensitive instructions also be privileged instructions. Intuitively, any instruction that changes machine state in a way that can affect other processes must be blocked by the hypervisor.
- Virtualizing memory is comparatively simple: partition memory into regions, trap every privileged instruction that accesses physical memory, and replace it with one that maps into the memory region the VM is allowed to use. A modern CPU includes a memory management unit (MMU), and it is exactly the MMU that implements this translation, based on information supplied by the operating system.
- DMA (direct memory access): a well-known way to improve a system's real-time performance is to add a logic block that responds when events occur and lets the processor handle the information at a more convenient time. This DMA controller typically copies the information delivered to the block into memory (RAM) and lets processed information move automatically from memory to external peripherals, all independently of current CPU activity (see Figure 1). This certainly helps, but its benefit is limited to postponing the inevitable: the CPU still has to process the information at some point. The S12X takes a more radical approach, providing an "intelligent DMA" controller that not only moves the data but performs all the processing directly.
- Now, both Intel and AMD have added a set of instructions that makes virtualization easier on x86. AMD's extensions are known as Pacifica, whereas Intel's extensions are known simply as (Intel) Virtualization Technology (IVT or VT). The idea behind these is to extend the x86 ISA to make up for the shortcomings in the existing instruction set. Conceptually, they can be thought of as adding a "ring -1" above ring 0, allowing the OS to stay where it expects to be and catching attempts to access the hardware directly.
- (XEN)
- xen: This package contains the Xen tools and management daemons needed to run virtual machines on x86, x86_64, and ia64 systems. Information on how to use Xen can be found at the Xen project pages. The Xen system also requires the Xen hypervisor and domain-0 kernel, which can be found in the kernel-xen* package.
- kernel-xen:
- (XEN) Why does Xen separate policy from mechanism, i.e. why does Xen try so hard to minimize hypervisor code? Because a bug in Xen endangers every virtual machine running on it, with very wide impact; keeping the code base small reduces the chance of bugs and increases stability.
- (XEN) Early versions of Xen did a lot more in the hypervisor: network multiplexing, for example. Modern operating systems already include very flexible features for bridging and tunnelling virtual network interfaces, so it makes more sense to use these than implement something new.
- Another advantage of relying on Domain 0 features is ease of administration: since Domain 0 runs a familiar operating system, a BSD or Linux administrator already has a significant amount of time and effort invested in learning about it. Such an administrator can use Xen easily, since she can re-use her existing knowledge.
- (XEN)
- model=e1000: Intel E1000 (82540-family) Ethernet
- model=rtl8139: Realtek 8139 Ethernet
- model=* : Realtek 8139 Ethernet (the default)
- (XEN) xm list -> Times: The Time column is deceptive. Virtual IO (network and block devices) used by domains requires coordination with Domain0, which performs the actual IO. Using this time value to determine relative utilization by domains is thus very suspect, as a high IO workload may show as less utilized than a high CPU workload. Consider yourself warned.
- (XEN) disk
disk = [ "stanza1", "stanza2", ... ]
Each stanza has 3 terms, separated by commas, "backend-dev,frontend-dev,mode".
backend-dev
The device in the backend domain that will be exported to the guest (frontend) domain. Supported formats
include:
phy:device - export the physical device listed. The device can be in symbolic form, as in sda7, or as the
hex major/minor number, as in 0x301 (which is hda1).
file://path/to/file - export the file listed as a loopback device. This will take care of the loopback
setup before exporting the device.
frontend-dev
How the device should appear in the guest domain. The device can be in symbolic form, as in sda7, or as
the hex major/minor number, as in 0x301 (which is hda1).
mode
The access mode for the device. There are currently 2 valid options, r (read-only), w (read/write).
- (XEN) The output contains a variety of labels created by a user to define the rights and access privileges of certain domains. This is a part of sHype Xen Access Controls, which is another level of security an administrator may use. This is an optional feature of Xen, and in fact must be compiled into a Xen kernel if it is desired.
- DomU: it is normally not allowed to issue any hypercall that directly accesses hardware, although in some situations it may be granted access to one or more devices
- IOMMU/MMU
IOMMU = input/output memory management unit
MMU = memory management unit
The IOMMU is an MMU that connects a DMA-capable I/O bus to the primary storage memory.
Like the CPU's memory management unit, an IOMMU takes care of mapping virtual (device) addresses to physical addresses, and some units also guarantee memory protection against misbehaving devices.
IOMMUs (IO Memory Management Units) are hardware devices that translate device DMA addresses to machine addresses. An isolation-capable IOMMU restricts a device so that it can only access the parts of memory it has been explicitly granted access to.
Think of the DMA controller as a CPU that uses the IOMMU to access main memory, in exactly the way an ordinary CPU uses the MMU.
The IOMMU mainly solves the problem of the I/O bus being too narrow: for example, with 64G of main memory but a 32-bit PCI peripheral bus, a peripheral's DMA controller can only reach the low 0-4G of memory, which is exactly where memory is usually tight, so the IOMMU is used to remap DMA addresses.
The OS controls both the IOMMU and the MMU.
Memory protection from malicious or misbehaving devices: a device cannot read or write memory that has not been explicitly allocated (mapped) for it. The protection rests on the fact that the OS running on the CPU exclusively controls both the MMU and the IOMMU; the devices are physically unable to circumvent or corrupt the configured memory management tables.
- (XEN)
- When an operating system boots, one of the first things it typically does is query the firmware for information about its environment: things like the amount of RAM available, what peripherals are connected, and what the current time is.
A kernel booting in a Xen environment, however, does not have access to the real firmware. Instead, there must be another mechanism. Much of the required information is provided by shared memory pages. There are two of these: the first is mapped into the guest's address space by the domain builder at guest boot time; the second must be explicitly mapped by the guest.
- MMU:
First, two concepts: virtual address and physical address. If the processor has no MMU, or the MMU is not enabled, the memory address issued by the CPU's execution unit is driven directly onto the chip pins and received by the memory chip (called physical memory below, to distinguish it from virtual memory); such an address is a physical address (Physical Address, PA). (Figure 17.6 of the source illustrated the virtual-address path; the figure is not reproduced here.)
- Address ranges, mapping virtual addresses to physical addresses, and paging
On a machine without virtual memory, the virtual address is put directly on the memory bus, and the physical memory word with that address is read or written. With virtual memory, the virtual address is not put on the bus directly; instead it goes to the memory management unit (MMU), which maps it to a physical address.
Most virtual memory systems use a mechanism called paging. The virtual address space is divided into units called pages, and the physical address space is correspondingly divided into page frames; pages and frames must be the same size. In this example we have a machine that generates 32-bit addresses, giving a virtual range of 0~0xFFFFFFFF (4G), but only 256M of physical memory: it can run a 4G program, but cannot load it all into memory at once. The machine must have external storage (such as disk or flash) large enough to hold the 4G program, so that pieces can be brought in when needed. Here the page size is 4K and the frame size is the same, as it must be, because transfers between memory and external storage are always in whole pages. For the 4G virtual space and 256M of physical memory, that makes 1M pages and 64K page frames respectively.
Functions
Mapping linear addresses to physical addresses
Modern multi-user, multi-process operating systems need an MMU to give each user process its own address space. Using the MMU, the OS carves out an address region in which each process may see different contents. For example, Microsoft Windows designates the range 4M-2G as user address space: process A maps its executable at 0x400000 (4M), and process B likewise maps its own executable at 0x400000. When A reads address 0x400000 it sees the RAM contents mapped from A's executable; when B reads the same address it sees B's. That is the address translation the MMU performs in between.
Hardware-enforced memory access authorization
Microprocessors have carried on-chip memory management units (MMUs) for many years, allowing individual software threads to run in hardware-protected address spaces; yet many commercial real-time operating systems do not use the MMU even when the hardware provides it.
When all of an application's threads share a single memory space, any thread can, intentionally or not, corrupt another thread's code, data, or stack; a misbehaving thread can even corrupt kernel code or internal kernel data structures. A stray pointer in one thread can easily crash the whole system, or at least make it behave abnormally.
In terms of safety and reliability, a process-based real-time operating system (RTOS) performs better. To create processes with separate address spaces, the RTOS only needs to build some RAM-based data structures and have the MMU enforce protection on them. The basic idea is to "plug in" a new set of logical addresses at each context switch. The MMU uses the current mapping to translate the logical addresses used in instruction fetches and data accesses into physical memory addresses, and it flags accesses to illegal logical addresses that are not mapped to any physical address.
These processes add the overhead inherent in accessing memory through lookup tables, but the payoff is high: at process boundaries, accidents and errors stop. A defect in a user-interface thread cannot corrupt the code or data of a more critical thread. That complex embedded systems with high reliability and safety requirements still ship on operating systems without memory protection is rather hard to believe.
Using an MMU also makes it possible to selectively map and unmap pages into the logical address space. Physical pages are mapped into logical space to hold the current process's code; the remaining pages are mapped for data. Similarly, physical pages are mapped to hold each thread's stack. The RTOS can easily leave the page of logical addresses following each thread stack unmapped, so that if any thread overflows its allocated stack, a hardware memory-protection fault occurs and the kernel suspends the thread instead of letting it corrupt other important memory in the address space, such as another thread's stack. This adds memory protection not only between threads but within a single address space.
Memory protection (including this kind of stack-overflow detection) is also very effective during application development: with it, a programming error produces an exception that is detected immediately and traced back to the source code. Without it, the error causes subtle, hard-to-track failures. In fact, since RAM in a flat memory model usually sits at physical page zero, even NULL pointer dereferences go undetected.
MMU and CPU
The x86 MMU
Intel's 80386 and later CPUs integrate an MMU, providing a 32-bit, 4G address space.
The x86 MMU offers 4K/2M/4M page modes (depending on the CPU); described here is the 4K paging mechanism used by most operating systems today, with the access-check parts omitted.
Registers involved
a) GDT
b) LDT
c) CR0
d) CR3
e) segment registers
Steps of virtual-to-physical address translation
a) The segment register is used as an index into the GDT or LDT, fetching the corresponding GDT/LDT entry.
Note: segmentation cannot be disabled, even in FLAT mode; saying that FLAT mode does not use segment registers is wrong. Every RAM-addressing instruction carries a default segment assumption, and unless a segment override prefix changes the segment for that instruction, the default segment is used.
Entry format
ENTRY格式
typedef struct
{
UINT16 limit_0_15;
UINT16 base_0_15;
UINT8 base_16_23;
UINT8 accessed : 1;
UINT8 readable : 1;
UINT8 conforming : 1;
UINT8 code_data : 1;
UINT8 app_system : 1;
UINT8 dpl : 2;
UINT8 present : 1;
UINT8 limit_16_19 : 4;
UINT8 unused : 1;
UINT8 always_0 : 1;
UINT8 seg_16_32 : 1;
UINT8 granularity : 1;
UINT8 base_24_31;
} CODE_SEG_DESCRIPTOR,*PCODE_SEG_DESCRIPTOR;
typedef struct
{
UINT16 limit_0_15;
UINT16 base_0_15;
UINT8 base_16_23;
UINT8 accessed : 1;
UINT8 writeable : 1;
UINT8 expanddown : 1;
UINT8 code_data : 1;
UINT8 app_system : 1;
UINT8 dpl : 2;
UINT8 present : 1;
UINT8 limit_16_19 : 4;
UINT8 unused : 1;
UINT8 always_0 : 1;
UINT8 seg_16_32 : 1;
UINT8 granularity : 1;
UINT8 base_24_31;
} DATA_SEG_DESCRIPTOR,*PDATA_SEG_DESCRIPTOR;
There are 4 entry formats in all; shown above are the CODE SEGMENT and DATA SEGMENT formats. In FLAT mode an entry has 0 in base_0_15 and base_16_23, 0xfffff across limit_0_15/limit_16_19, and granularity set to 1, indicating a segment address space running from 0 to 0xFFFFFFFF, the full 4G.
b) Fetch the BASE ADDRESS and LIMIT from the segment entry. The address to be accessed is first access-checked against the segment limit.
c) Add the BASE ADDRESS to the address being accessed, forming the 32-bit virtual address, which is interpreted in the following format:
typedef struct
{
UINT32 offset :12;
UINT32 page_index :10;
UINT32 pdbr_index :10;
} VA,*LPVA;
d) pdbr_index is used as an index into the table at CR3, yielding a data structure defined as follows
typedef struct
{
UINT8 present :1;
UINT8 writable :1;
UINT8 supervisor :1;
UINT8 writethrough:1;
UINT8 cachedisable:1;
UINT8 accessed :1;
UINT8 reserved1 :1;
UINT8 pagesize :1;
UINT8 ignoreed :1;
UINT8 avl :3;
UINT8 ptadr_12_15 :4;
UINT16 ptadr_16_31;
}PDE,*LPPDE;
e) Take the PAGE TABLE address from it, and use page_index as an index to obtain the following data structure
typedef struct
{
UINT8 present :1;
UINT8 writable :1;
UINT8 supervisor :1;
UINT8 writethrough:1;
UINT8 cachedisable:1;
UINT8 accessed :1;
UINT8 dirty :1;
UINT8 pta :1;
UINT8 global :1;
UINT8 avl :3;
UINT8 ptadr_12_15 :4;
UINT16 ptadr_16_31;
}PTE,*LPPTE;
f) The PTE yields the BASE ADDRESS of the page's true physical address, i.e. the high 20 bits of the physical address; adding the offset from the virtual address gives the final physical address.
The ARM MMU
In ARM CPUs the MMU exists as a coprocessor; availability varies by series, so the datasheet must be consulted, but when present it is always coprocessor 15. It provides a 32-bit, 4G address space.
The ARM MMU offers 1K/4K/64K paging modes; described here is the 4K mode operating systems typically use.
The registers involved are all located in coprocessor 15.
ARM address translation involves three kinds of address: the virtual address (VA), the modified virtual address (MVA), and the physical address (PA). With the MMU off, the CPU core, caches, MMU, and peripherals all use physical addresses. With the MMU on, the CPU core emits VAs, which are converted to MVAs for use by the cache and MMU, then converted again to PAs that are finally used to access the actual device.
ARM has no segment registers; it is a truly FLAT-mode CPU. A given address can be interpreted as the following structure:
typedef struct
{
UINT32 offset :12;
UINT32 page_index :8;
UINT32 pdbr_index :12;
} VA,*LPVA;
Take bits 14-31 from MMU register 2; pdbr_index is the index into that table, each entry being 4 bytes, with the structure
typedef struct
{
UINT32 type :2; //always set to 01b
UINT32 writebackcacheable:1;
UINT32 writethroughcacheable:1;
UINT32 ignore :1; //set to 1b always
UINT32 domain :4;
UINT32 reserved :1; //set 0
UINT32 base_addr:22;
} PDE,*LPPDE;
The PDE yields the address of an array of the following structures; use page_index as the index to fetch the entry.
typedef struct
{
UINT32 type :2; //always set to 11b
UINT32 ignore :3; //set to 100b always
UINT32 domain :4;
UINT32 reserved :3; //set 0
UINT32 base_addr:20;
} PTE,*LPPTE;
The base address obtained from the PTE, combined with the offset, forms the physical address.
The remaining bits in the PDE/PTE are used for access control; described here is the all-normal case in which the physical address is assembled successfully.
Differences between ARM and x86 MMU usage
1. x86 always has the concept of segments, while ARM does not (no segment registers).
2. ARM has the concept of a domain, used for access authorization, which x86 lacks. A general-purpose OS that targets both CPUs will usually forgo the use of domains.
- (XEN) Shared info page: its contents live in a structure, and the guest kernel declares a pointer to that structure. During guest boot, the hypervisor first stores the structure's address in the esi register; the guest then reads that register during boot and thereby obtains the page's location.
- Red Hat Enterprise Virtualization (RHEV) is a complete enterprise virtualization management solution for server and desktop virtualization, based on Kernel-based Virtual Machine (KVM) technology.
- In computing, hardware-assisted virtualization is a platform virtualization approach that enables efficient full virtualization using help from hardware capabilities, primarily from the host processors. Full virtualization is used to simulate a complete hardware environment, or virtual machine, in which an unmodified guest operating system (using the same instruction set as the host machine) executes in complete isolation. Hardware-assisted virtualization was added to x86 processors (Intel VT-x or AMD-V) in 2006.
Hardware-assisted virtualization is also known as accelerated virtualization; Xen calls it hardware virtual machine (HVM), and Virtual Iron calls it native virtualization.
- (XEN) Attach a virtual bridge to a NIC:
#/etc/xen/scripts/network-bridge netdev=eth0 bridge=xenbr1 start
- Software virtualization is unsupported by Red Hat Enterprise Linux.
- Term of RHEV-H, RHEV-M, Hyper-V
- 影子頁表和嵌套頁表(概念)
- 硬件虛擬話+1 Intel-EPT, AMD- NPT
NPT: Nested Page Tables
- Instead, targeted modifications are introduced to make it simpler and faster to support multiple guest operating systems. For example, the guest operating system might be modified to use a special hypercall application binary interface (ABI) instead of using certain architectural features that would normally be used. This means that only small changes are typically required in the guest operating systems, but any such changes make it difficult to support closed-source operating systems that are distributed in binary form only, such as Microsoft Windows. As in full virtualization, applications are typically still run unmodified. Figure 1.4 illustrates paravirtualization.
Xen extends this model to device I/O. It exports simplified, generic device interfaces to guest operating systems. This is true of a Xen system even when it uses hardware support for virtualization allowing the guest operating systems to run unmodified. Only device drivers for the generic Xen devices need to be introduced into the system.
- QEMU— Another example of an emulator, but the ways in which QEMU is unlike Bochs are worth noting. QEMU supports two modes of operation. The first is the Full System Emulation mode, which is similar to Bochs in that it emulates a full personal computer (PC), including peripherals. This mode emulates a number of processor architectures, such as x86, x86_64, ARM, SPARC, PowerPC, and MIPS, with reasonable speed using dynamic translation. Using this mode, you can emulate an environment capable of hosting the Microsoft Windows operating systems (including XP) and Linux guests hosted on Linux, Solaris, and FreeBSD platforms.
Additional operating system combinations are also supported. The second mode of operation is known as User Mode Emulation. This mode is only available when executing on a Linux host and allows binaries for a different architecture to be executed. For instance,
binaries compiled for the MIPS architecture can be executed on Linux executing on x86 architectures. Other architectures supported in this mode include SPARC, PowerPC, and ARM, with more in development. Xen relies on the QEMU device model for HVM guests.
- Emulation: every instruction is translated in software; extremely slow
- VMware has a bare-metal product, ESX Server. With VMware Workstation, the hypervisor runs in hosted mode as an application installed on top of a base operating system such as Windows or Linux
- The method of KVM operation is rather interesting. Each guest running on KVM is actually executed in user space of the host system. This approach makes each guest instance (a given guest kernel and its associated guest user space) look like a normal process to the underlying host kernel. Thus KVM has weaker isolation than other approaches we have discussed. With KVM, the well-tuned Linux process scheduler performs the hypervisor task of multiplexing across the virtual machines just as it would multiplex across user processes under normal operation. To accomplish this, KVM has introduced a new mode of execution that is distinct from the typical modes (kernel and user) found on a Linux system. This new mode designed for virtualized guests is aptly called guest mode. Guest mode has its own user and kernel modes. Guest mode is used when performing execution of all non-I/O guest code, and KVM falls back to normal user mode to support I/O operations for virtual guests.
- The paravirtualization implementation known as User-mode Linux (UML) allows a Linux operating system to execute other Linux operating systems in user space
- The Xen hypervisor sits above the physical hardware and presents guest domains with a virtual hardware interface
- What does Intel VT provide?
- With hardware support for virtualization such as Intel's VT-x and AMD's AMD-V extensions, these additional protection rings become less critical. These extensions provide root and nonroot modes that each have rings 0 through 3. The Xen hypervisor can run in root mode while the guest OS runs in nonroot mode in the ring for which it was originally intended.
- Domain0 runs a device driver specific to each actual physical device and then communicates with other guest domains through an asynchronous shared memory transport.
- The physical device driver running in Domain0 or a driver domain is called a backend, and each guest with access to the device runs a generic device frontend driver. The backends provide each frontend with the illusion of a generic device that is dedicated to that domain. Backends also implement the sharing and multiple accesses required to give multiple guest domains the illusion of their own dedicated copy of the device.
- Intel VT
When using Intel VT technology, Xen executes in a new operational state called Virtual Machine Extensions (VMX) root operation mode. The unmodified guest domains execute in the other newly created CPU state, VMX non-root operation mode. Because the DomUs
run in non-root operation mode, they are confined to a subset of operations available to the system hardware. Failure to adhere to the restricted subset of instructions causes a VM exit to occur, along with control returning to Xen.
AMD-V
Xen 3.0 also includes support for the AMD-V processor. One of AMD-V's benefits is a tagged translation lookaside buffer (TLB). Using this tagged TLB, guests get mapped to an address space that can be altogether different from what the VMM sets. The reason it
is called a tagged TLB is that the TLB contains additional information known as address space identifiers (ASIDs). ASIDs ensure that a TLB flush does not need to occur at every context switch.
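The benefit of the tagged TLB can be sketched with a toy model in which each entry is keyed by (ASID, virtual page), so translations for two guests coexist and no flush is needed on a context switch (the TaggedTLB class is invented for illustration):

```python
class TaggedTLB:
    """Entries are keyed by (asid, virtual_page), so a context switch
    just changes the current ASID instead of flushing the whole TLB."""
    def __init__(self):
        self.entries = {}
    def insert(self, asid, vpage, ppage):
        self.entries[(asid, vpage)] = ppage
    def lookup(self, asid, vpage):
        return self.entries.get((asid, vpage))  # miss -> None

tlb = TaggedTLB()
tlb.insert(asid=1, vpage=0x10, ppage=0xAA)  # guest 1
tlb.insert(asid=2, vpage=0x10, ppage=0xBB)  # guest 2, same virtual page
# Switching between the guests needs no flush; both translations coexist:
print(hex(tlb.lookup(1, 0x10)), hex(tlb.lookup(2, 0x10)))  # 0xaa 0xbb
```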
AMD also introduced a new technology to control access to I/O, called the I/O Memory Management Unit (IOMMU), which is analogous to Intel's VT-d technology. The IOMMU manages virtual machine I/O: it limits DMA access to what is valid for the virtual machine, and it allows real devices to be assigned directly to VMs. One way to check whether your processor is AMD-V capable from a Linux system is to look for an svm flag in the output of /proc/cpuinfo. If the flag is present, you likely have an AMD-V processor.
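The /proc/cpuinfo check above can be sketched in Python; this toy parser looks for the vmx or svm flag in a cpuinfo dump (the sample text is made up, and a real check would read /proc/cpuinfo itself):

```python
def hw_virt_support(cpuinfo_text):
    """Return which hardware virtualization flag a /proc/cpuinfo dump shows."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            flags = line.split(":", 1)[1].split()
            if "vmx" in flags:
                return "Intel VT-x"
            if "svm" in flags:
                return "AMD-V"
    return None

# Fabricated sample; on a live system use open("/proc/cpuinfo").read() instead.
sample = "processor : 0\nflags : fpu pae mce svm sse2"
print(hw_virt_support(sample))  # AMD-V
```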
HVM
Intel VT and AMD's AMD-V architectures are fairly similar and conceptually have much in common, but their implementations differ slightly, so it makes sense to provide a common interface layer that abstracts their nuances away. Thus the HVM interface was born. The original code for the HVM layer was implemented by Leendert van Doorn of the IBM Watson Research Center and was contributed to the upstream Xen project. A compatibility listing is located at the Xen Wiki at
http://wiki.xensource.com/xenwiki/HVM_Compatible_Processors.
The HVM layer is implemented using a function call table (hvm_function_table), which contains the functions that are common to both hardware virtualization implementations. These methods, including initialize_guest_resources() and store_cpu_guest_regs(),
are implemented differently for each backend under this unified interface.
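The function-table idea can be sketched as a small dispatch table: one unified operation name backed by per-backend implementations. The Python names below are illustrative stand-ins, not Xen's actual hvm_function_table entries:

```python
# Toy dispatch table in the spirit of hvm_function_table: a unified
# interface with separate VT-x (vmx) and AMD-V (svm) implementations.
def vmx_store_cpu_guest_regs(vcpu):
    return f"vmx: saved regs of vcpu {vcpu}"

def svm_store_cpu_guest_regs(vcpu):
    return f"svm: saved regs of vcpu {vcpu}"

hvm_funcs = {
    "vmx": {"store_cpu_guest_regs": vmx_store_cpu_guest_regs},
    "svm": {"store_cpu_guest_regs": svm_store_cpu_guest_regs},
}

def hvm_call(backend, op, *args):
    """Callers name only the operation; the table picks the implementation."""
    return hvm_funcs[backend][op](*args)

print(hvm_call("vmx", "store_cpu_guest_regs", 0))  # vmx: saved regs of vcpu 0
```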
- Networking Devices
- Xen virtual block devices can be backed by various storage types:
filesystem image or partitioned image: raw, qcow
standard network storage protocols such as NBD, iSCSI, NFS, etc.
[host-a]#ls /sys/bus/
acpi/ i2c/ pci/ pcmcia/ pnp/ serio/ xen/
bluetooth/ ide/ pci_express/ platform/ scsi/ usb/ xen-backend/
[host-a]#ls /sys/bus/xen/drivers/
pcifront
[host-a]#ls /sys/bus/xen-backend/drivers/
tap vbd vif
Xen backend drivers:
blktap, blkbk
netloop, netbk
- loop here means using a regular file to emulate a block device
- vfb: VMs under Xen deliver their display over VNC, so vfb (virtual framebuffer device) configures the system display together with the keyboard/mouse input devices.
- VMX root operation and VMX non-root operation are collectively called the VMX operation modes
- VT-x
With VT-x, instructions involving the GDT, IDT, LDT, TSS, and so on can run normally inside the virtual machine, whereas previously these privileged instructions had to be emulated. The VMM is likewise freed from emulating privileged instructions, which solves both the Ring Aliasing problem (problems caused by software actually running in a different ring than the one it was designed for) and the Ring Compression problem, greatly improving efficiency. Solving Ring Compression also solves the problem of running 64-bit guest operating systems.
To support this two-mode architecture, VT-x defines a data structure called the Virtual-Machine Control Structure (VMCS), which contains a Guest-State Area and a Host-State Area holding the state parameters of the virtual machine and of the host, and provides two operations, VM entry and VM exit, for switching between the virtual machine and the VMM. Through the VM-execution control fields of the VMCS, the user can specify which instructions or events cause a virtual machine in VMX non-root operation to perform a VM exit and hand control back to the VMM. VT-x thus solves both the isolation problem and the performance problem of virtual machines.
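A toy model of the VMCS-driven control transfer might look like this in Python; the field names (guest_state, exec_controls, and so on) are simplified inventions, not the real VMCS encoding:

```python
# Toy VMCS: guest/host state areas plus execution-control fields that
# decide which instructions force a VM exit.
vmcs = {
    "guest_state": {"rip": 0x1000, "cr3": 0x5000},
    "host_state":  {"rip": 0xF000, "cr3": 0x9000},
    "exec_controls": {"exit_on": {"hlt", "rdmsr"}},
}

def vm_entry(vmcs, guest_insns):
    """Run the guest until an instruction listed in the exec controls
    triggers a VM exit, handing control back to the VMM at its host RIP."""
    for insn in guest_insns:
        if insn in vmcs["exec_controls"]["exit_on"]:
            return ("vm_exit", insn, vmcs["host_state"]["rip"])
    return ("guest_done", None, None)

print(vm_entry(vmcs, ["mov", "add", "hlt"]))  # ('vm_exit', 'hlt', 61440)
```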
- IOMMU, VT-d, VT-x, MMU, HVM , PV, DMA, TLB, QEMU
- AMD-V (terminology compared with VT-x)
The data structure that VT-x uses to hold virtual-machine state and control information is called the VMCS; AMD calls it the VMCB
The TLB-entry field that VT-x uses to tag a VM's address space is the VPID; AMD-V calls it the ASID
VT-x: root operation mode and non-root operation mode. AMD-V: guest operation mode and host operation mode
The VMCS/VMCB contains all the information needed to launch and control a virtual machine
The point of guest/non-root mode is that it places the guest operating system in a completely different environment without requiring any changes to its code; privileged instructions executed in this mode can be intercepted by the VMM even when running in Ring 0. In addition, through the various intercept-control fields of the VMCB, the VMM can selectively intercept instructions and events, or set conditional intercepts, so all sensitive privileged and non-privileged instructions remain under its control.
- VT-d
Traditional IOMMUs (I/O memory management units) provide a centralized way to manage all DMA: besides conventional internal DMA, this covers special forms such as AGP GART, TPT, and RDMA over TCP/IP. They distinguish devices by memory address range, which is easy to implement but makes DMA isolation difficult. VT-d therefore uses a redesigned IOMMU architecture that supports multiple DMA protection domains, finally achieving DMA virtualization. This technique is also called DMA Remapping.
The interrupt remapping implemented by VT-d supports all I/O sources, including IOAPICs, and all interrupt types, such as ordinary MSI and extended MSI-X.
VT-d makes many further changes, such as hardware buffering and address translation; through all of these measures, VT-d implements I/O device virtualization at the northbridge chipset level. In the virtualization model, VT-d ultimately shows up as two new ways of virtualizing devices:
Direct I/O device assignment: a physical I/O device is assigned directly to a virtual machine. In this model the driver inside the VM communicates with the hardware directly, with little or no mediation by the VMM. For robustness, hardware virtualization support is required to isolate and protect the hardware resource for the designated VM only, and the hardware must also provide multiple I/O container partitions to serve several VMs at the same time. This model almost entirely eliminates the need to run drivers inside the VMM. The CPU, although not an I/O device in the usual sense, is in fact assigned to virtual machines in this way, with its resources still managed by the VMM.
Using VT-d, virtual machines can replace the traditional device-emulation / extra-device-interface approach with direct I/O device assignment or I/O device sharing, greatly improving virtualized I/O performance.
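The DMA Remapping idea can be sketched as a lookup table that only permits a device to DMA into the address ranges assigned to its VM's protection domain (the IOMMU class and the device names are invented for illustration):

```python
class IOMMU:
    """Toy DMA-remapping table: each device may DMA only into the
    protection domain (address ranges) assigned to its VM."""
    def __init__(self):
        self.domains = {}  # device -> list of (start, end) allowed ranges
    def assign(self, device, start, end):
        self.domains.setdefault(device, []).append((start, end))
    def dma_allowed(self, device, addr):
        return any(s <= addr < e for s, e in self.domains.get(device, []))

iommu = IOMMU()
iommu.assign("nic0", 0x1000, 0x2000)      # nic0 passed through to one VM
print(iommu.dma_allowed("nic0", 0x1800))  # True  (inside its domain)
print(iommu.dma_allowed("nic0", 0x3000))  # False (blocked: DMA isolation)
```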
- The Sun xVM Server network drivers use a similar approach to the disk block driver for handling network packets. On DomU, the pseudo network driver xnf (xen-netfront) gets I/O requests from the network stack and sends them to xnb (xen-netback) on Dom0. The backend network driver xnb on Dom0 forwards packets sent by xnf to the native network driver. Buffer management for packet receiving has a greater impact on network performance than packet transmitting does. On the receiving end, data is transferred via DMA into the native driver's receive buffer on Dom0. The packet is then copied from the native driver buffer to the VMM buffer, and the VMM buffer is mapped into the DomU kernel address space without another copy of the data.
- Data is transferred via DMA into the native driver, bge, receive buffer ring.
- The xnb driver gets a new buffer from the VMM and copies data from the bge receive ring to the new buffer.
- The xnb driver sends DomU an event through the event channel.
- The xnf driver in DomU receives an interrupt.
- The xnf driver maps an mblk(9S) to the VMM buffer and sends the mblk(9S) to the upper stack.
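The five receive steps above can be sketched as a toy Python walk-through; names like bge_ring and the events list are illustrative stand-ins for the real driver structures and the Xen event channel:

```python
events = []  # stands in for the Xen event channel

def receive_packet(packet):
    bge_ring = [bytes(packet)]   # 1. DMA places the data in the bge receive ring
    vmm_buf = bge_ring.pop()     # 2. xnb copies it into a fresh VMM buffer
    events.append("evtchn")      # 3. xnb notifies DomU over the event channel
    mblk = memoryview(vmm_buf)   # 4./5. xnf takes the interrupt and maps an
    return bytes(mblk)           #       mblk(9S) onto the buffer (no extra copy)

print(receive_packet(b"ping"))  # b'ping'
```

Note the single copy (step 2) followed by a mapping (step 5): that matches the point above that receive-side buffer management, not transmit, dominates network performance.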