How Do Windows NT System Calls REALLY Work?--Windows NT的系統調用究竟是如何工作的?

出處: http://www.codeguru.com/Cpp/W-P/system/devicedriverdevelopment/article.php/c8035/

Most texts that describe Windows NT system calls keep many of the important details in the dark. This leads to confusion when trying to understand exactly what is going on when a user-mode application "calls into" kernel mode. The following article will shed light on the exact mechanism that Windows NT uses when switching to kernel-mode to execute a system service. The description is for an x86 compatible CPU running in protected mode. Other platforms supported by Windows NT will have a similar mechanism for switching to kernel-mode.

許多文章在描述Windows NT的系統調用時,忽略了很多重要的細節.不瞭解這些細節的話,那麼會妨礙我們確切地理解一個用戶態的程序如何能夠"調用進入"內核模式.接下來的文章將會揭示Windows NT通過什麼樣的機制切換到內核模式去執行系統服務.此處的描述適合於在保護模式下面運行的x86兼容的CPU.其它支持Windows NT的CPU應該也會有類似的切換內核模式的方式.

What is kernel-mode?(什麼是內核模式?)

Contrary to what most developers believe (even kernel-mode developers), there is no mode of the x86 CPU called "Kernel-mode". Other CPUs such as the Motorola 68000 have two processor modes "built into" the CPU, i.e. a flag in a status register tells the CPU whether it is currently executing in user-mode or supervisor-mode. Intel x86 CPUs do not have such a flag. Instead, it is the privilege level of the code segment that is currently executing that determines the privilege level of the executing program. Each code segment in an application that runs in protected mode on an x86 CPU is described by an 8 byte data structure called a Segment Descriptor. A segment descriptor contains (among other information) the start address of the code segment that it describes, the length of the code segment and the privilege level that the code in the code segment will execute at. Code that executes in a code segment with a privilege level of 3 is said to run in user mode and code that executes in a code segment with a privilege level of 0 is said to execute in kernel mode. In other words, kernel-mode (privilege level 0) and user-mode (privilege level 3) are attributes of the code and not of the CPU. Intel calls privilege level 0 "Ring 0" and privilege level 3 "Ring 3". There are two more privilege levels in the x86 CPU that are not used by Windows NT (ring 1 and 2). The reason privilege levels 1 and 2 are not used is that Windows NT was designed to run on several other hardware platforms that may or may not have four privilege levels like the Intel x86 CPU.

與諸多開發者(甚至是內核模式的開發者)的普遍看法相反,x86系列CPU並沒有所謂的"內核模式".其它的CPU比如Motorola 68000則在CPU裏面"內建"了兩種處理器模式,也就是說狀態寄存器有個標誌位可以讓CPU知道自己當前是在用戶模式還是在管理模式(supervisor mode).Intel x86系列CPU沒有類似的標誌.取而代之的是程序的權限由當前執行的代碼段的權限來決定.在x86保護模式下面運行的程序的每一個代碼段都有一個對應的8字節大小的被稱爲段描述子的結構.一個段描述子包含了對應的代碼段的開始地址、代碼段的長度以及代碼段的權限.在一個權限爲3的代碼段裏面的代碼是運行在用戶模式下面,權限爲0的代碼段裏面的代碼是運行在內核模式下面.換句話說,內核模式(權限爲0)以及用戶模式(權限爲3)是代碼的屬性而不是CPU的.Intel稱權限0爲"Ring 0"而權限3爲"Ring 3".在x86系列CPU裏面還有兩個權限沒有被Windows NT使用到(ring 1和ring 2).原因是Windows NT設計爲可以在多個硬件平臺下面運行,而這些硬件平臺很可能不像Intel x86系列CPU那樣有四種權限.

The x86 CPU will not allow code that is running at a lower privilege level (numerically higher) to call directly into code that is running at a higher privilege level (numerically lower). If this is attempted, a general protection (GP) exception is automatically generated by the CPU. A general protection exception handler in the operating system will be called and the appropriate action can be taken (warn the user, terminate the application etc). Note that all the memory protection discussed above, including the privilege levels, is a feature of the x86 CPU and not of Windows NT. Without this support from the CPU, Windows NT could not implement memory protection as described above.

x86系列CPU不允許低權限的代碼段裏面的代碼調用進入高權限的代碼段裏面的代碼.如果試圖這麼做的話,就會引起一個CPU的一般性保護異常.這個一般性保護異常會被操作系統調用恰當的操作來處理(比如警告用戶,退出程序等等).注意所有的上面說到的內存保護以及權限保護都是x86系列CPU的特性而不是Windows NT的.沒有CPU的支持Windows NT是沒有辦法做到上面所說的內存保護以及權限保護.
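The rule above can be sketched in a few lines of Python. This is a simplified model, not the CPU's full algorithm: it covers only a direct far call to a non-conforming code segment (no gate involved), and it relies on the architectural fact that the low two bits of the CS selector give the privilege level of the running code. The selector value 0x1B for user-mode code is an assumption chosen for illustration; 0x08 is the kernel code selector that appears in the debugger dump in the Experiment section.

```python
class GeneralProtectionFault(Exception):
    """Stands in for the #GP exception raised by the CPU."""


def current_privilege_level(cs_selector: int) -> int:
    # The low two bits of the selector loaded in CS hold the privilege
    # level of the code that is currently executing.
    return cs_selector & 0x3


def direct_far_call(cs_selector: int, target_dpl: int) -> int:
    """Model a far CALL straight to a non-conforming code segment.

    Simplified: the caller's privilege level must equal the privilege
    level of the target segment, otherwise the CPU raises #GP.
    """
    cpl = current_privilege_level(cs_selector)
    if cpl != target_dpl:
        raise GeneralProtectionFault(
            f"ring {cpl} code may not call ring {target_dpl} code directly"
        )
    return target_dpl


USER_CS = 0x1B    # assumed user-mode code selector (illustrative)
KERNEL_CS = 0x08  # kernel code selector seen in the dump later on

try:
    direct_far_call(USER_CS, target_dpl=0)
    outcome = "allowed"
except GeneralProtectionFault:
    outcome = "general protection fault"
```

The sketch only shows why a user-mode program cannot simply call a kernel address; the gate mechanism that makes controlled transitions possible is described in the following sections.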

Where do the Segment Descriptors reside?(段描述子存放在哪裏?)

Since each code segment that exists in the system is described by a segment descriptor, and since there are potentially many, many code segments in a system (each program may have many), the segment descriptors must be stored somewhere so that the CPU can read them in order to grant or deny access to a program that wishes to execute code in a segment. Intel did not choose to store all this information on the CPU chip itself but instead in main memory. There are two tables in main memory that store segment descriptors: the Global Descriptor Table (GDT) and the Local Descriptor Table (LDT). There are also two registers in the CPU that hold the addresses and sizes of these descriptor tables so that the CPU can find the segment descriptors. These registers are the Global Descriptor Table Register (GDTR) and the Local Descriptor Table Register (LDTR). It is the operating system's responsibility to set up these descriptor tables and to load the GDTR and LDTR registers with the addresses of the GDT and LDT respectively. This has to be done very early in the boot process, even before the CPU is switched into protected mode, because without the descriptor tables no memory segments can be accessed in protected mode. Figure 1 below illustrates the relationship between the GDTR, LDTR, GDT and the LDT.

因爲每個代碼段都需要一個段描述子並且這些代碼段數量還可能會非常多,所以這些段描述子必須存儲在某個CPU可以讀取的位置,以便接受或者拒絕某個程序執行某個代碼段裏的代碼的請求.Intel並沒有選擇在CPU本身儲存這些信息,而是選擇保存在主存裏面.在主存裏面有兩張表:一張是全局描述子表(GDT),一張是局部描述子表(LDT).在CPU裏面也有兩個寄存器保存這兩張表的地址和大小以便CPU可以尋找到描述子.這兩個寄存器分別是全局描述子表寄存器(GDTR)和局部描述子表寄存器(LDTR).配置GDT、LDT以及GDTR、LDTR是操作系統的責任.這些會在啓動過程的前期完成,甚至比CPU切換到保護模式的時間還要早,因爲沒有GDT、LDT以及GDTR、LDTR是沒有辦法進入保護模式的.圖1描述了GDT、LDT以及GDTR、LDTR的關係.

Since there are two segment descriptor tables it is not enough to use an index to uniquely select a segment descriptor. A bit that identifies in which of the two tables the segment descriptor resides is necessary. The index combined with the table indicator bit is called a segment selector. The segment selector format is displayed below.

因爲這裏有兩個段描述子表,所以光用一個序號不足以表示一個唯一的段描述子.還需要用一位來表示段描述子是在哪個描述子表裏面.序號與這個表指示位組合在一起,被稱作段選擇子.段選擇子的結構如下所示:

As can be seen in figure 2 above, the segment selector also contains a two-bit field called the Requestor Privilege Level (RPL). These bits take part in the check that decides whether a certain piece of code may access the code segment descriptor that the selector points to. For instance, if code that runs at privilege level 3 (user mode) tries to jump to or call code in a code segment whose descriptor only admits code running at privilege level 0, a general protection exception occurs. This is how the x86 CPU makes sure that no ring 3 (user-mode) code can get direct access to ring 0 (kernel-mode) code. The truth is slightly more complicated than this; for the details of the RPL field, see "Protected Mode Software Architecture" in the further reading list. For our purposes it is enough to know that the RPL field is used in the privilege checks applied to code that uses a segment selector to reach a segment descriptor.

從圖2中可以看出,段選擇子還包括了長爲兩位的被稱作請求者權限的域(RPL),RPL用於判斷某個代碼是否能夠調用所描述的代碼段裏的代碼.舉例說明,如果在Ring 3執行的一段代碼試圖跳進或者調用進一個被段描述子裏面RPL指定爲ring 0的代碼段裏面的代碼,那麼會導致一般性異常的發生.這就是x86系列CPU保證ring 3的代碼無法進入ring 0的代碼的方法.實際上的做法要比這裏的複雜些.如果希望得到更多的信息請參考進一步的讀物<Protected Mode Software Architecture>裏面關於RPL的內容,當前來說知道RPL用作權限訪問保護的就可以了.
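As a minimal sketch of the selector layout described above, the following Python splits a 16-bit selector into its three fields (bits 0-1 are the RPL, bit 2 is the table indicator, bits 3-15 are the index) and computes where the 8-byte descriptor lives in its table. The function and type names are made up for this illustration; the example values (selector 0x0008, GDT base 0x80036000) are taken from the debugger dump in the Experiment section.

```python
from typing import NamedTuple


class Selector(NamedTuple):
    index: int  # which entry in the descriptor table
    table: str  # "GDT" or "LDT", from the table indicator (TI) bit
    rpl: int    # requestor privilege level, 0..3


def decode_selector(value: int) -> Selector:
    """Split a 16-bit segment selector into index, table and RPL."""
    return Selector(
        index=(value >> 3) & 0x1FFF,
        table="LDT" if value & 0x4 else "GDT",
        rpl=value & 0x3,
    )


def descriptor_address(table_base: int, index: int) -> int:
    """Each descriptor is 8 bytes, so entry N sits at base + N * 8."""
    return table_base + index * 8


sel = decode_selector(0x0008)
addr = descriptor_address(0x80036000, sel.index)
```

With these inputs the selector decodes to index 1 in the GDT with RPL 0, and the descriptor address works out to 0x80036008, which agrees with the "Descriptor @" line of the dump.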

Interrupt gates(中斷門)

So if application code running in user-mode (at privilege level 3) cannot call code running in kernel-mode (at privilege level 0) how do system calls in Windows NT work? The answer again is that they use features of the CPU. In order to control transitions between code executing at different privilege levels, Windows NT uses a feature of the x86 CPU called an interrupt gate. In order to understand interrupt gates we must first understand how interrupts are used in an x86 CPU executing in protected mode.

所以在Ring 3運行的代碼無法調用Ring 0運行的代碼.那麼Windows NT的系統調用如何進行吶?答案還是利用CPU的特性,爲了在不同權限級別的代碼之間切換,Windows NT使用了x86系列CPU的Interrupt gates特性.爲了理解Interrupt gates我們首先要理解在x86 CPU的保護模式下面如何使用中斷.

Like most other CPUs, the x86 CPU has an interrupt vector table that contains information about how each interrupt should be handled. In real-mode, the x86 CPU's interrupt vector table simply contains pointers (4 byte values) to the Interrupt Service Routines that will handle the interrupts. In protected-mode, however, the interrupt vector table contains Interrupt Gate Descriptors, which are 8 byte data structures that describe how the interrupt should be handled. An Interrupt Gate Descriptor contains information about which code segment the Interrupt Service Routine (ISR) resides in and where in that code segment the ISR starts. The reason for having an Interrupt Gate Descriptor instead of a simple pointer in the interrupt vector table is the requirement that code executing in user-mode cannot directly call into kernel-mode. By checking the privilege level in the Interrupt Gate Descriptor, the CPU can verify that the calling application is allowed to enter the protected code, and only at well-defined locations (this is the reason for the name "Interrupt Gate", i.e. it is a well-defined gate through which user-mode code can transfer control to kernel-mode code).

和其它大多數CPU類似,x86 CPU也有中斷向量表來描述每個中斷應該如何處理.在實模式下面,x86 CPU的中斷向量表簡單的包含了一個用於處理中斷的中斷服務函數(ISR)的地址(4字節).但是在保護模式下面,中斷向量表裏面包含的是Interrupt Gate描述子(8字節)來描述如何處理對應中斷.一個Interrupt Gate描述子包含了ISR在哪個代碼段裏面以及ISR在代碼段裏面的開始地址.使用Interrupt Gate描述子來替代簡單的ISR地址是爲了保證Ring 3的代碼不能直接僭越調用Ring 0的代碼.通過檢查Interrupt Gate描述子裏面的執行權限CPU可以確認程序是否可以通過特定位置的代碼來調用被保護的代碼(這也是被稱爲Interrupt Gate的原因,也就是說它是一個具有良好行爲定義的可以控制用戶模式向內核模式轉換的門).

The Interrupt Gate Descriptor contains a Segment Selector which uniquely defines the Code Segment Descriptor that describes the code segment that contains the Interrupt Service Routine. In the case of our Windows NT system call, the segment selector points to a Code Segment Descriptor in the Global Descriptor Table. The Global Descriptor Table contains all Segment Descriptors that are "global", i.e. that are not associated with any particular process running in the system (in other words, the GDT contains Segment Descriptors that describe operating system code and data segments). See figure 3 below for the relationship between the Interrupt Descriptor Table Entry associated with the 'int 2e' instruction, the Global Descriptor Table Entry and the Interrupt Service Routine in the target code segment.

Interrupt Gate描述子裏面包含一個段選擇子指向該ISR所在段的段描述子.在Windows NT的系統調用的情況下,段選擇子指向的段描述子是在GDT裏面.GDT裏面包含所有的"全局"段描述子,也就是說,不是隻與某個單獨進程相關(換句話說,GDT裏面存放的是描述系統的代碼以及數據段的段描述子).下面的圖3說明了包含"int 2e"命令的中斷描述表項(IDT)與對應的ISR所在的GDT表項的關係.

Back to the NT system call

Now that we have covered the background material, we are ready to describe exactly how a Windows NT system call finds its way from user-mode into kernel-mode. System calls in Windows NT are initiated by executing an "int 2e" instruction. The 'int' instruction causes the CPU to execute a software interrupt, i.e. it will go into the Interrupt Descriptor Table at index 2e and read the Interrupt Gate Descriptor at that location. The Interrupt Gate Descriptor contains the Segment Selector of the Code Segment that contains the Interrupt Service Routine (the ISR). It also contains the offset to the ISR within the target code segment. The CPU will use the Segment Selector in the Interrupt Gate Descriptor to index into the GDT or LDT (depending on the TI-bit in the segment selector). Once the CPU has located the target segment descriptor, it loads the information from the segment descriptor into the CPU. It also loads the EIP register from the Offset in the Interrupt Gate Descriptor. At this point the CPU is almost set up to start executing the ISR code in the kernel-mode code segment.

介紹完有關背景知識之後,我們來探討Windows NT的系統調用如何從Ring 3進去Ring 0.Windows NT的系統調用是由"int 2e"指令的執行引發的."int"指令導致CPU執行一個軟件中斷,也就是,它將會從IDT裏面索引爲2e的項中讀取出Interrupt Gate描述子.Interrupt Gate描述子包含了ISR所在段的段描述子,也包含了ISR在段裏面的偏移.CPU將會使用此段描述子從GDT或者LDT(根據TI域)裏面獲取信息.一旦CPU獲取了目標段描述子的信息後,將會把目標段描述子信息信息加載進CPU.CPU會根據Interrupt Gate描述子裏面的偏移設置EIP寄存器.這樣CPU就準備好了執行在RIng 0代碼段裏的ISR(土星按:中斷和調用不一樣,中斷不會受段選擇子裏面的RPL限制,而調用要受限制).

The CPU switches automatically to the kernel-mode stack

Before the CPU starts to execute the ISR in the kernel-mode code segment, it needs to switch to the kernel-mode stack. The reason is that kernel-mode code cannot trust the user-mode stack to be valid or to have enough room. For instance, malicious user-mode code could set its stack pointer to point to invalid memory, execute an 'int 2e' instruction, and thereby crash the system when the kernel-mode functions use the invalid stack pointer. Each privilege level in the x86 protected-mode environment therefore has its own stack. When control is transferred to a more privileged level through an interrupt gate descriptor as described above, the CPU automatically saves the user-mode program's SS, ESP, EFLAGS, CS and EIP registers on the kernel-mode stack. The Windows NT system service dispatcher function (KiSystemService) needs access to the parameters that the user-mode code pushed onto its stack before it executed 'int 2e'. By convention, the user-mode code must set up the EDX register to contain a pointer to the parameters on the user-mode stack before executing the 'int 2e' instruction. KiSystemService can then simply copy as many arguments as the called system function needs from the user-mode stack to the kernel-mode stack before calling the system function. See figure 4 below for an illustration of this.

在CPU執行Ring0代碼段裏面的ISR之前,還需要切換到Ring0的堆棧.這樣做的原因是Ring0的代碼不能確定Ring3的堆棧是否可以提供足夠的空間來運行.比如某些惡意的Ring3代碼可能會修改自己的堆棧指針使之指向無效地址,那麼執行"int 2e"指令時使用無效指針的Ring0代碼會導致系統崩潰.因此在x86保護模式下面每個權限都有自己的堆棧.當像上所述的通過Interrupt Gate描述子調用更高權限級別的代碼時,CPU會自動保存Ring3的SS、ESP、EFLAGS、CS以及EIP寄存器到Ring0的堆棧裏面去.在Windows NT系統裏系統服務分發函數(KiSystemService)還需要能夠取得"int 2e"調用前推入堆棧的參數.按照約定,Ring3必須在執行"int 2e"指令之前把指向Ring3堆棧裏面的參數的指針放置到EDX裏面去.KiSystemService這樣就可以簡單的在系統服務調用之前根據被調用的系統服務所需要的參數個數把參數從Ring3的堆棧裏面拷貝到Ring0的堆棧裏面去.可以參考圖4.
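The stack switch and the argument copy can be modelled roughly as follows. This is a toy simulation in Python, not kernel code: the stack is a list, user memory is a dictionary, and every register value and address is a hypothetical number chosen for illustration. The only facts taken from the text are the five registers the CPU saves, the order in which it saves them, and the dispatcher copying arguments through a pointer that user-mode code supplies in a register.

```python
# Order in which the CPU pushes the user-mode context when an interrupt
# gate leads to a more privileged level; EIP ends up on top of the stack.
FRAME_ORDER = ("SS", "ESP", "EFLAGS", "CS", "EIP")


def enter_kernel(user_regs: dict, kernel_stack: list) -> None:
    """Model the CPU saving the user-mode context on the kernel stack."""
    for name in FRAME_ORDER:
        kernel_stack.append((name, user_regs[name]))


def copy_arguments(user_memory: dict, arg_pointer: int, count: int) -> list:
    """Model the dispatcher copying `count` 4-byte arguments from the
    user-mode stack, starting at the pointer passed in a register."""
    return [user_memory[arg_pointer + 4 * i] for i in range(count)]


# Hypothetical register contents and user-stack contents.
user_regs = {
    "SS": 0x23, "ESP": 0x0012FF00, "EFLAGS": 0x202,
    "CS": 0x1B, "EIP": 0x77F81234,
}
user_memory = {
    0x0012FF04: 0x11111111,
    0x0012FF08: 0x22222222,
    0x0012FF0C: 0x33333333,
}

kernel_stack = []
enter_kernel(user_regs, kernel_stack)
args = copy_arguments(user_memory, arg_pointer=0x0012FF04, count=3)
```

After the call, `kernel_stack` holds the five saved registers with EIP on top, and `args` holds the three values read from the user-mode stack. A real dispatcher must also validate the pointer before reading through it, which the sketch leaves out.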

What system call are we calling?

Since all Windows NT system calls use the same 'int 2e' software interrupt to switch into kernel-mode, how does the user-mode code tell the kernel-mode code which system function to execute? The answer is that an index is placed in the EAX register before the 'int 2e' instruction is executed. The kernel-mode ISR looks in the EAX register and calls the specified kernel-mode function if all parameters passed from user-mode appear to be correct. The call parameters (for instance those passed to our OpenFile function) are passed to the kernel-mode function by the ISR.

因爲所有的Windows NT的系統調用都通過"int 2e"軟件中斷切換到內核模式裏面去,那麼系統怎麼知道執行哪個系統函數呢?答案是在"int 2e"執行前EAX寄存器裏面就保存了一個索引.Ring0下面的ISR根據EAX裏面的值調用對應的RIng0函數(如果參數也都正確的話).參數(比如傳給OpenFile的參數)都通過ISR傳給Ring0的函數.
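The dispatch step amounts to a bounds-checked table lookup, which the following Python sketch illustrates. Everything named here is hypothetical: the two service functions, the table layout of (handler, argument count) pairs, and the error marker are invented for the example and do not reflect the real Windows NT service table.

```python
INVALID_SERVICE = "invalid service index"  # hypothetical error marker


def open_thing(name, flags):
    # Hypothetical service taking two arguments.
    return f"opened {name} with flags {flags}"


def close_thing(handle):
    # Hypothetical service taking one argument.
    return f"closed {handle}"


# Each entry pairs a handler with the number of arguments it expects.
SERVICE_TABLE = [
    (open_thing, 2),
    (close_thing, 1),
]


def dispatch(eax: int, user_args: list):
    """Pick the service by the index from EAX and call it with exactly
    as many arguments as that service expects."""
    if not 0 <= eax < len(SERVICE_TABLE):
        return INVALID_SERVICE
    handler, arg_count = SERVICE_TABLE[eax]
    if len(user_args) < arg_count:
        return INVALID_SERVICE
    return handler(*user_args[:arg_count])
```

For example, `dispatch(1, [7])` selects the second entry and returns "closed 7", while an index outside the table is rejected before any handler runs. Keeping the argument count next to the handler is what lets a single dispatcher serve functions with different signatures.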

Returning from the system call

Once the system call has completed, the system service dispatcher executes an IRET instruction, which makes the CPU restore the running program's original registers. IRET pops the saved register values from the kernel-mode stack and causes the CPU to continue execution in the user-mode code at the instruction immediately after 'int 2e'.

當系統調用完成,CPU就自動使用IRET指令恢復之前的寄存器狀態.這個指令會從Ring0的堆棧中推出之前保存的所有值並跳到"int 2e"下面一句繼續運行.

Experiment

By examining the Interrupt Gate Descriptor for entry 2e in the Interrupt Descriptor Table, we can confirm that the CPU finds the Windows NT system service dispatcher routine as described in this article. The code sample for this article contains a debugger extension for the WinDbg kernel-mode debugger that dumps out a descriptor in the GDT, LDT or IDT.

通過對IDT裏面2e項的Interrupt Gate描述子觀察,我們可以肯定CPU在分發系統調用的時候就如本文所說.本文的例子代碼裏面包含了一個WinDbg的擴展,可以導出GDT,LDT以及IDT的描述子(土星按:原文的例子文件損壞了).

The WinDbg debugger extension is a DLL called 'protmode.dll' (Protected Mode). It is loaded into WinDbg by using the following command: ".load protmode.dll" after having copied the DLL into the directory that contains the kdextx86.dll for our target platform. Break into the WinDbg debugger (CTRL-C) once you are connected to your target platform. The syntax for displaying the IDT descriptor for 'int 2e' is "!descriptor IDT 2e". This dumps out the following information:

這個WinDbg的擴展是個名爲'protmode.dll' 的DLL.把此DLL拷貝到目標平臺上包含有kdextx86.dll的目錄下面之後通過WinDbg的命令 ".load protmode.dll"加載.生成"int 2e"的IDT描述的命令是"!descriptor IDT 2e".下面是轉存出來的結果:

kd>!descriptor IDT 2e
------------------- Interrupt Gate Descriptor --------------------
IDT base = 0x80036400, Index =    0x2e, Descriptor @ 0x80036570
80036570 c0 62 08 00 00 ee 46 80
Segment is present, DPL = 3, System segment, 32-bit descriptor
Target code segment selector =    0x0008 (GDT Index = 1, RPL = 0)
Target code segment offset =      0x804662c0
------------------- Code Segment Descriptor --------------------
GDT base = 0x80036000, Index =    0x01, Descriptor @ 0x80036008
80036008 ff ff 00 00 00 9b cf 00
Segment size is in 4KB pages, 32-bit default operand and data size
Segment is present, DPL =         0, Not system segment, Code segment
Segment is not conforming, Segment is readable, Segment is accessed
Target code segment base address =     0x00000000
Target code segment size = 0x000fffff

The 'descriptor' command reveals the following:

The descriptor at index 2e in the IDT is at address 0x80036570.
The raw descriptor data is C0 62 08 00 00 EE 46 80.
This means that:
- The Interrupt Gate Descriptor is marked present (the "Segment is present" line in the dump).
- Code running at privilege level 3 or at any more privileged level can use this Interrupt Gate (DPL = 3). This is what allows user-mode code to execute 'int 2e'.
- The Segment that contains the interrupt handler for our system call (2e) is described by a Segment Descriptor residing at index 1 in the GDT.
- The KiSystemService starts at offset 0x804662c0 within the target segment, as shown in the "Target code segment offset" line of the dump.

這個"descriptor "指令揭示了以下含義

IDT的2e項位於地址0x80036570
Interrupt Gate描述子的原始數據爲C0 62 08 00 00 EE 46 80.
含義如下
- 段選擇子所指向的段已經存在(土星按:加載在內存中)
- 在Ring3運行的代碼可以獲得這個Interrupt Gate
- 該中斷的處理函數所在段的段描述子位於在GDT裏的1項
- KiSystemService從該段的偏移0x804662c0開始執行
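The interpretation above can be checked by decoding the eight raw bytes by hand. The following Python is a small sketch of that decoding, using the standard x86 layout of a 32-bit interrupt gate (offset bits 0-15, selector, a reserved byte, an access byte, offset bits 16-31); the function name and the dictionary keys are made up for this illustration, while the bytes and the IDT base are copied from the dump.

```python
import struct


def decode_interrupt_gate(raw: bytes) -> dict:
    """Decode an 8-byte protected-mode interrupt gate descriptor."""
    offset_low, selector, _reserved, access, offset_high = struct.unpack(
        "<HHBBH", raw
    )
    return {
        "offset": (offset_high << 16) | offset_low,  # entry point of the ISR
        "selector": selector,                        # target code segment
        "present": bool(access & 0x80),              # P bit
        "dpl": (access >> 5) & 0x3,                  # privilege needed to use the gate
        "type": access & 0x1F,                       # 0x0E = 32-bit interrupt gate
    }


IDT_BASE = 0x80036400
entry_address = IDT_BASE + 0x2E * 8  # each IDT entry is 8 bytes

gate = decode_interrupt_gate(bytes.fromhex("c0 62 08 00 00 ee 46 80"))
```

Decoding gives offset 0x804662c0, selector 0x0008, present, DPL 3 and gate type 0x0E, and the computed entry address is 0x80036570. All of these agree with the corresponding lines of the debugger output.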

The "!descriptor IDT 2e" command also dumps out the target code segment descriptor at index 1 in the GDT. This is an explanation of the data dumped from the GDT descriptor:

The Code Segment Descriptor at index 1 in the GDT is at address 0x80036008.
The raw descriptor data is FF FF 00 00 00 9B CF 00.
This means that:
- The size is in 4KB pages. This means that the size field (0x000fffff) counts 4096-byte pages rather than bytes, so the actual size of the segment is (0x000fffff + 1) multiplied by the virtual memory page size (4096 bytes). This yields 4GB, which happens to be the size of the full address space that can be accessed from kernel-mode. In other words, the whole 4GB address space is described by this segment descriptor. This is the reason kernel-mode code can access any address in user-mode as well as in kernel-mode.
- The segment is a kernel-mode segment (DPL=0).
- The segment is not conforming. See further reading, "Protected Mode Software Architecture" for a full discussion of this field.
- The segment is readable. This means that code can read from the segment. This is used for memory protection. See further reading, "Protected Mode Software Architecture" for a full discussion of this field.
- The segment has been accessed. See further reading, "Protected Mode Software Architecture" for a full discussion of this field.

"!descriptor IDT 2e" 命令也轉存出了GDT中1項的中的段描述子.下面是對數據的解釋:

GDT中1項所在地址爲0x80036008.
段描述子原始數據是FF FF 00 00 00 9B CF 00.
含義如下
- 頁面大小是4KB.也就是說大小(0x000fffff)需要乘上頁面大小 (4096 bytes) 來得到描述的段大小. 這也是4GB大小,正好是內核模式下面所能訪問的最大地址. 換句話說,整個4GB空間都被此段描述子包含.這也就是Ring0下面可以訪問任意地址--包括Ring3以及Ring0--的原因.
- 此段是Ring0段(DPL=0).
- 此段不一致(Conforming?).參考 "Protected Mode Software Architecture".
- 此段可讀.也就是說代碼可以讀此段,這個用於內存保護.參見"Protected Mode Software Architecture".
- 此段已經被訪問過了.參見"Protected Mode Software Architecture".
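The same hand-decoding works for the code segment descriptor. The sketch below follows the standard x86 segment descriptor layout (limit bits 0-15, base bits 0-23, an access byte, a byte holding the flags and limit bits 16-19, base bits 24-31); the function name and the dictionary keys are assumptions made for the example, and the raw bytes are copied from the dump.

```python
import struct


def decode_segment_descriptor(raw: bytes) -> dict:
    """Decode an 8-byte code/data segment descriptor."""
    limit_low, base_low, base_mid, access, flags_limit, base_high = struct.unpack(
        "<HHBBBB", raw
    )
    limit = ((flags_limit & 0x0F) << 16) | limit_low
    page_granular = bool(flags_limit & 0x80)  # G bit: limit counts 4KB pages
    return {
        "base": (base_high << 24) | (base_mid << 16) | base_low,
        "limit": limit,
        "size_bytes": (limit + 1) * (4096 if page_granular else 1),
        "default_32bit": bool(flags_limit & 0x40),  # D bit
        "present": bool(access & 0x80),
        "dpl": (access >> 5) & 0x3,
        "is_code": bool(access & 0x10) and bool(access & 0x08),
        "conforming": bool(access & 0x04),
        "readable": bool(access & 0x02),
        "accessed": bool(access & 0x01),
    }


seg = decode_segment_descriptor(bytes.fromhex("ff ff 00 00 00 9b cf 00"))
```

The result is a present, ring 0, non-conforming, readable, accessed 32-bit code segment with base 0, limit 0xfffff and a size of 0x100000000 bytes, i.e. the full 4GB address space mentioned above.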

土星按:後面的就不翻譯了,主要內容已經介紹完畢

To build the ProtMode.dll WinDbg debugger extension DLL, open the project in Visual Studio 6.0 and click build. For an introduction of how to create debugger extensions like ProtMode.dll, see the SDK that comes with the "Debugging Tools for Windows" which is a free download from Microsoft.
