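This article examines how the Linux kernel implements task switching on the AArch64 platform. The main debugging tools are gdb and a QEMU virtual machine, and the kernel being debugged is version 5.3.12. The motivation came from mutex operations I ran into at work: when a process or kernel thread tries to lock a mutex that is already held, it is suspended and enters a blocked state. That triggers a task switch: the kernel calls the schedule function in kernel/sched/core.c and, step by step, switches to another kernel task.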
1. The 64-bit ARM calling convention: Procedure Call Standard for the ARM 64-bit Architecture
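This standard specifies the rules that caller and callee must follow when a procedure call takes place: parameter passing, return values, register preservation, and so on. The latest version of the document is available on GitHub. What matters here is the preservation rule for the general-purpose registers: a callee must preserve x19 to x28, the frame pointer x29, and the stack pointer, while x30 is the link register that holds the return address.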
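Why does task switching need to follow this calling convention? The most important part of a task switch is saving and restoring the context of a process or kernel thread, that is, the context switch; the inline function context_switch defined in kernel/sched/core.c hints at this. A task switch differs from the interrupt handling commonly met in kernel driver development: an interrupt is asynchronous and passive, and the location of the interrupted code is not fixed, whereas a task switch is initiated actively and follows a specific call path. The Linux kernel implements the context switch as an assembly function, so from the compiler's point of view it is an ordinary function call, and it naturally has to obey the Procedure Call Standard described above.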
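To make the register rule concrete, here is a minimal sketch that classifies x0 to x30 by their AAPCS64 role. The function names are made up for illustration, and the role strings are my own summary of the standard, not text quoted from it.

```python
# Sketch: AAPCS64 roles of the AArch64 general-purpose registers.
# register_role() and callee_preserved() are hypothetical helper names.

def register_role(n: int) -> str:
    """Return a short description of the AAPCS64 role of register x<n>."""
    if not 0 <= n <= 30:
        raise ValueError("general-purpose registers are x0..x30")
    if n <= 7:
        return "argument/result (caller-saved)"
    if n == 8:
        return "indirect result location (caller-saved)"
    if n <= 15:
        return "temporary (caller-saved)"
    if n <= 17:
        return "intra-procedure-call scratch (IP0/IP1)"
    if n == 18:
        return "platform register"
    if n <= 28:
        return "callee-saved"
    if n == 29:
        return "frame pointer (callee-saved)"
    return "link register"


def callee_preserved() -> list:
    """Registers that a callee must hand back unchanged."""
    return [f"x{n}" for n in range(31) if "callee-saved" in register_role(n)]


if __name__ == "__main__":
    for n in (0, 8, 9, 18, 19, 29, 30):
        print(f"x{n}: {register_role(n)}")
    print("callee must preserve:", ", ".join(callee_preserved()))
```

Everything outside the callee-preserved set may be clobbered by any function call, so the compiler already assumes those registers are dead after calling the context-switch routine; only the preserved set (plus the return address and stack pointer) has to travel with the task.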
2. The core of task switching
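The assembly function cpu_switch_to, defined in arch/arm64/kernel/entry.S, is a concise implementation of the calling convention described above (its disassembly appears in the debugging session in section 3). It first saves the old task's x19 to x30 registers together with the stack pointer, and then restores the values saved earlier for the new task.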
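First, THREAD_CPU_CONTEXT is a macro, and the header file that defines it is generated dynamically when Linux is built. Its value is the byte offset of the thread_struct structure within task_struct (see the definition of task_struct in include/linux/sched.h); in the kernel debugged here the disassembly shows it to be 0x7d0 (2000). Adding x0 and x10 and storing the result in x8 therefore makes x8 point at the AArch64 thread_struct of the task passed in x0.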
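The address arithmetic can be checked with a small model. The sketch below assumes that thread_struct starts with a cpu_context made of thirteen 64-bit slots (x19 to x28, fp, sp, pc), which is how I recall struct cpu_context in the arm64 headers; treat the field names as an assumption. The offset 0x7d0 and the task_struct address are taken from the gdb session in section 3.

```python
import ctypes

# Assumed layout of the arm64 struct cpu_context: thirteen 64-bit slots.
class CpuContext(ctypes.Structure):
    _fields_ = [(f"x{n}", ctypes.c_uint64) for n in range(19, 29)] + [
        ("fp", ctypes.c_uint64),   # x29
        ("sp", ctypes.c_uint64),
        ("pc", ctypes.c_uint64),   # x30, the resume address
    ]

# Value loaded into x10 by "mov x10, #0x7d0" in the disassembly.
THREAD_CPU_CONTEXT = 0x7D0


def slot_address(task_struct_addr: int, field: str) -> int:
    """Address of one saved register inside a task's cpu_context."""
    return task_struct_addr + THREAD_CPU_CONTEXT + getattr(CpuContext, field).offset


def store_offsets() -> list:
    """Offsets written by six 'stp ..., [x8], #16' followed by 'str x30, [x8]'."""
    offsets, cursor = [], 0
    for _ in range(6):          # each stp stores 16 bytes, then post-increments x8
        offsets.append(cursor)
        cursor += 16
    offsets.append(cursor)      # final str x30 lands in the last slot
    return offsets


if __name__ == "__main__":
    prev = 0xFFFFFFC00E880C00   # 'prev' from the first backtrace
    print(hex(slot_address(prev, "x19")))
    print(store_offsets(), ctypes.sizeof(CpuContext))
```

Under these assumptions the seven store instructions fill the structure exactly: the last write, at offset 96, is the pc slot, and the whole structure is 104 bytes.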
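The first member of thread_struct is cpu_context, so the stp/str instructions that follow store exactly the general-purpose registers that the 64-bit ARM calling convention requires to be preserved. One question about this code: why is the stack pointer sp copied into x9 first, instead of simply executing stp x29, sp, [x8], #16? The reason is that the ARM assembler cannot generate such an instruction: in the data-register fields of stp, register number 31 encodes the zero register xzr rather than sp.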
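The encoding constraint can be seen by building the instruction word by hand. The following is a sketch of the 64-bit post-indexed STP encoding as I understand it from the Arm architecture manual (base opcode 0xA8800000, imm7 scaled by 8); the helper names are hypothetical.

```python
# Sketch: encode/decode "stp Xt, Xt2, [Xn|SP], #imm" (64-bit, post-index).
STP_POST_INDEX_64 = 0xA8800000


def encode_stp_post(rt: int, rt2: int, rn: int, imm: int) -> int:
    """Build the 32-bit instruction word from register numbers and offset."""
    if imm % 8 or not -512 <= imm <= 504:
        raise ValueError("imm must be a multiple of 8 in [-512, 504]")
    for r in (rt, rt2, rn):
        if not 0 <= r <= 31:
            raise ValueError("register number must be in 0..31")
    imm7 = (imm // 8) & 0x7F
    return STP_POST_INDEX_64 | (imm7 << 15) | (rt2 << 10) | (rn << 5) | rt


def data_reg_name(n: int) -> str:
    """Rt/Rt2 fields: number 31 means the zero register."""
    return "xzr" if n == 31 else f"x{n}"


def base_reg_name(n: int) -> str:
    """Rn field: number 31 means the stack pointer."""
    return "sp" if n == 31 else f"x{n}"


def decode_stp_post(word: int) -> str:
    """Render an instruction word produced by encode_stp_post()."""
    rt, rn, rt2 = word & 0x1F, (word >> 5) & 0x1F, (word >> 10) & 0x1F
    imm7 = (word >> 15) & 0x7F
    if imm7 & 0x40:             # sign-extend the 7-bit immediate
        imm7 -= 0x80
    return (f"stp {data_reg_name(rt)}, {data_reg_name(rt2)}, "
            f"[{base_reg_name(rn)}], #{imm7 * 8}")


if __name__ == "__main__":
    print(decode_stp_post(encode_stp_post(29, 9, 8, 16)))
    print(decode_stp_post(encode_stp_post(29, 31, 8, 16)))
```

Putting 31 into the Rt2 field yields a store of xzr, not of sp, so there is simply no bit pattern for "stp x29, sp, ...". Copying sp into a scratch register such as x9 first is the straightforward way around this.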
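The subsequent ldp/ldr instructions restore the new task's context into the registers named by the calling convention, and they are just as clean. Finally, x1 is written to the sp_el0 register. This is done because the Linux kernel uses sp_el0 to obtain, quickly and conveniently, the task_struct pointer of the current process or kernel thread.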
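The save/restore sequence can be mimicked with a toy model. This is a simplified simulation, not kernel code: Task, Cpu and current() are hypothetical names, and the register file only contains the registers that cpu_switch_to touches.

```python
from dataclasses import dataclass, field
from typing import Optional

# Registers saved and restored by cpu_switch_to, in store order.
SWITCHED_REGS = [f"x{n}" for n in range(19, 30)] + ["sp", "x30"]


def _zeroed() -> dict:
    return dict.fromkeys(SWITCHED_REGS, 0)


@dataclass
class Task:
    name: str
    cpu_context: dict = field(default_factory=_zeroed)


@dataclass
class Cpu:
    regs: dict = field(default_factory=_zeroed)
    sp_el0: Optional[Task] = None       # stands in for the task_struct pointer


def cpu_switch_to(cpu: Cpu, prev: Task, next_task: Task) -> None:
    prev.cpu_context = dict(cpu.regs)         # stp/str: save the old task
    cpu.regs = dict(next_task.cpu_context)    # ldp/ldr and mov sp, x9
    cpu.sp_el0 = next_task                    # msr sp_el0, x1


def current(cpu: Cpu) -> Optional[Task]:
    """Toy counterpart of reading sp_el0 to find the running task."""
    return cpu.sp_el0


if __name__ == "__main__":
    idle, worker = Task("idle"), Task("kworker")
    worker.cpu_context["x30"] = 0xFFFFFF80100878DC   # lr seen in the gdb session
    cpu = Cpu(sp_el0=idle)
    cpu.regs["x19"] = 0x1234                         # some live value in idle
    cpu_switch_to(cpu, idle, worker)
    print(current(cpu).name, hex(cpu.regs["x30"]), hex(idle.cpu_context["x19"]))
```

After the call, the return address in x30 belongs to the new task, so the final ret "returns" into whatever code that task was executing when it was last switched out.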
3. Debugging a kernel task switch
With gdb and QEMU it is easy to debug the Linux kernel. The debugging session is listed below:
(gdb) info address cpu_switch_to
Symbol "cpu_switch_to" is at 0xffffff801008552c in a file compiled without debugging.
(gdb) disassemble cpu_switch_to
Dump of assembler code for function cpu_switch_to:
0xffffff801008552c <+0>: mov x10, #0x7d0 // #2000
0xffffff8010085530 <+4>: add x8, x0, x10
0xffffff8010085534 <+8>: mov x9, sp
0xffffff8010085538 <+12>: stp x19, x20, [x8], #16
0xffffff801008553c <+16>: stp x21, x22, [x8], #16
0xffffff8010085540 <+20>: stp x23, x24, [x8], #16
0xffffff8010085544 <+24>: stp x25, x26, [x8], #16
0xffffff8010085548 <+28>: stp x27, x28, [x8], #16
0xffffff801008554c <+32>: stp x29, x9, [x8], #16
0xffffff8010085550 <+36>: str x30, [x8]
0xffffff8010085554 <+40>: add x8, x1, x10
0xffffff8010085558 <+44>: ldp x19, x20, [x8], #16
0xffffff801008555c <+48>: ldp x21, x22, [x8], #16
0xffffff8010085560 <+52>: ldp x23, x24, [x8], #16
0xffffff8010085564 <+56>: ldp x25, x26, [x8], #16
0xffffff8010085568 <+60>: ldp x27, x28, [x8], #16
0xffffff801008556c <+64>: ldp x29, x9, [x8], #16
0xffffff8010085570 <+68>: ldr x30, [x8]
0xffffff8010085574 <+72>: mov sp, x9
0xffffff8010085578 <+76>: msr sp_el0, x1
0xffffff801008557c <+80>: ret
End of assembler dump.
(gdb) break *0xffffff801008552c
Breakpoint 1 at 0xffffff801008552c: file arch/arm64/kernel/entry.S, line 1138.
(gdb) break *0xffffff8010085578
Breakpoint 2 at 0xffffff8010085578: file arch/arm64/kernel/entry.S, line 1157.
(gdb) c
Continuing.
[Switching to Thread 1.2]
Thread 2 hit Breakpoint 1, cpu_switch_to () at arch/arm64/kernel/entry.S:1138
1138 mov x10, #THREAD_CPU_CONTEXT
(gdb) bt
#0 cpu_switch_to () at arch/arm64/kernel/entry.S:1138
#1 0xffffff80100878dc in __switch_to (prev=0xffffffc00e880c00, next=0xffffffc00e83c800) at arch/arm64/kernel/process.c:509
#2 0xffffff80105638f8 in context_switch (rf=<optimized out>, next=<optimized out>, prev=<optimized out>, rq=<optimized out>) at kernel/sched/core.c:3254
#3 __schedule (preempt=<optimized out>) at kernel/sched/core.c:3921
#4 0xffffff8010563edc in schedule_idle () at kernel/sched/core.c:4016
#5 0xffffff80100d2604 in do_idle () at kernel/sched/idle.c:288
#6 0xffffff80100d27e4 in cpu_startup_entry (state=CPUHP_AP_ONLINE_IDLE) at kernel/sched/idle.c:355
#7 0xffffff80100944e8 in secondary_start_kernel () at arch/arm64/kernel/smp.c:259
#8 0x0000000000000000 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) c
Continuing.
Thread 2 hit Breakpoint 2, cpu_switch_to () at arch/arm64/kernel/entry.S:1157
1157 msr sp_el0, x1
(gdb) info register lr
lr 0xffffff80100878dc -549486823204
(gdb) break *0xffffff80100878dc
Breakpoint 3 at 0xffffff80100878dc: file arch/arm64/kernel/process.c, line 512.
(gdb) c
Continuing.
Thread 2 hit Breakpoint 3, __switch_to (prev=0xffffffc00e83c800, next=0xffffffc00e880c00) at arch/arm64/kernel/process.c:512
512 }
(gdb) bt
#0 __switch_to (prev=0xffffffc00e83c800, next=0xffffffc00e880c00) at arch/arm64/kernel/process.c:512
#1 0xffffff80105638f8 in context_switch (rf=<optimized out>, next=<optimized out>, prev=<optimized out>, rq=<optimized out>) at kernel/sched/core.c:3254
#2 __schedule (preempt=<optimized out>) at kernel/sched/core.c:3921
#3 0xffffff8010563b70 in schedule () at kernel/sched/core.c:3988
#4 0xffffff80100bee54 in worker_thread (__worker=0xffffffc00e862000) at kernel/workqueue.c:2436
#5 0xffffff80100c571c in kthread (_create=0xffffffc00e855180) at kernel/kthread.c:255
#6 0xffffff8010085590 in ret_from_fork () at arch/arm64/kernel/entry.S:1169
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) i r cpsr
cpsr 0x60000085 1610612869
(gdb) !bitdump 0x60000085
0x60000085 (0x60000085): ->
31 27 23 19 15 11 7 3
0110 0000 0000 0000 0000 0000 1000 0101
28 24 20 16 12 8 4 0
-----------------------------------------------
(gdb) info register SP_EL0
SP_EL0 0xffffffc00e83c800 -274634389504
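In the session above, the two backtrace commands (bt) produce different call chains, which is evidence that the task switch succeeded: different tasks generally have different backtraces. The value of SP_EL0 after the switch, 0xffffffc00e83c800, also matches the next argument of __switch_to in the first backtrace. In addition, bit 7 of the CPSR register (the I bit) is set during the task switch, which shows that IRQs are masked. This delicate context-switch sequence therefore cannot be interrupted by an IRQ, which makes it safer.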
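The bit dump can be decoded programmatically. The sketch below assumes that the value gdb reports as cpsr follows the AArch64 SPSR layout (NZCV in bits 31 to 28, DAIF in bits 9 to 6, mode in bits 4 to 0); decode_pstate is a hypothetical helper name.

```python
# Sketch: decode the "cpsr" value shown by gdb for an AArch64 target.
AARCH64_MODES = {
    0b0000: "EL0t",
    0b0100: "EL1t", 0b0101: "EL1h",
    0b1000: "EL2t", 0b1001: "EL2h",
    0b1100: "EL3t", 0b1101: "EL3h",
}


def decode_pstate(value: int) -> dict:
    """Split a 32-bit PSTATE/SPSR-style value into its main fields."""
    def bit(n: int) -> int:
        return (value >> n) & 1

    if bit(4):                       # M[4] set means AArch32 state
        mode = "AArch32"
    else:
        mode = AARCH64_MODES.get(value & 0xF, "unknown")
    return {
        "N": bit(31), "Z": bit(30), "C": bit(29), "V": bit(28),
        "D": bit(9), "A": bit(8), "I": bit(7), "F": bit(6),
        "mode": mode,
    }


if __name__ == "__main__":
    fields = decode_pstate(0x60000085)   # value from "i r cpsr" above
    print(fields)
    print("IRQ masked:", bool(fields["I"]))
```

For 0x60000085 this reports I set with D, A and F clear, and mode EL1h, i.e. kernel mode using its own stack pointer; only IRQs are masked at that point, not FIQs or SError.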
4. Kernel support for floating-point arithmetic
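In the book Linux Kernel Development, the author lists "No (Easy) Use of Floating Point" as a characteristic of the Linux kernel and advises against adding floating-point code to it. 32-bit ARM processors evolved over many years, and early ARM cores had no floating-point coprocessor, so the kernel did not support floating-point arithmetic on them. On 64-bit ARM processors, however, floating-point support is a standard feature. The 64-bit ARM calling convention also states which floating-point registers must be preserved across a procedure call: the low 64 bits of v8 to v15.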
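The kernel developers took a straightforward approach here: on a task switch, the Linux kernel saves all of the floating-point registers. The work is done by the fpsimd_save assembly macro.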
Debugging confirms this:
(gdb) info address fpsimd_save_state
Symbol "fpsimd_save_state" is at 0xffffff8010086e20 in a file compiled without debugging.
(gdb) break *0xffffff8010086e20
Breakpoint 5 at 0xffffff8010086e20: file arch/arm64/kernel/entry-fpsimd.S, line 20.
(gdb) info address fpsimd_load_state
Symbol "fpsimd_load_state" is at 0xffffff8010086e74 in a file compiled without debugging.
(gdb) break *0xffffff8010086e74
Breakpoint 6 at 0xffffff8010086e74: file arch/arm64/kernel/entry-fpsimd.S, line 30.
(gdb) c
Continuing.
[Switching to Thread 1.1]
Thread 1 hit Breakpoint 5, fpsimd_save_state () at arch/arm64/kernel/entry-fpsimd.S:20
20 fpsimd_save x0, 8
(gdb) bt
#0 fpsimd_save_state () at arch/arm64/kernel/entry-fpsimd.S:20
#1 0xffffff80100859d8 in fpsimd_save () at arch/arm64/kernel/fpsimd.c:310
#2 0xffffff8010086264 in fpsimd_thread_switch (next=0xffffffc00e9cbc00) at arch/arm64/kernel/fpsimd.c:991
#3 0xffffff8010087738 in __switch_to (prev=0xffffffc00e9c1800, next=0xffffffc00e9cbc00) at arch/arm64/kernel/process.c:491
#4 0xffffff80105638f8 in context_switch (rf=<optimized out>, next=<optimized out>, prev=<optimized out>, rq=<optimized out>) at kernel/sched/core.c:3254
#5 __schedule (preempt=<optimized out>) at kernel/sched/core.c:3921
#6 0xffffff8010563b70 in schedule () at kernel/sched/core.c:3988
#7 0xffffff801008c94c in do_notify_resume (regs=0xffffff8010953ec0, thread_flags=2) at arch/arm64/kernel/signal.c:917
#8 0xffffff8010084060 in work_pending () at arch/arm64/kernel/entry.S:979
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) c
Continuing.
Thread 1 hit Breakpoint 5, fpsimd_save_state () at arch/arm64/kernel/entry-fpsimd.S:20
20 fpsimd_save x0, 8
(gdb) bt
#0 fpsimd_save_state () at arch/arm64/kernel/entry-fpsimd.S:20
#1 0xffffff80100859d8 in fpsimd_save () at arch/arm64/kernel/fpsimd.c:310
#2 0xffffff8010086264 in fpsimd_thread_switch (next=0xffffff8010750340 <init_task>) at arch/arm64/kernel/fpsimd.c:991
#3 0xffffff8010087738 in __switch_to (prev=0xffffffc00e9c1800, next=0xffffff8010750340 <init_task>) at arch/arm64/kernel/process.c:491
#4 0xffffff80105638f8 in context_switch (rf=<optimized out>, next=<optimized out>, prev=<optimized out>, rq=<optimized out>) at kernel/sched/core.c:3254
#5 __schedule (preempt=<optimized out>) at kernel/sched/core.c:3921
#6 0xffffff8010563b70 in schedule () at kernel/sched/core.c:3988
#7 0xffffff8010567924 in schedule_hrtimeout_range_clock (expires=<optimized out>, delta=<optimized out>, mode=<optimized out>, clock_id=<optimized out>) at kernel/time/hrtimer.c:1926
#8 0xffffff8010567948 in schedule_hrtimeout_range (expires=<optimized out>, delta=<optimized out>, mode=<optimized out>) at kernel/time/hrtimer.c:1983
#9 0xffffff80101db87c in poll_schedule_timeout (state=<optimized out>, slack=<optimized out>, expires=<optimized out>, pwq=<optimized out>) at fs/select.c:243
#10 do_poll (end_time=<optimized out>, wait=<optimized out>, list=<optimized out>) at fs/select.c:951
#11 do_sys_poll (ufds=<optimized out>, nfds=<optimized out>, end_time=<optimized out>) at fs/select.c:1001
#12 0xffffff80101dc6d4 in __do_sys_ppoll (sigsetsize=<optimized out>, sigmask=<optimized out>, tsp=<optimized out>, nfds=<optimized out>, ufds=<optimized out>) at fs/select.c:1101
#13 __se_sys_ppoll (sigsetsize=<optimized out>, sigmask=<optimized out>, tsp=<optimized out>, nfds=<optimized out>, ufds=<optimized out>) at fs/select.c:1081
#14 __arm64_sys_ppoll (regs=<optimized out>) at fs/select.c:1081
#15 0xffffff80100952e4 in __invoke_syscall (syscall_fn=<optimized out>, regs=<optimized out>) at arch/arm64/kernel/syscall.c:36
#16 invoke_syscall (syscall_table=<optimized out>, sc_nr=<optimized out>, scno=<optimized out>, regs=<optimized out>) at arch/arm64/kernel/syscall.c:48
#17 el0_svc_common (regs=0xffffff8010953ec0, scno=<optimized out>, syscall_table=<optimized out>, sc_nr=<optimized out>) at arch/arm64/kernel/syscall.c:114
#18 0xffffff8010095444 in el0_svc_handler (regs=<optimized out>) at arch/arm64/kernel/syscall.c:160
#19 0xffffff8010084188 in el0_svc () at arch/arm64/kernel/entry.S:1009
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
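It therefore seems reasonable to believe that the kernel on AArch64 can support floating-point arithmetic, although this still has to be verified in practice. What is puzzling is that fpsimd_load_state was never called during a task switch to restore the new task's floating-point registers, even though a breakpoint was set on it. Perhaps the restore is not performed by fpsimd_load_state at this point but by some other function with a similar purpose; I leave that for later analysis.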