A Brief Analysis of the PPC Device Tree Mechanism

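At the end of the year I finished porting our company's device from ARM to PPC. There is a lot to summarize, and since things are quiet after the holidays, I am writing it down.
This is my first time working with a PPC kernel (version 3.4.55), and there is much I have not studied in depth. I will only sketch the overall framework without digging into details; corrections are welcome.
First up is the PPC Device Tree mechanism. When I previously ported U-Boot and the kernel on ARM, the parameter-passing mechanism between U-Boot and the kernel was selectable: either tags or FDT (flattened device tree). I chose tags and summarized that mechanism in another article: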
http://blog.csdn.net/skyflying2012/article/details/35787971
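After reading the PPC kernel startup code, however, I found that the PPC kernel only supports FDT for passing boot parameters, so I took the opportunity to learn how FDT works.
1 Why use FDT, and what are its advantages?
The official explanation I found online is as follows:
Servers from IBM, Sun and other vendors originally used firmware (a program embedded in a hardware device that provides the interface between software and hardware) to initialize the system configuration, provide the interface between the operating system and the hardware, and boot and run the system. Later, for the sake of standardization and compatibility, IBM, Sun and others jointly introduced the IEEE 1275 firmware interface standard, so that their machines, such as IBM PowerPC pSeries, Apple PowerPC and Sun SPARC, all use Open Firmware, which builds a device tree describing the system hardware at run time and passes it to the kernel to boot the system. The benefits are that the kernel depends less heavily on the system hardware, board support packages can be developed faster, the cost of hardware changes is reduced, and fewer demands are placed on kernel design and compilation.
Embedded PowerPC systems generally use boot code such as U-Boot rather than Open Firmware. Early U-Boot passed basic board information to the kernel through the static data structure bd_t defined in include/asm-ppc/u-boot.h, and the kernel handled the rest. This interface was inflexible: any hardware change required the boot code and the kernel to be customized, rebuilt and reflashed, and it no longer suits current kernels. To keep up with kernel development and the endless variety of embedded PowerPC platforms, and to absorb the advantages of standard Open Firmware, U-Boot introduced the flattened device tree (FDT) as a dynamic interface. A separate FDT blob (binary large object, a container that stores binary data) holds the parameters passed to the kernel. Fixed information such as cache sizes and interrupt routing comes directly from the device tree, while other information such as eTSEC MAC addresses, frequencies and the number of PCI buses is filled in by U-Boot at run time.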

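My understanding is that, to suit flexible embedded platforms, FDT separates the fixed parameters that people need to edit by hand out of U-Boot and the kernel (such as bd_t in U-Boot). After a hardware change there is no need to modify and reflash U-Boot and the kernel; editing the FDT file is enough to support the new hardware. Some dynamic information still has to be handled by U-Boot and the kernel, such as the cmdline and the enumerated USB and PCI device information.
By contrast, with the tags mechanism on ARM, supporting new hardware means modifying the tags in U-Boot (such as the memory size).
2 How to use FDT, and what is its format?
An FDT can be viewed as a linearized tree data structure describing the hardware configuration of a device. Developers write the device tree according to the hardware configuration in a human-readable text form, the dts (device tree source), and then use dtc (the device tree compiler) to compile it into the device tree image the kernel needs, the dtb. dtc checks the syntax and semantics of its input, checks the nodes and properties against the requirements of the Linux kernel, and compiles the source file (.dts) into a binary file (.dtb) so that the kernel can boot properly. A simple example: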

/ {
    #address-cells = <1>;
    #size-cells = <1>;
    model = "test";
    compatible = "test";
    dcr-parent = <&{/cpus/cpu@0}>;

    cpus {
        #address-cells = <1>;
        #size-cells = <0>;

        cpu@0 {
            device_type = "cpu";
            model = "PowerPC,460EX";
            reg = <0x00000000>;
            i-cache-line-size = <32>;
            d-cache-line-size = <32>;
            i-cache-size = <32768>;
            d-cache-size = <32768>;
            dcr-controller;
            dcr-access-method = "native";
        };
    };

    memory {
        device_type = "memory";
        reg = <0x80000000 0x40000000>;
    };

    chosen {
        name = "chosen";
        bootargs = "console=ttyS0,115200 mem=512M rdinit=/sbin/init";
    };
};

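I adapted this from a dts file shipped with the kernel while porting. The kernel already has dts files for many devices under arch/powerpc/boot/dts, and it also integrates the dtc compiler. The dts above is arch/powerpc/boot/dts/test.dts, so I can run the following command in the kernel tree: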

make test.dtb

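This produces the corresponding dtb image.
What developers work with directly is the dts file, so next I describe its format:
(The dts format is explained in detail in many places online, and the kernel ships documentation in Documentation/devicetree/booting-without-of.txt.)
1 Root node
The starting point of the device tree is the root node "/". The model property names the target board platform or module, and the compatible property names the boards of the same family that the target is compatible with. On most 32-bit platforms, #address-cells and #size-cells are both 1; they define the width, in 32-bit cells, of the address and length fields of child nodes.
2 CPU nodes
The /cpus node is a child of the root node, and it has a child node for every CPU in the system. /cpus has no mandatory properties, but specifying #address-cells = <1> and #size-cells = <0> is good practice; it also fixes the format of each CPU node's reg property, which makes it easy to number the physical CPUs. A CPU node's unit name should have the form cpu@0. Such a node normally specifies device_type (always "cpu"), the L1 data/instruction cache line sizes, the L1 data/instruction cache sizes, and the core and bus clock frequencies. In the example above the clock frequency properties are left to be filled in dynamically by the boot code.
3 System memory node
This node describes the range of physical memory on the target board. It is usually called the /memory node, and there may be one or several. When there are several, each name must be followed by a unit address to tell them apart; when there is only one, the unit address may be omitted and defaults to 0.
The node holds the properties of the board's physical memory and normally specifies device_type (always "memory") and reg. The value of reg has the form <start-address size>; in the example above, memory starts at 0x80000000 and is 0x40000000 bytes (1 GiB) in size.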
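
To make the relationship between #address-cells, #size-cells and reg concrete, here is a minimal C sketch that decodes the first <start-address size> pair of a raw reg property. The names read_cells, decode_reg and struct reg_entry are made up for illustration and are not kernel APIs; the sketch assumes each cell is a big-endian 32-bit value and supports at most two cells per field.

```c
#include <stddef.h>
#include <stdint.h>

/* One decoded <base size> pair (hypothetical helper type). */
struct reg_entry {
    uint64_t base;
    uint64_t size;
};

/* Combine `cells` consecutive big-endian 32-bit cells into one value. */
uint64_t read_cells(const uint8_t *p, unsigned cells)
{
    uint64_t v = 0;
    for (unsigned i = 0; i < cells * 4; i++)
        v = (v << 8) | p[i];
    return v;
}

/*
 * Decode the first entry of a raw "reg" property.
 * addr_cells / size_cells come from the PARENT node's
 * #address-cells / #size-cells. Returns 0 on success, -1 on bad input.
 */
int decode_reg(const uint8_t *prop, size_t len,
               unsigned addr_cells, unsigned size_cells,
               struct reg_entry *out)
{
    if (addr_cells > 2 || size_cells > 2)
        return -1;                      /* wider fields do not fit in 64 bits */
    if (len < (size_t)(addr_cells + size_cells) * 4)
        return -1;                      /* property too short */
    out->base = read_cells(prop, addr_cells);
    out->size = read_cells(prop + addr_cells * 4, size_cells);
    return 0;
}
```

With the example tree, the memory node's reg is the eight bytes 80 00 00 00 40 00 00 00, and with both cell counts equal to 1 this decodes to base 0x80000000 and size 0x40000000.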
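4 The /chosen node
This node is a little special. Usually Open Firmware stores variable environment information here, such as arguments and the default input/output devices.
This node usually specifies the bootargs and linux,stdout-path properties. bootargs is the argument string passed to the kernel command line. linux,stdout-path is usually the node path of the standard console device, which the kernel uses as its default console. U-Boot added FDT support in version 1.3.0. After loading the Linux kernel, the ramdisk file system (if used) and the device tree binary image into physical memory, and before starting the kernel, U-Boot modifies the device tree binary: it fills in required information such as MAC addresses and the number of PCI buses, and it also fills in the /chosen node with information such as the serial console and the root device (ramdisk, hard disk or NFS boot). In the U-Boot source, common/cmd_bootm.c calls the ft_setup function to fill in the device tree before executing the kernel.
Most of a dts normally consists of the hardware configuration of the SoC peripherals. In my port, to keep the code that originally depended on the ARM framework (which did not use FDT) unchanged, the module drivers avoid the device tree where possible, so my dts has no peripheral configuration. I will look into this when time permits.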

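3 Storage format of the dtb image
These days when I study code I no longer dig into every detail the way I did fresh out of school. Instead I try to get the big picture and understand the framework first, and study the details when I actually need them. I think this is progress too; it lets me navigate the vast ocean of the kernel a little more calmly.
When studying code, I always try to understand the reason (why it is done this way) and the method (how it is done).
First, let us look at the format of the dtb image that dtc generates from the dts.
1 A device tree blob consists of three main parts: the header, the structure block and the strings block. Their layout in memory is shown in the figure below:

[Figure: memory layout of the device tree blob]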

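The header describes the basic information of the device tree, such as the magic number, the size of the device tree block and the offset of the structure block. Its layout, struct boot_param_header, is shown below. All values in this structure are stored big-endian, and the offsets are relative to the start of the device tree header.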

/*
 * This is what gets passed to the kernel by prom_init or kexec
 *
 * The dt struct contains the device tree structure, full pathes and
 * property contents. The dt strings contain a separate block with just
 * the strings for the property names, and is fully page aligned and
 * self contained in a page, so that it can be kept around by the kernel,
 * each property name appears only once in this page (cheap compression)
 *
 * the mem_rsvmap contains a map of reserved ranges of physical memory,
 * passing it here instead of in the device-tree itself greatly simplifies
 * the job of everybody. It's just a list of u64 pairs (base/size) that
 * ends when size is 0
 */
struct boot_param_header {
    __be32  magic;          /* magic word OF_DT_HEADER */
    __be32  totalsize;      /* total size of DT block */
    __be32  off_dt_struct;      /* offset to structure */
    __be32  off_dt_strings;     /* offset to strings */
    __be32  off_mem_rsvmap;     /* offset to memory reserve map */
    __be32  version;        /* format version */
    __be32  last_comp_version;  /* last compatible version */
    /* version 2 fields below */
    __be32  boot_cpuid_phys;    /* Physical CPU id we're booting on */
    /* version 3 fields below */
    __be32  dt_strings_size;    /* size of the DT strings block */
    /* version 17 fields below */
    __be32  dt_struct_size;     /* size of the DT structure block */
};
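
As an illustration of reading this header from user space, here is a minimal C sketch that validates the magic number and converts a few fields from big-endian. The names be32_at, fdt_parse_header and struct fdt_info are made up for this example; the magic value 0xd00dfeed is the value of OF_DT_HEADER, and the 40-byte size follows from the ten 32-bit fields above. The sanity checks on the offsets are my own assumption about what a cautious parser would want, not a copy of kernel code.

```c
#include <stddef.h>
#include <stdint.h>

#define FDT_MAGIC       0xd00dfeedu   /* value of OF_DT_HEADER */
#define FDT_HEADER_SIZE 40            /* ten big-endian 32-bit fields */

/* Host-endian copy of the header fields used here (hypothetical type). */
struct fdt_info {
    uint32_t totalsize;
    uint32_t off_dt_struct;
    uint32_t off_dt_strings;
    uint32_t off_mem_rsvmap;
    uint32_t version;
};

/* Read a big-endian 32-bit value at byte offset `off`. */
uint32_t be32_at(const uint8_t *blob, size_t off)
{
    const uint8_t *p = blob + off;
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Returns 0 and fills *out if the blob looks like a valid FDT, else -1. */
int fdt_parse_header(const uint8_t *blob, size_t len, struct fdt_info *out)
{
    if (len < FDT_HEADER_SIZE || be32_at(blob, 0) != FDT_MAGIC)
        return -1;

    out->totalsize      = be32_at(blob, 4);
    out->off_dt_struct  = be32_at(blob, 8);
    out->off_dt_strings = be32_at(blob, 12);
    out->off_mem_rsvmap = be32_at(blob, 16);
    out->version        = be32_at(blob, 20);

    /* Offsets are relative to the header start and must stay inside the blob. */
    if (out->totalsize > len ||
        out->off_dt_struct >= out->totalsize ||
        out->off_dt_strings >= out->totalsize)
        return -1;
    return 0;
}
```

Reading the fields byte by byte keeps the sketch independent of the host's endianness, which matters when the dtb is inspected on a little-endian build machine.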

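2 Structure block
The structure block of the flattened device tree is a linearized tree. Together with the strings block it forms the body of the device tree and stores the target board's device information as nodes. In the structure block, a node begins with the 32-bit constant OF_DT_BEGIN_NODE and ends with OF_DT_END_NODE; child nodes are defined before the end marker. The basic structure of a node is:
1. The node start marker OF_DT_BEGIN_NODE (0x00000001);
2. The node path or the node unit name (the full path for versions 1 to 3, only the unit name for version 16 and later);
3. Padding bytes to ensure 4-byte alignment;
4. The node's properties. Each property begins with the constant OF_DT_PROP, followed in order by the byte length of the property value, the offset of the property name in the strings block, the property value, and alignment padding;
5. Child nodes, if any;
6. The node end marker OF_DT_END_NODE (0x00000002).
In summary, a node is a sequence that begins with OF_DT_BEGIN_NODE, continues with the node path, the property list and the child node list, and ends with OF_DT_END_NODE. Every child node has the same structure.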
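
The node layout listed above can be sketched as a small token walker in C. This is only an illustration under stated assumptions: count_nodes and be32_at are made-up names, the block is assumed to use the version 16 and later layout (unit names, 4-byte padding after names and property values), and the token values for NOP (4) and END (9) are the OF_DT_NOP and OF_DT_END constants, which the text above does not list.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum {
    TOK_BEGIN_NODE = 1,   /* OF_DT_BEGIN_NODE */
    TOK_END_NODE   = 2,   /* OF_DT_END_NODE */
    TOK_PROP       = 3,   /* OF_DT_PROP */
    TOK_NOP        = 4,   /* OF_DT_NOP */
    TOK_END        = 9    /* OF_DT_END */
};

uint32_t be32_at(const uint8_t *p, size_t off)
{
    return ((uint32_t)p[off] << 24) | ((uint32_t)p[off + 1] << 16) |
           ((uint32_t)p[off + 2] << 8) | (uint32_t)p[off + 3];
}

/* Walk a structure block; return the number of nodes, or -1 if malformed. */
int count_nodes(const uint8_t *s, size_t len)
{
    size_t off = 0;
    int depth = 0, nodes = 0;

    while (off + 4 <= len) {
        uint32_t tok = be32_at(s, off);
        off += 4;

        switch (tok) {
        case TOK_BEGIN_NODE: {
            /* NUL-terminated name, then padding to a 4-byte boundary */
            const uint8_t *nul = memchr(s + off, '\0', len - off);
            if (!nul)
                return -1;
            size_t name_len = (size_t)(nul - (s + off)) + 1;
            off += (name_len + 3) & ~(size_t)3;
            depth++;
            nodes++;
            break;
        }
        case TOK_PROP: {
            /* value length, name offset, value, padding */
            if (off + 8 > len)
                return -1;
            uint32_t vlen = be32_at(s, off);
            off += 8;
            if (vlen > len - off)
                return -1;
            off += ((size_t)vlen + 3) & ~(size_t)3;
            break;
        }
        case TOK_END_NODE:
            if (--depth < 0)
                return -1;
            break;
        case TOK_NOP:
            break;
        case TOK_END:
            return depth == 0 ? nodes : -1;
        default:
            return -1;
        }
    }
    return -1;   /* ran off the end without seeing the END token */
}
```

Because children sit between a node's properties and its end marker, a simple depth counter is enough to track nesting; no recursion is needed.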
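3 Strings block
To save space, property names, many of which occur repeatedly, are extracted and stored separately in the strings block. This block contains NUL-terminated property name strings. The structure block stores the offsets of these strings, so a property name is easy to look up. The strings block saves storage space, which is often scarce on embedded systems.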
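
A minimal C sketch of the lookup described above: given the name offset stored with a property in the structure block, return the property name from the strings block. fdt_string_at is a made-up helper name, and the bounds checks are an assumption about what a defensive parser would do.

```c
#include <stddef.h>
#include <string.h>

/*
 * Return the NUL-terminated property name stored at `nameoff` in the
 * strings block, or NULL if the offset is out of range or the string
 * is not terminated inside the block.
 */
const char *fdt_string_at(const char *strings, size_t strings_size,
                          size_t nameoff)
{
    if (nameoff >= strings_size)
        return NULL;
    if (!memchr(strings + nameoff, '\0', strings_size - nameoff))
        return NULL;
    return strings + nameoff;
}
```

In the example tree, device_type appears in both the cpu and memory nodes, but its name only needs to be stored once; both properties carry the same offset.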

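4 How the kernel parses the FDT
We compile the dts with dtc to produce the dtb, and the kernel then "disassembles" the dtb to obtain the configuration it contains, so the dtb storage format described above shows up throughout the kernel's parsing code.
The dtb is independent of both the bootloader and the kernel. Normally U-Boot fills in the chosen node and passes the address of the dtb image to the kernel in register r3. In my port, however, the chosen node is filled in by hand and the kernel is not started by U-Boot, so I modified the kernel startup code to hard-code the address of the dtb, as follows: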

/* As with the other PowerPC ports, it is expected that when code
 * execution begins here, the following registers contain valid, yet
 * optional, information:
 *
 *   r3 - Board info structure pointer (DRAM, frequency, MAC address, etc.)
 *   r4 - Starting address of the init RAM disk
 *   r5 - Ending address of the init RAM disk
 *   r6 - Start of kernel command line string (e.g. "mem=128")
 *   r7 - End of kernel command line string
 *
 */
    __HEAD
_ENTRY(_stext);
_ENTRY(_start);
    /*
     * Reserve a word at a fixed location to store the address
     * of abatron_pteptrs
     */
    nop

    /* hard-coded physical address of the device tree blob */
    lis r3, 0x81000000@h
    ori r3, r3, 0x81000000@l

    mr  r31,r3      /* save device tree ptr */
    li  r24,0       /* CPU number */
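
For readers unfamiliar with the lis/ori pair used above, the following C sketch models how a 32-bit constant is assembled from its @h and @l halves on 32-bit PowerPC. The function names are made up for illustration; this is a model of the arithmetic, not kernel code.

```c
#include <stdint.h>

/* Upper 16 bits of a 32-bit address, i.e. the value of addr@h. */
uint16_t ppc_hi(uint32_t addr)
{
    return (uint16_t)(addr >> 16);
}

/* Lower 16 bits of a 32-bit address, i.e. the value of addr@l. */
uint16_t ppc_lo(uint32_t addr)
{
    return (uint16_t)(addr & 0xffffu);
}

/* Models "lis rD, hi" followed by "ori rD, rD, lo" on a 32-bit core. */
uint32_t lis_ori(uint16_t hi, uint16_t lo)
{
    uint32_t r = (uint32_t)hi << 16;   /* lis: load immediate, shifted left 16 */
    r |= lo;                           /* ori: OR in the zero-extended low half */
    return r;
}
```

For the address 0x81000000 the high half is 0x8100 and the low half is 0, so the ori is effectively a no-op here, but keeping both instructions means the address can be changed without touching the instruction sequence.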

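The PPC kernel parses the FDT in two stages:
The first stage is the early parse, which obtains the information the kernel needs in order to boot, such as the cmdline, CPU and memory.
The second stage is the later full parse, which lets drivers fetch their configuration when they are loaded.
Since my port keeps drivers away from the FDT as far as possible, today I mainly analyze the early parse. Before entering start_kernel(), machine_init() is called,
which is in arch/powerpc/kernel/setup_32.c. machine_init() in turn calls early_init_devtree(), in arch/powerpc/kernel/prom.c, to perform the early parse of the device tree. The code is as follows: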

void __init early_init_devtree(void *params)
{
    phys_addr_t limit;

    /* Setup flat device-tree pointer */
    initial_boot_params = params;

#ifdef CONFIG_PPC_RTAS
    /* Some machines might need RTAS info for debugging, grab it now. */
    of_scan_flat_dt(early_init_dt_scan_rtas, NULL);
#endif

#ifdef CONFIG_PPC_POWERNV
    /* Some machines might need OPAL info for debugging, grab it now. */
    of_scan_flat_dt(early_init_dt_scan_opal, NULL);
#endif

#ifdef CONFIG_FA_DUMP
    /* scan tree to see if dump is active during last boot */
    of_scan_flat_dt(early_init_dt_scan_fw_dump, NULL);
#endif

    /* Pre-initialize the cmd_line with the content of boot_commmand_line,
     * which will be empty except when the content of the variable has
     * been overriden by a bootloading mechanism. This happens typically
     * with HAL takeover
     */
    strlcpy(cmd_line, boot_command_line, COMMAND_LINE_SIZE);

    /* Retrieve various informations from the /chosen node of the
     * device-tree, including the platform type, initrd location and
     * size, TCE reserve, and more ...
     */

    of_scan_flat_dt(early_init_dt_scan_chosen_ppc, cmd_line);

    /* Scan memory nodes and rebuild MEMBLOCKs */
    of_scan_flat_dt(early_init_dt_scan_root, NULL);
    of_scan_flat_dt(early_init_dt_scan_memory_ppc, NULL);

    /* Save command line for /proc/cmdline and then parse parameters */
    strlcpy(boot_command_line, cmd_line, COMMAND_LINE_SIZE);
    parse_early_param();

    /* make sure we've parsed cmdline for mem= before this */
    if (memory_limit)
        first_memblock_size = min(first_memblock_size, memory_limit);
    setup_initial_memory_limit(memstart_addr, first_memblock_size);
    /* Reserve MEMBLOCK regions used by kernel, initrd, dt, etc... */
    memblock_reserve(PHYSICAL_START, __pa(klimit) - PHYSICAL_START);
    /* If relocatable, reserve first 32k for interrupt vectors etc. */
    if (PHYSICAL_START > MEMORY_START)
        memblock_reserve(MEMORY_START, 0x8000);
    reserve_kdump_trampoline();
#ifdef CONFIG_FA_DUMP
    /*
     * If we fail to reserve memory for firmware-assisted dump then
     * fallback to kexec based kdump.
     */
    if (fadump_reserve_mem() == 0)
#endif
        reserve_crashkernel();
    early_reserve_mem();

    /*
     * Ensure that total memory size is page-aligned, because otherwise
     * mark_bootmem() gets upset.
     */
    limit = ALIGN(memory_limit ?: memblock_phys_mem_size(), PAGE_SIZE);
    memblock_enforce_memory_limit(limit);

    memblock_allow_resize();
    memblock_dump_all();

    DBG("Phys. mem: %llx\n", memblock_phys_mem_size());

    /* We may need to relocate the flat tree, do it now.
     * FIXME .. and the initrd too? */
    move_device_tree();

    allocate_pacas();

    DBG("Scanning CPUs ...\n");

    /* Retrieve CPU related informations from the flat tree
     * (altivec support, boot CPU ID, ...)
     */
    of_scan_flat_dt(early_init_dt_scan_cpus, NULL);

#if defined(CONFIG_SMP) && defined(CONFIG_PPC64)
    /* We'll later wait for secondaries to check in; there are
     * NCPUS-1 non-boot CPUs  :-)
     */
    spinning_secondaries = boot_cpu_count - 1;
#endif

    DBG(" <- early_init_devtree()\n");
}

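of_scan_flat_dt() is called to traverse all nodes in the dtb, invoking the parsing callbacks early_init_dt_scan_chosen_ppc, early_init_dt_scan_root, early_init_dt_scan_memory_ppc and early_init_dt_scan_cpus, which extract the information in the chosen, root, memory and cpus nodes and complete the early cmdline, memory and CPU setup. Let us look at one of the parsing functions, the generic chosen-node handler early_init_dt_scan_chosen: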

int __init early_init_dt_scan_chosen(unsigned long node, const char *uname,
                     int depth, void *data)
{
    unsigned long l;
    char *p;

    pr_debug("search \"chosen\", depth: %d, uname: %s\n", depth, uname);

    if (depth != 1 || !data ||
        (strcmp(uname, "chosen") != 0 && strcmp(uname, "chosen@0") != 0))
        return 0;

    early_init_dt_check_for_initrd(node);

    /* Retrieve command line */
    p = of_get_flat_dt_prop(node, "bootargs", &l);
    if (p != NULL && l > 0)
        strlcpy(data, p, min((int)l, COMMAND_LINE_SIZE));

    /*
     * CONFIG_CMDLINE is meant to be a default in case nothing else
     * managed to set the command line, unless CONFIG_CMDLINE_FORCE
     * is set in which case we override whatever was found earlier.
     */
#ifdef CONFIG_CMDLINE
#ifndef CONFIG_CMDLINE_FORCE
    if (!((char *)data)[0])
#endif
        strlcpy(data, CONFIG_CMDLINE, COMMAND_LINE_SIZE);
#endif /* CONFIG_CMDLINE */

    pr_debug("Command line is: %s\n", (char*)data);

    /* break now */
    return 1;
}

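The FDT handling functions are mainly in arch/powerpc/kernel/prom.c and drivers/of/fdt.c.

Compared with the tags parsing analyzed in my earlier article, the difference is this:
tags uses registered callbacks: for each tag type encountered, the handler registered for that type is called.
FDT traverses the whole device tree, and each handler checks whether the current node is the one it needs to parse before processing it.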
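
The contrast between the two styles can be sketched in a few lines of C. Everything here is a simplified model with made-up names (struct tag, struct node, dispatch_tag, scan_nodes and so on); it is not the kernel's tags or FDT code. It only shows where the "is this for me?" decision is made: in the dispatcher for tags, and inside each callback for FDT.

```c
#include <stddef.h>
#include <string.h>

/* ---- tags style: the dispatcher picks the handler by tag type ---- */
enum { TAG_MEM = 1, TAG_CMDLINE = 2 };          /* hypothetical tag types */

struct tag { int type; const char *payload; };
struct tag_handler { int type; void (*fn)(const struct tag *, void *); };

void on_cmdline(const struct tag *t, void *data)
{
    *(const char **)data = t->payload;
}

/* Returns 1 if a registered handler matched the tag type, else 0. */
int dispatch_tag(const struct tag *t, const struct tag_handler *tbl,
                 size_t n, void *data)
{
    for (size_t i = 0; i < n; i++) {
        if (tbl[i].type == t->type) {
            tbl[i].fn(t, data);
            return 1;
        }
    }
    return 0;
}

/* ---- FDT style: every node is offered to the callback, which filters ---- */
struct node { const char *name; int depth; const char *bootargs; };
typedef int (*scan_cb)(const struct node *, void *);

/* Visit nodes in order until the callback returns non-zero. */
int scan_nodes(const struct node *nodes, size_t n, scan_cb cb, void *data)
{
    int rc = 0;
    for (size_t i = 0; i < n && rc == 0; i++)
        rc = cb(&nodes[i], data);
    return rc;
}

int scan_chosen(const struct node *nd, void *data)
{
    if (nd->depth != 1 || strcmp(nd->name, "chosen") != 0)
        return 0;                    /* not our node, keep scanning */
    *(const char **)data = nd->bootargs;
    return 1;                        /* found it, stop the scan */
}
```

The scan style costs one full traversal per callback, which matches the several of_scan_flat_dt() calls in early_init_devtree(), but it needs no registration table and lets each callback apply arbitrary matching rules such as depth and node name.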

kerneler_
