Studying the Linux Network Stack: the Neighboring Subsystem

Reposted from: http://blog.chinaunix.net/space.php?uid=488742&do=blog&id=2113738

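1. Overview

When a packet is sent, routing yields the L3 address of the next hop; the next step is to obtain the L2 address that corresponds to that L3 address. This process is called neighbor discovery. For IPv4 it is done by ARP; for IPv6, by the Neighbor Discovery (ND) protocol.

In Linux, the module that handles neighbor discovery is called the neighboring subsystem. It has two layers: at the bottom is a generic framework, the neighboring infrastructure, and on top of it sit the protocol-specific implementations such as the ARP module and the ND module.

The main tasks of the neighboring subsystem are:

1) Neighbour discovery: find the L2 address for an L3 address, so that data can actually be transmitted
2) Receive and process neighbor protocol packets
3) Provide a cache to speed up the neighboring process
4) Provide APIs for other modules in the system that need neighbor discovery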

2. Neighboring infrastructure

2.1 Main data structures

1) struct neighbour
The most important structure

2) struct neigh_table
Manages the struct neighbour instances

3) struct neigh_ops
Maps to the L2 output functions

4) struct neigh_parms

5) struct hh_cache
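2.2 Relationships among the data structures

The original post shows a diagram of the data structures in the neighboring subsystem at this point. Their relationships can be described as follows:

1) The system manages each concrete neigh_table, including arp_tbl and nd_tbl, through neigh_tables.
2) Each neigh_table maintains a hash table of neighbours through hash_buckets, so neighbours can be added, removed and looked up quickly.
3) The role of neighbour itself: ??? A neighbour's parms field points to a neigh_parms structure, which holds the parameters used to maintain the neighbour, such as retransmission counts, state-transition times and garbage-collection times.
4) A neighbour's ops field points to a neigh_ops structure, which (as listed in 2.1) maps to the L2 output functions.
5) A neighbour's hh field points to an hh_cache, which caches the L2 header (and with it the L2 address) to speed up the L3-to-L2 mapping.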
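The relationships listed above can be pictured with a small user-space sketch. This is not the kernel's definition: the field selection, the fixed-size bucket array, SKETCH_HASH_SIZE and sketch_count_tables() are assumptions made for illustration, and only the structure names and the parms/ops/hh/hash_buckets links come from the text.

```c
#include <stddef.h>
#include <stdint.h>

struct neighbour;

/* Tunables used to maintain neighbours (a small, illustrative subset). */
struct neigh_parms {
    unsigned long retrans_time;
    unsigned long reachable_time;
    unsigned long delay_probe_time;
};

/* Output functions that hand a packet to L2. */
struct neigh_ops {
    int (*output)(struct neighbour *neigh, const void *pkt);
    int (*connected_output)(struct neighbour *neigh, const void *pkt);
};

/* Cached L2 header; the buffer size here is an arbitrary choice. */
struct hh_cache {
    uint8_t hh_data[16];
    size_t  hh_len;
};

struct neighbour {
    struct neighbour       *next;      /* next entry in the same hash bucket */
    struct neigh_parms     *parms;     /* maintenance parameters */
    const struct neigh_ops *ops;       /* L2 output functions */
    struct hh_cache        *hh;        /* cached L2 header, may be NULL */
    uint8_t                 ha[6];     /* resolved L2 address */
    uint8_t                 nud_state;
};

#define SKETCH_HASH_SIZE 32            /* hypothetical bucket count */

struct neigh_table {
    struct neigh_table *next;          /* link in the global neigh_tables list */
    struct neigh_parms  parms;         /* default parameters of this table */
    /* The kernel uses a pointer (struct neighbour **hash_buckets);
     * a fixed array keeps the sketch self-contained. */
    struct neighbour   *hash_buckets[SKETCH_HASH_SIZE];
};

/* Walk the list of tables the way a global neigh_tables pointer would. */
static size_t sketch_count_tables(const struct neigh_table *head)
{
    size_t n = 0;

    for (; head != NULL; head = head->next)
        n++;
    return n;
}
```

Linking an arp_tbl and an nd_tbl through next and passing the head to sketch_count_tables() yields 2, which mirrors how one global list reaches every protocol's table.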

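2.3 Utility functions

1) struct neighbour *neigh_alloc(struct neigh_table *tbl)
Allocates and initializes a neighbour. It is called only by neigh_create().

2) struct neighbour * neigh_create(struct neigh_table *tbl, const void *pkey, struct net_device *dev)
Calls neigh_alloc() to allocate a neighbour, then calls the protocol-specific constructor and any device-specific setup function, and finally adds the neighbour to the neighbour table.
It is called mainly from __neigh_lookup(); in other words, when a neighbour cannot be found in the neighbour table, this function is called to create a new one.

3) struct neighbour *neigh_lookup(struct neigh_table *tbl, const void *pkey, struct net_device *dev)
Looks up a specific neighbour in the neighbour table.

4) static void neigh_timer_handler(unsigned long arg)
A timer handler. When a neighbour's timer expires, this function handles it.

5) int neigh_update(struct neighbour *neigh, const u8 *lladdr, u8 new, int override, int arp)
Updates the state of a neighbour, for example when an ARP reply arrives (see section 7).

6) void neigh_table_init(struct neigh_table *tbl)
Initializes a neigh_table.

Each table has a timer function used for garbage collection, that is, for removing neighbours that have timed out.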

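    init_timer(&tbl->gc_timer);
    tbl->lock = RW_LOCK_UNLOCKED;
    tbl->gc_timer.data = (unsigned long)tbl;        /* argument passed to the handler */
    tbl->gc_timer.function = neigh_periodic_timer;  /* garbage-collection handler */
    tbl->gc_timer.expires = now + 1;                /* first run shortly after init */
    add_timer(&tbl->gc_timer);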

This neigh_periodic_timer is actually defined as

static void SMP_TIMER_NAME(neigh_periodic_timer)(unsigned long arg)

7) int neigh_table_clear(struct neigh_table *tbl)
A neigh_table maintains its neighbours in a hash table:

struct neighbour            **hash_buckets;

Each protocol implementation has to provide its own hash function, for example arp_hash().

neigh_hash_alloc() creates the hash table.
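The lookup and create functions above, together with a protocol-supplied hash function, can be sketched in user space as follows. This is a simplified illustration, not the kernel code: the toy_ names, the 32-bit key, the integer ifindex standing in for struct net_device, TOY_HASH_SIZE and the hash mixing are all assumptions.

```c
#include <stdint.h>
#include <stdlib.h>

#define TOY_HASH_SIZE 16               /* hypothetical, must be a power of two */

struct toy_neighbour {
    struct toy_neighbour *next;        /* hash chain */
    uint32_t              key;         /* L3 address */
    int                   ifindex;     /* stands in for struct net_device * */
};

struct toy_neigh_table {
    /* Supplied by the protocol, in the spirit of arp_hash(). */
    uint32_t (*hash)(uint32_t key, int ifindex);
    struct toy_neighbour *hash_buckets[TOY_HASH_SIZE];
    unsigned int          entries;
};

/* Illustrative hash: fold the key and the interface into a bucket index. */
static uint32_t toy_arp_hash(uint32_t key, int ifindex)
{
    uint32_t h = key ^ (uint32_t)ifindex;

    h ^= h >> 16;
    h ^= h >> 8;
    return h & (TOY_HASH_SIZE - 1);
}

/* Find the neighbour for (key, ifindex), or return NULL. */
static struct toy_neighbour *toy_neigh_lookup(struct toy_neigh_table *tbl,
                                              uint32_t key, int ifindex)
{
    struct toy_neighbour *n = tbl->hash_buckets[tbl->hash(key, ifindex)];

    for (; n != NULL; n = n->next)
        if (n->key == key && n->ifindex == ifindex)
            return n;
    return NULL;
}

/* Allocate a neighbour and link it at the head of its bucket. */
static struct toy_neighbour *toy_neigh_create(struct toy_neigh_table *tbl,
                                              uint32_t key, int ifindex)
{
    struct toy_neighbour *n = calloc(1, sizeof(*n));
    uint32_t bucket;

    if (n == NULL)
        return NULL;
    n->key = key;
    n->ifindex = ifindex;
    bucket = tbl->hash(key, ifindex);
    n->next = tbl->hash_buckets[bucket];
    tbl->hash_buckets[bucket] = n;
    tbl->entries++;
    return n;
}

/* Lookup first; create only when nothing is found. */
static struct toy_neighbour *toy_neigh_lookup_or_create(
        struct toy_neigh_table *tbl, uint32_t key, int ifindex)
{
    struct toy_neighbour *n = toy_neigh_lookup(tbl, key, ifindex);

    return n != NULL ? n : toy_neigh_create(tbl, key, ifindex);
}
```

Calling toy_neigh_lookup_or_create() twice with the same key and interface returns the same entry and leaves entries at 1, which is the behaviour the text attributes to the lookup-then-create path.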

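3. Initialization of the neighbour system

The global variable neigh_tables tracks every neigh_table in the system:

static struct neigh_table *neigh_tables;

IPv4 ARP initialization:
    Call neigh_table_init() to initialize arp_tbl
    Call dev_add_pack(&arp_packet_type) to register the ARP receive function

IPv6 Neighbor Discovery initialization:
    Call neigh_table_init(&nd_tbl) to initialize nd_tbl
    IPv6 handles ND packets through ICMPv6, so there is no dedicated packet type to register as there is for ARP.

On the transmit path, a packet is associated with a neighbour structure during routing; once routing is finished, the packet is handed to the neighboring subsystem for further processing.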

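4. How routing ties into the neighboring subsystem

4.1 Associating a neighbour with a route

During routing, the kernel has to find or create a struct dst_entry (which also appears in the form of a struct rtable). The dst_entry is tied to a struct neighbour through its neighbour field.

4.1.1 Purpose of the association

Each dst_entry corresponds to one neighbour, so the neighbour can be found immediately after routing; from there the packet is passed to the link layer through neighbour->output.

Taking the transmission of a UDP packet as an example, the sequence is: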

udp_sendmsg() ==> ip_route_output() ==> ip_route_output_slow()

ip_route_output_slow():
   When nothing is found in the route cache, a dst_entry structure (which is at the same time an rtable structure) is created with dst_alloc() according to the route rules, and the output field of the dst_entry is pointed at ip_output():

   rth->u.dst.output=ip_output;

   After that, udp_sendmsg() goes on to call ip_build_xmit() to send the packet.

udp_sendmsg() ==> ip_build_xmit() ==> output_maybe_reroute ==> skb->dst->output()

The output here is ip_output().

ip_output ==> __ip_finish_output() ==> ip_finish_output2() ==> dst->neighbour->output()

So in the end the packet is pushed down through neighbour->output().
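The chain of function pointers described above can be mimicked with a few toy structures. This is only a sketch of the indirection, under the assumption that each layer does nothing but forward the packet; the toy_ function names and the tx_count field are invented for illustration, and the real kernel structures carry far more state.

```c
#include <stddef.h>

struct sk_buff;

struct neighbour {
    int (*output)(struct sk_buff *skb);   /* hands the packet to L2 */
    int tx_count;                         /* illustrative counter only */
};

struct dst_entry {
    struct neighbour *neighbour;          /* bound during routing */
    int (*output)(struct sk_buff *skb);   /* e.g. the IP output routine */
};

struct sk_buff {
    struct dst_entry *dst;                /* routing result attached to the packet */
};

/* Stand-in for neighbour->output(): just record the transmission. */
static int toy_neigh_output(struct sk_buff *skb)
{
    skb->dst->neighbour->tx_count++;
    return 0;
}

/* Stand-in for ip_output(): forward to the neighbour bound to the route. */
static int toy_ip_output(struct sk_buff *skb)
{
    return skb->dst->neighbour->output(skb);
}

/* Stand-in for the sender: call skb->dst->output() once routing is done. */
static int toy_send(struct sk_buff *skb)
{
    if (skb->dst == NULL || skb->dst->neighbour == NULL)
        return -1;                        /* not routed or not bound yet */
    return skb->dst->output(skb);
}
```

With a dst_entry whose output is toy_ip_output and whose neighbour uses toy_neigh_output, one toy_send() call increments tx_count once. The sender never touches the neighbour directly, which is the point of binding it to the route.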

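4.1.2 How the association is made

IPv4 implementation: ip_route_output_slow

If no result is found in the route cache, the route rules are searched; if no suitable rule exists, the function returns with an error. Otherwise a dst_entry structure, which is also an rtable structure, is created with dst_alloc() and linked into the hash table. At this point the L3 address of the next hop is known. (It may also be absent, for example when an interface is bound; how the code handles that case still needs to be checked.)

The next step is to bind the rtable to a neighbour through arp_bind_neighbour():

rt_intern_hash() ==> arp_bind_neighbour()

Given the L3 address of the next hop, arp_bind_neighbour() looks for the neighbour in the ARP hash table. If it is found, dst->neighbour is set to it; if not, a new neighbour has to be created with neigh_create(). Both steps happen inside __neigh_lookup_errno():

arp_bind_neighbour() ==> __neigh_lookup_errno() ==> neigh_lookup() ==> neigh_create() (only when the lookup fails)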

    ip_route_output_slow()
    fib_lookup()
    rt_intern_hash()
    arp_bind_neighbour()

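4.2 Constructing and setting up a neighbour

neigh_alloc() allocates the neighbour structure.

neigh_create() then sets the structure up further. For ARP it calls arp_constructor(), which sets the ops and output fields of the neighbour.

The ops field is set according to the type of the underlying driver:

    devices without a link-layer address: arp_direct_ops
    devices without a link-layer header cache: arp_generic_ops
    devices with a link-layer header cache: arp_hh_ops

For an Ethernet driver, the net_device structure already has default hard_header and hard_header_cache functions once it has been initialized:

    ether_setup()
    dev->hard_header        = eth_header;
    dev->hard_header_cache         = eth_header_cache;

So by default its ops field points to arp_hh_ops.

For the output field, what matters is the state of the neighbour. If the state is valid, output is set to ops->connected_output(), which is faster; otherwise it is set to ops->output(), which has to go through neighbor discovery.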
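The two decisions made in the constructor can be written as a pair of small functions. This is a sketch under stated assumptions: the toy_ types and enum values are invented, "no link-layer address" is modelled as a device without a hard_header function, and "valid" is taken to mean the NUD_VALID mask defined in section 5.

```c
#include <stdbool.h>

#define NUD_REACHABLE  0x02
#define NUD_STALE      0x04
#define NUD_DELAY      0x08
#define NUD_PROBE      0x10
#define NUD_NOARP      0x40
#define NUD_PERMANENT  0x80
#define NUD_VALID (NUD_PERMANENT | NUD_NOARP | NUD_REACHABLE | \
                   NUD_PROBE | NUD_STALE | NUD_DELAY)

/* Hypothetical summary of what the driver provides. */
struct toy_device {
    bool has_hard_header;         /* device builds an L2 header */
    bool has_hard_header_cache;   /* device can cache that header */
};

enum toy_ops_kind    { TOY_DIRECT_OPS, TOY_GENERIC_OPS, TOY_HH_OPS };
enum toy_output_kind { TOY_OUTPUT, TOY_CONNECTED_OUTPUT };

/* Pick the ops table from the driver's capabilities. */
static enum toy_ops_kind toy_pick_ops(const struct toy_device *dev)
{
    if (!dev->has_hard_header)
        return TOY_DIRECT_OPS;    /* nothing to resolve */
    if (dev->has_hard_header_cache)
        return TOY_HH_OPS;        /* e.g. Ethernet after ether_setup() */
    return TOY_GENERIC_OPS;
}

/* Pick the output function from the neighbour state. */
static enum toy_output_kind toy_pick_output(unsigned char nud_state)
{
    return (nud_state & NUD_VALID) ? TOY_CONNECTED_OUTPUT : TOY_OUTPUT;
}
```

A device with both functions set, as Ethernet has, maps to TOY_HH_OPS, while a freshly created neighbour whose state is 0 gets the slower TOY_OUTPUT path.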

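For ARP, at least in arp_hh_ops, both output and connected_output point to neigh_resolve_output(). (Why? This is left as an open question.)

neigh_resolve_output() carries out the neighbor discovery process.

Before looking at neighbor discovery, we first need to understand the state transition mechanism of a neighbour.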

5. Neighbour state transitions

5.1 Neighbour states

A neighbour structure can be in any of the following states:

#define NUD_INCOMPLETE  0x01
#define NUD_REACHABLE   0x02
#define NUD_STALE       0x04
#define NUD_DELAY       0x08
#define NUD_PROBE       0x10
#define NUD_FAILED      0x20
#define NUD_NOARP       0x40
#define NUD_PERMANENT   0x80
#define NUD_NONE        0x00

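5.2 The state transition process

Linux maintains a state machine for each neighbour; the machine is driven by timers and by packets being sent and received. It can be described as follows:

1) After creation, a neighbour is in the NONE state.

2) neigh_resolve_output() calls neigh_event_send() to trigger a state transition; for the NONE state this ends up in __neigh_event_send().

__neigh_event_send() sets the neighbour state to INCOMPLETE and arms the timer with a timeout of neigh->parms->retrans_time.
It then calls neigh->ops->solicit to send a neighbour discovery packet. For ARP this is arp_solicit().
The handling of a normally received response is analyzed in the section on receiving ARP packets; in short, the state becomes REACHABLE.
If no response arrives before the timeout, the timeout handler neigh_timer_handler() takes over.

3) neigh_timer_handler() retransmits the request; the number of retransmissions is computed by neigh_max_probes(). Once that number is exceeded, the state becomes FAILED.

4) A neighbour in the REACHABLE state stays valid for a limited time (which parameter?); once that time has passed, it is handled by neigh_periodic_timer().

5) The first validity period is reachable_time. For ARP the default base value is 30 * HZ jiffies, which is 30 seconds, since HZ is the number of jiffies per second. After this time without confirmation, the neighbour becomes a candidate for the STALE state.

6) The transition to STALE is harder to understand. Once a neighbour is REACHABLE, if no reply has been received for some time, the neighbour is suspected to be unreachable and its state should become STALE, but the transition does not happen immediately. (Under which conditions does the transition happen?)

If __neigh_event_send() is entered again while the neighbour is STALE, the state is changed to DELAY and a timer is started. (The timeout is delay_probe_time, whose ARP default is 5 * HZ jiffies, i.e. 5 seconds.)
In the STALE state no ARP resolution is performed; packets can still be sent directly.

7) In the DELAY state packets can be sent directly, but as soon as the timer expires the neighbour moves to PROBE. If a reply is received in the meantime, the state becomes REACHABLE.

8) In the PROBE state packets can still be sent directly, but ARP probing starts. According to the original analysis only a single probe is sent; if it fails, the state becomes FAILED, and if a reply is received, the state becomes REACHABLE.

9) neigh_periodic_timer() periodically cleans up neighbours in the FAILED state.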
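The transitions listed above can be condensed into a small event-driven sketch. It is a simplified model, not the kernel's implementation: the event names, the toy_neigh structure and the single max_probes budget shared by INCOMPLETE and PROBE are assumptions, and timer arming, packet queueing and garbage collection are left out. Setting max_probes to 1 reproduces the single probe described for the PROBE state.

```c
#define NUD_NONE        0x00
#define NUD_INCOMPLETE  0x01
#define NUD_REACHABLE   0x02
#define NUD_STALE       0x04
#define NUD_DELAY       0x08
#define NUD_PROBE       0x10
#define NUD_FAILED      0x20

/* Hypothetical events that drive the sketch. */
enum toy_event {
    EV_SEND,               /* a packet wants to use this neighbour */
    EV_REPLY,              /* a reply (confirmation) was received */
    EV_TIMER,              /* the per-neighbour timer expired */
    EV_REACHABLE_EXPIRED   /* reachable_time passed without confirmation */
};

struct toy_neigh {
    unsigned char state;
    int probes;            /* solicitations sent in the current round */
    int max_probes;        /* retry budget, cf. neigh_max_probes() */
};

static void toy_neigh_event(struct toy_neigh *n, enum toy_event ev)
{
    switch (n->state) {
    case NUD_NONE:
        if (ev == EV_SEND) {           /* first use: start resolving */
            n->state = NUD_INCOMPLETE;
            n->probes = 1;
        }
        break;
    case NUD_INCOMPLETE:
    case NUD_PROBE:
        if (ev == EV_REPLY) {
            n->state = NUD_REACHABLE;
        } else if (ev == EV_TIMER) {
            if (n->probes >= n->max_probes)
                n->state = NUD_FAILED; /* budget used up */
            else
                n->probes++;           /* retransmit the solicitation */
        }
        break;
    case NUD_REACHABLE:
        if (ev == EV_REACHABLE_EXPIRED)
            n->state = NUD_STALE;
        break;
    case NUD_STALE:
        if (ev == EV_SEND)             /* packet still goes out */
            n->state = NUD_DELAY;
        break;
    case NUD_DELAY:
        if (ev == EV_REPLY) {
            n->state = NUD_REACHABLE;
        } else if (ev == EV_TIMER) {   /* nothing confirmed it: probe */
            n->state = NUD_PROBE;
            n->probes = 1;
        }
        break;
    default:
        break;                         /* FAILED and others: no change here */
    }
}
```

Feeding the sequence EV_SEND, EV_REPLY, EV_REACHABLE_EXPIRED, EV_SEND, EV_TIMER walks a neighbour through INCOMPLETE, REACHABLE, STALE, DELAY and PROBE in that order.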

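Linux also defines several masks that combine states:

#define NUD_IN_TIMER (NUD_INCOMPLETE|NUD_DELAY|NUD_PROBE)

In the INCOMPLETE, DELAY and PROBE states a timer is running.

#define NUD_VALID (NUD_PERMANENT|NUD_NOARP|NUD_REACHABLE|NUD_PROBE|NUD_STALE|NUD_DELAY)

In these states packets are sent directly, without going through ARP resolution.

When a neighbour is not in a VALID state, packets cannot be sent and are placed on neighbour->arp_queue instead. (The relevant code can be seen in __neigh_event_send().)

#define NUD_CONNECTED (NUD_PERMANENT|NUD_NOARP|NUD_REACHABLE)

In these states packets are not only sent directly; the neighbour is also known to be reachable.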
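As a quick check on how these masks partition the states, the sketch below turns them into a transmit decision. The mask definitions are the ones quoted above; the toy_tx enum and the helper names are assumptions added for illustration.

```c
#define NUD_NONE        0x00
#define NUD_INCOMPLETE  0x01
#define NUD_REACHABLE   0x02
#define NUD_STALE       0x04
#define NUD_DELAY       0x08
#define NUD_PROBE       0x10
#define NUD_FAILED      0x20
#define NUD_NOARP       0x40
#define NUD_PERMANENT   0x80

#define NUD_IN_TIMER  (NUD_INCOMPLETE | NUD_DELAY | NUD_PROBE)
#define NUD_VALID     (NUD_PERMANENT | NUD_NOARP | NUD_REACHABLE | \
                       NUD_PROBE | NUD_STALE | NUD_DELAY)
#define NUD_CONNECTED (NUD_PERMANENT | NUD_NOARP | NUD_REACHABLE)

/* Hypothetical outcome of trying to transmit through a neighbour. */
enum toy_tx {
    TOY_TX_QUEUE,            /* not valid: park the packet on arp_queue */
    TOY_TX_SEND,             /* valid: send, reachability not confirmed */
    TOY_TX_SEND_CONFIRMED    /* connected: send, neighbour known reachable */
};

/* Non-zero when a per-neighbour timer is running in this state. */
static int toy_nud_in_timer(unsigned char state)
{
    return (state & NUD_IN_TIMER) != 0;
}

/* Classify a state by testing the narrower mask first. */
static enum toy_tx toy_tx_decision(unsigned char state)
{
    if (state & NUD_CONNECTED)
        return TOY_TX_SEND_CONFIRMED;
    if (state & NUD_VALID)
        return TOY_TX_SEND;
    return TOY_TX_QUEUE;
}
```

NUD_CONNECTED is tested before NUD_VALID because every connected state is also valid; testing in the other order would never report the confirmed case.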

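5.3 A real-world case

In an embedded wireless product the author worked on, the application had to send one UDP packet every 3 s. The traffic was one-way: packets were only sent out and no response ever came back. Testing showed that every UDP packet was preceded by an ARP query. These unnecessary ARP packets hurt performance and also cost power.

Analysis of the cause showed:

After a successful ARP resolution, the neighbour stayed valid for only about 300 ms (the figure from the analysis at the time).

Before the first UDP packet is sent, the neighbour cache is empty, so an ARP query has to be sent to obtain the MAC address of the next hop. When the ARP reply arrives, the matching neighbour is found, its state is changed to STALE and then immediately to DELAY, and a timer (50 ms in that analysis) is started. The UDP packet that triggered the ARP query can now use the neighbour mapping and is sent out, but because this UDP packet does not call for any response, the neighbour moves to PROBE when the timer expires. As a result, 3 s later the next UDP packet causes another ARP query.

Solution:
Since the next hop of this embedded device is always the gateway, the L2 address of the gateway can be obtained once through an ARP query and then installed on the device as a static ARP entry. After that, these pointless ARP packets no longer appear.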
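One way to install such a static entry from a program is the SIOCSARP ioctl with the ATF_PERM flag, which is the interface described in the Linux arp(7) manual page. The sketch below is an assumption about how this could be done, not what the author's product did; the toy_ function names are invented, the entry is assumed to be for an Ethernet device, and the actual ioctl call needs sufficient privileges, so only the request-building step is exercised here.

```c
#include <arpa/inet.h>
#include <net/if_arp.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Fill an arpreq describing a permanent IPv4-to-MAC mapping on ifname.
 * Returns 0 on success, -1 if ip is not a valid dotted-quad address. */
static int toy_fill_static_arp(struct arpreq *req, const char *ip,
                               const uint8_t mac[6], const char *ifname)
{
    struct sockaddr_in *pa;

    memset(req, 0, sizeof(*req));
    pa = (struct sockaddr_in *)&req->arp_pa;
    pa->sin_family = AF_INET;
    if (inet_pton(AF_INET, ip, &pa->sin_addr) != 1)
        return -1;

    req->arp_ha.sa_family = ARPHRD_ETHER;      /* Ethernet hardware address */
    memcpy(req->arp_ha.sa_data, mac, 6);
    req->arp_flags = ATF_PERM | ATF_COM;       /* permanent, completed entry */
    strncpy(req->arp_dev, ifname, sizeof(req->arp_dev) - 1);
    return 0;
}

/* Hand the request to the kernel. Requires privileges (e.g. CAP_NET_ADMIN). */
static int toy_install_static_arp(const struct arpreq *req)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    int rc;

    if (fd < 0)
        return -1;
    rc = ioctl(fd, SIOCSARP, req);
    close(fd);
    return rc;
}
```

A caller would fill the request with the gateway's address, for instance toy_fill_static_arp(&req, "192.168.1.1", mac, "eth0") with placeholder values, and then pass it to toy_install_static_arp(). The resulting entry is in the PERMANENT state, which belongs to NUD_CONNECTED, so no further probing is triggered for it.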

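6. The neighbor discovery process

As the state machine above shows, a Neighbor Solicitation packet is sent when a neighbour is in the INCOMPLETE or PROBE state.

For example, after neigh_resolve_output() has caused a new neighbour structure to be created, neigh->ops->solicit() is eventually called to send the solicitation; for ARP this is arp_solicit():

neigh_resolve_output() ==> neigh_event_send() ==> __neigh_event_send() ==> neigh->ops->solicit(neigh, skb); ==> arp_solicit()

arp_solicit() calls arp_send() to build and send the ARP request:

In the INCOMPLETE state a fresh ARP request has to be sent. Its destination MAC address is the broadcast address, so every node on the link receives it.

In the PROBE state the neighbour already holds the peer's MAC address, and the ARP request is sent only to verify that the mapping is still valid. The destination MAC address can therefore be taken from the neighbour, and the request is a unicast ARP packet.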
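The choice between a broadcast and a unicast request can be sketched as a single function. The toy_ names and the return convention are assumptions for illustration; the kernel's arp_solicit() does considerably more, such as selecting the source address and counting probes.

```c
#include <stdint.h>
#include <string.h>

#define TOY_ETH_ALEN    6
#define NUD_INCOMPLETE  0x01
#define NUD_PROBE       0x10

struct toy_neigh {
    uint8_t nud_state;
    uint8_t ha[TOY_ETH_ALEN];   /* cached peer MAC, valid in PROBE */
};

/* Choose the destination MAC for an ARP request.
 * Returns 0 and fills dst when the state calls for a solicitation,
 * or -1 when no request should be sent in this state. */
static int toy_solicit_dst(const struct toy_neigh *n,
                           uint8_t dst[TOY_ETH_ALEN])
{
    if (n->nud_state == NUD_INCOMPLETE) {
        memset(dst, 0xff, TOY_ETH_ALEN);    /* nothing known yet: broadcast */
        return 0;
    }
    if (n->nud_state == NUD_PROBE) {
        memcpy(dst, n->ha, TOY_ETH_ALEN);   /* verify the cached mapping: unicast */
        return 0;
    }
    return -1;
}
```

Probing by unicast keeps the verification traffic away from every other node on the link, which is why the cached address is reused instead of broadcasting again.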

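7. Receiving and processing ARP packets

arp_rcv() ==> arp_process()

If the packet is an ARP request addressed to this host, neigh_event_ns() is called to create a new neighbour structure, and then arp_send() is called to answer with an ARP reply.
If the packet is an ARP reply, __neigh_lookup() is called to look for the matching neighbour. If there is none, the packet is dropped; otherwise neigh_update() is called to update the neighbour's state to REACHABLE, and all packets waiting on this neighbour are sent.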
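The two receive cases can be modelled with a toy cache. This is a sketch under several assumptions: the toy_ names, the fixed-size cache and the queued/flushed counters are invented, a neighbour learned from a request is recorded as STALE (the sketch's reading of what neigh_event_ns() does), and requests for other hosts are simply dropped, which ignores proxy ARP.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_ETH_ALEN   6
#define TOY_MAX_NEIGH  8
#define NUD_REACHABLE  0x02
#define NUD_STALE      0x04

enum toy_arp_op  { TOY_ARP_REQUEST = 1, TOY_ARP_REPLY = 2 };
enum toy_verdict { TOY_DROPPED, TOY_REPLIED, TOY_UPDATED };

struct toy_neigh {
    int      used;
    uint32_t ip;
    uint8_t  ha[TOY_ETH_ALEN];
    uint8_t  nud_state;
    int      queued;                 /* packets waiting for resolution */
};

struct toy_host {
    uint32_t         my_ip;
    struct toy_neigh cache[TOY_MAX_NEIGH];
    int              flushed;        /* packets released after a reply */
};

/* Find the entry for ip; optionally create it in a free slot. */
static struct toy_neigh *toy_find(struct toy_host *h, uint32_t ip, int create)
{
    struct toy_neigh *free_slot = NULL;
    size_t i;

    for (i = 0; i < TOY_MAX_NEIGH; i++) {
        if (h->cache[i].used && h->cache[i].ip == ip)
            return &h->cache[i];
        if (!h->cache[i].used && free_slot == NULL)
            free_slot = &h->cache[i];
    }
    if (!create || free_slot == NULL)
        return NULL;
    memset(free_slot, 0, sizeof(*free_slot));
    free_slot->used = 1;
    free_slot->ip = ip;
    return free_slot;
}

/* sip/sha: sender IP and MAC from the packet; tip: target IP. */
static enum toy_verdict toy_arp_process(struct toy_host *h, enum toy_arp_op op,
                                        uint32_t sip, const uint8_t *sha,
                                        uint32_t tip)
{
    struct toy_neigh *n;

    if (op == TOY_ARP_REQUEST) {
        if (tip != h->my_ip)
            return TOY_DROPPED;             /* not for this host */
        n = toy_find(h, sip, 1);            /* learn the sender */
        if (n != NULL) {
            memcpy(n->ha, sha, TOY_ETH_ALEN);
            n->nud_state = NUD_STALE;
        }
        return TOY_REPLIED;                 /* an ARP reply would be sent here */
    }
    if (op == TOY_ARP_REPLY) {
        n = toy_find(h, sip, 0);
        if (n == NULL)
            return TOY_DROPPED;             /* unsolicited reply */
        memcpy(n->ha, sha, TOY_ETH_ALEN);
        n->nud_state = NUD_REACHABLE;
        h->flushed += n->queued;            /* release the waiting packets */
        n->queued = 0;
        return TOY_UPDATED;
    }
    return TOY_DROPPED;
}
```

A reply from an address that has no cache entry returns TOY_DROPPED, while a reply for a known neighbour marks it REACHABLE and releases whatever was queued on it.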

