The transport layer of each protocol family stores the information required by sockets in its own transport control block: the TCP transport control block, the UDP transport control block, the raw IP transport control block, and so on.
The Linux kernel's definition of transport control blocks is quite elegant: based on the characteristics of each protocol family and transport-layer protocol, multiple structures are defined in layers and composed into transport control blocks. The IPv4 protocol family uses the sock_common, sock, inet_sock, inet_connection_sock, tcp_sock, request_sock, inet_request_sock, tcp_request_sock, inet_timewait_sock, tcp_timewait_sock, udp_sock, and raw_sock structures.
sock_common
This structure is the minimal set of transport control block information. It consists of the identical leading portion shared by the sock and inet_timewait_sock structures, and is therefore used only to build those two structures.
sock
This structure is the generic network-layer description block. It forms the basis of a transport control block and is independent of any specific protocol family. It describes functionality common to the transport-layer protocols of all protocol families, so it cannot be used directly as a transport control block; each protocol family's transport layer extends it to suit its own transport characteristics. For example, the inet_sock structure is composed of the sock structure plus additional fields, forming the basis of IPv4 transport control blocks.
inet_sock
This structure is the generic IPv4 description block. It contains the information shared by the basic IPv4 transport-layer control blocks, i.e., the UDP, TCP, and raw control blocks.
inet_connection_sock
This structure is the description block supporting connection-oriented features. It forms the basis of the IPv4 TCP control block, adding connection support on top of the inet_sock structure.
tcp_sock
This structure is the TCP transport control block itself. It supports the full TCP feature set and contains all the state that TCP maintains for each connection.
inet_timewait_sock
This structure describes the connection-oriented TCP_TIME_WAIT state and is the basis of the tcp_timewait_sock structure.
tcp_timewait_sock
This structure is the TCP_TIME_WAIT state description block, a rather special kind of transport control block. When a TCP connection enters the TCP_TIME_WAIT state, its tcp_sock structure degenerates into a tcp_timewait_sock structure.
udp_sock
This structure is the UDP transport control block and supports the full UDP feature set. Almost all the information UDP needs is already described in the inet_sock structure.
The basic transport control block, the IPv4-specific transport control blocks, and the transport-layer generic functions involve the following files:
include/net/sock.h	Defines the basic transport control block structure, along with related macros and function prototypes
include/net/inet_sock.h	Defines the IPv4-specific transport control blocks
net/core/sock.c	Implements the transport-layer generic functions
net/socket.c	Implements the socket-layer calls
Memory management of transport control blocks
Allocating and freeing transport control blocks
sk_alloc()
When a socket is created, TCP, UDP, and raw IP each allocate a transport control block. The function that allocates a transport control block is sk_alloc(); when the control block's lifetime ends, it is released via sk_free().
/**
* sk_alloc - All socket objects are allocated here
* @net: the applicable net namespace
* @family: protocol family
* @priority: for allocation (%GFP_KERNEL, %GFP_ATOMIC, etc)
* @prot: struct proto associated with this new sock instance
*/
struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
struct proto *prot)
{
struct sock *sk;
sk = sk_prot_alloc(prot, priority | __GFP_ZERO, family);
if (sk) {
sk->sk_family = family;
/*
* See comment in struct sock definition to understand
* why we need sk_prot_creator -acme
*/
sk->sk_prot = sk->sk_prot_creator = prot;
sock_lock_init(sk);
sock_net_set(sk, get_net(net));
atomic_set(&sk->sk_wmem_alloc, 1);
}
return sk;
}
sk_free()
sk_free() releases the specified transport control block. It is usually called from sock_put(), which invokes it to perform the release only when the control block's reference count has dropped to zero.
static void __sk_free(struct sock *sk)
{
struct sk_filter *filter;
if (sk->sk_destruct)
sk->sk_destruct(sk);
filter = rcu_dereference(sk->sk_filter);
if (filter) {
sk_filter_uncharge(sk, filter);
rcu_assign_pointer(sk->sk_filter, NULL);
}
sock_disable_timestamp(sk, SOCK_TIMESTAMP);
sock_disable_timestamp(sk, SOCK_TIMESTAMPING_RX_SOFTWARE);
if (atomic_read(&sk->sk_omem_alloc))
printk(KERN_DEBUG "%s: optmem leakage (%d bytes) detected.\n",
__func__, atomic_read(&sk->sk_omem_alloc));
put_net(sock_net(sk));
sk_prot_free(sk->sk_prot_creator, sk);
}
void sk_free(struct sock *sk)
{
/*
* We substract one from sk_wmem_alloc and can know if
* some packets are still in some tx queue.
* If not null, sock_wfree() will call __sk_free(sk) later
*/
if (atomic_dec_and_test(&sk->sk_wmem_alloc))
__sk_free(sk);
}
Allocating ordinary send buffers
sock_alloc_send_skb()
Allocates SKBs used for output, mainly for UDP and raw sockets. Compared with sock_wmalloc(), it handles considerably more detail during allocation: it checks for errors already recorded on the transport control block, checks the socket shutdown flag, supports blocking, and so on. It is actually implemented as a direct call to sock_alloc_send_pskb().
/*
* Generic send/receive buffer handlers
*/
struct sk_buff *sock_alloc_send_pskb(struct sock *sk, unsigned long header_len,
unsigned long data_len, int noblock,
int *errcode)
{
struct sk_buff *skb;
gfp_t gfp_mask;
long timeo;
int err;
gfp_mask = sk->sk_allocation;
if (gfp_mask & __GFP_WAIT)
gfp_mask |= __GFP_REPEAT;
timeo = sock_sndtimeo(sk, noblock);
while (1) {
err = sock_error(sk);
if (err != 0)
goto failure;
err = -EPIPE;
if (sk->sk_shutdown & SEND_SHUTDOWN)
goto failure;
if (atomic_read(&sk->sk_wmem_alloc) < sk->sk_sndbuf) {
skb = alloc_skb(header_len, gfp_mask);
if (skb) {
int npages;
int i;
/* No pages, we're done... */
if (!data_len)
break;
npages = (data_len + (PAGE_SIZE - 1)) >> PAGE_SHIFT;
skb->truesize += data_len;
skb_shinfo(skb)->nr_frags = npages;
for (i = 0; i < npages; i++) {
struct page *page;
skb_frag_t *frag;
page = alloc_pages(sk->sk_allocation, 0);
if (!page) {
err = -ENOBUFS;
skb_shinfo(skb)->nr_frags = i;
kfree_skb(skb);
goto failure;
}
frag = &skb_shinfo(skb)->frags[i];
frag->page = page;
frag->page_offset = 0;
frag->size = (data_len >= PAGE_SIZE ?
PAGE_SIZE :
data_len);
data_len -= PAGE_SIZE;
}
/* Full success... */
break;
}
err = -ENOBUFS;
goto failure;
}
set_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags);
set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
err = -EAGAIN;
if (!timeo)
goto failure;
if (signal_pending(current))
goto interrupted;
timeo = sock_wait_for_wmem(sk, timeo);
}
skb_set_owner_w(skb, sk);
return skb;
interrupted:
err = sock_intr_errno(timeo);
failure:
*errcode = err;
return NULL;
}
struct sk_buff *sock_alloc_send_skb(struct sock *sk, unsigned long size,
int noblock, int *errcode)
{
return sock_alloc_send_pskb(sk, size, 0, noblock, errcode);
}
Allocating and freeing send buffers
sock_wmalloc()
sock_wmalloc() also allocates send buffers. In TCP it is used only when constructing SYN+ACK segments; when sending user data, the send buffer is usually allocated with sk_stream_alloc_pskb().
/*
* Allocate a skb from the socket's send buffer.
*/
struct sk_buff *sock_wmalloc(struct sock *sk, unsigned long size, int force,
gfp_t priority)
{
if (force || atomic_read(&sk->sk_wmem_alloc) < sk->sk_sndbuf) {
struct sk_buff *skb = alloc_skb(size, priority);
if (skb) {
skb_set_owner_w(skb, sk);
return skb;
}
}
return NULL;
}
skb_set_owner_w()
Every SKB used for output must be associated with a transport control block, so that the total size of all SKB data areas allocated for sending on that control block can be adjusted; this function also sets the SKB's destructor.
/*
* Queue a received datagram if it will fit. Stream and sequenced
* protocols can't normally use this as they need to fit buffers in
* and play with them.
*
* Inlined as it's very short and called for pretty much every
* packet ever received.
*/
static inline void skb_set_owner_w(struct sk_buff *skb, struct sock *sk)
{
skb_orphan(skb);
skb->sk = sk;
skb->destructor = sock_wfree;
/*
* We used to take a refcount on sk, but following operation
* is enough to guarantee sk_free() wont free this sock until
* all in-flight packets are completed
*/
atomic_add(skb->truesize, &sk->sk_wmem_alloc);
}
sock_wfree()
sock_wfree() is usually installed as the destructor of SKBs used for output. It is called when such an SKB is freed: it updates the total size of all SKB data areas allocated for sending on the owning transport control block, calls the sk_write_space interface to wake up any processes sleeping while waiting on this socket, and drops a reference on the owning transport control block.
/*
* Simple resource managers for sockets.
*/
/*
* Write buffer destructor automatically called from kfree_skb.
*/
void sock_wfree(struct sk_buff *skb)
{
struct sock *sk = skb->sk;
unsigned int len = skb->truesize;
if (!sock_flag(sk, SOCK_USE_WRITE_QUEUE)) {
/*
* Keep a reference on sk_wmem_alloc, this will be released
* after sk_write_space() call
*/
atomic_sub(len - 1, &sk->sk_wmem_alloc);
sk->sk_write_space(sk);
len = 1;
}
/*
* if sk_wmem_alloc reaches 0, we must finish what sk_free()
* could not do because of in-flight packets
*/
if (atomic_sub_and_test(len, &sk->sk_wmem_alloc))
__sk_free(sk);
}
Allocating and freeing receive buffers
SKBs used for input are all allocated at the driver layer via dev_alloc_skb() or alloc_skb(); before being passed up to the transport layer they do not belong to any particular transport control block. Once an SKB enters the transport layer, however, its owner must be set.
skb_set_owner_r()
When the SKB of a UDP datagram is delivered and added to the receive queue of a UDP transport control block, skb_set_owner_r() is called to set the SKB's owner, set its destructor, and update the total length of packet data held in the receive queue.
static inline void skb_set_owner_r(struct sk_buff *skb, struct sock *sk)
{
skb_orphan(skb);
skb->sk = sk;
skb->destructor = sock_rfree;
atomic_add(skb->truesize, &sk->sk_rmem_alloc);
sk_mem_charge(sk, skb->truesize);
}
Asynchronous I/O
Although combining blocking and non-blocking operations with the select method works for polling devices in most cases, there are situations it does not fully solve.
Consider, for example, a process that runs a long computation loop at low priority but needs to process input data as quickly as possible. If the process obtains its data from a peripheral, it should know immediately when new data becomes available. The application could call select() periodically to check for data, but for faster handling of peripheral data it can use asynchronous notification instead, so that the application receives a signal rather than having to poll actively.
A user program must perform two steps to enable asynchronous notification from an input file. First, it designates a process as the owner of the file: when a process issues the F_SETOWN command via the fcntl system call, the owner process's ID is saved in filp->f_owner for later use; this step tells the kernel whom to notify. Second, to actually enable asynchronous notification, the program must set the FASYNC flag on the device via fcntl's F_SETFL command. After these two calls, the process handling asynchronous I/O can receive the SIGIO signal; from then on, whenever new data arrives, the signal is sent to the process stored in filp->f_owner.
For example, the following user-space code enables asynchronous notification on standard input for the current process:
signal(SIGIO, &input_handler);
fcntl(STDIN_FILENO, F_SETOWN, getpid());
int oflags = fcntl(STDIN_FILENO, F_GETFL);
fcntl(STDIN_FILENO, F_SETFL, oflags | FASYNC);
sk_wake_async()
Sends the SIGIO or SIGURG signal to the processes waiting on the socket, notifying them that the file can now be read or written.
/* This function may be called only under socket lock or callback_lock */
int sock_wake_async(struct socket *sock, int how, int band)
{
if (!sock || !sock->fasync_list)
return -1;
switch (how) {
case SOCK_WAKE_WAITD:
if (test_bit(SOCK_ASYNC_WAITDATA, &sock->flags))
break;
goto call_kill;
case SOCK_WAKE_SPACE:
if (!test_and_clear_bit(SOCK_ASYNC_NOSPACE, &sock->flags))
break;
/* fall through */
case SOCK_WAKE_IO:
call_kill:
__kill_fasync(sock->fasync_list, SIGIO, band);
break;
case SOCK_WAKE_URG:
__kill_fasync(sock->fasync_list, SIGURG, band);
}
return 0;
}
static inline void sk_wake_async(struct sock *sk, int how, int band)
{
if (sk->sk_socket && sk->sk_socket->fasync_list)
sock_wake_async(sk->sk_socket, how, band);
}
how
enum {
	SOCK_WAKE_IO,    /* no check; send SIGIO to the waiting processes directly */
	SOCK_WAKE_WAITD, /* send SIGIO only if no process is already blocked waiting for data (SOCK_ASYNC_WAITDATA clear) */
	SOCK_WAKE_SPACE, /* send SIGIO only if the send queue had reached its limit (SOCK_ASYNC_NOSPACE was set) */
	SOCK_WAKE_URG,   /* send SIGURG to the waiting processes */
};
band
/*
* SIGPOLL si_codes
*/
#define POLL_IN (__SI_POLL|1)/* data input available */
#define POLL_OUT (__SI_POLL|2)/* output buffers available */
#define POLL_MSG (__SI_POLL|3)/* input message available */
#define POLL_ERR (__SI_POLL|4)/* i/o error */
#define POLL_PRI (__SI_POLL|5)/* high priority input available */
#define POLL_HUP (__SI_POLL|6)/* device disconnected */
sock_def_wakeup()
Wakes up the processes sleeping on the transport control block's sk_sleep queue; it is the control block's default function for waking waiters on the socket. It is installed on the control block's sk_state_change interface and is usually called when the control block's state changes.
/*
* Default Socket Callbacks
*/
static void sock_def_wakeup(struct sock *sk)
{
read_lock(&sk->sk_callback_lock);
if (sk_has_sleeper(sk))
wake_up_interruptible_all(sk->sk_sleep);
read_unlock(&sk->sk_callback_lock);
}
Notifying processes after a FIN segment is received
There are other places in TCP where the processes on a socket's fasync_list queue are notified. For example, when TCP receives a FIN segment and the socket is not in the DEAD state, the processes waiting on the socket are woken. If both the send and receive directions have been shut down, or the transport control block is in the CLOSE state, the processes asynchronously waiting on the socket are notified with POLL_HUP that the connection has terminated; otherwise they are notified with POLL_IN that input is available (a read will observe the end of data).
/*
* Process the FIN bit. This now behaves as it is supposed to work
* and the FIN takes effect when it is validly part of sequence
* space. Not before when we get holes.
*
* If we are ESTABLISHED, a received fin moves us to CLOSE-WAIT
* (and thence onto LAST-ACK and finally, CLOSE, we never enter
* TIME-WAIT)
*
* If we are in FINWAIT-1, a received FIN indicates simultaneous
* close and we go into CLOSING (and later onto TIME-WAIT)
*
* If we are in FINWAIT-2, a received FIN moves us to TIME-WAIT.
*/
static void tcp_fin(struct sk_buff *skb, struct sock *sk, struct tcphdr *th)
{
struct tcp_sock *tp = tcp_sk(sk);
inet_csk_schedule_ack(sk);
sk->sk_shutdown |= RCV_SHUTDOWN;
sock_set_flag(sk, SOCK_DONE);
switch (sk->sk_state) {
case TCP_SYN_RECV:
case TCP_ESTABLISHED:
/* Move to CLOSE_WAIT */
tcp_set_state(sk, TCP_CLOSE_WAIT);
inet_csk(sk)->icsk_ack.pingpong = 1;
break;
case TCP_CLOSE_WAIT:
case TCP_CLOSING:
/* Received a retransmission of the FIN, do
* nothing.
*/
break;
case TCP_LAST_ACK:
/* RFC793: Remain in the LAST-ACK state. */
break;
case TCP_FIN_WAIT1:
/* This case occurs when a simultaneous close
* happens, we must ack the received FIN and
* enter the CLOSING state.
*/
tcp_send_ack(sk);
tcp_set_state(sk, TCP_CLOSING);
break;
case TCP_FIN_WAIT2:
/* Received a FIN -- send ACK and enter TIME_WAIT. */
tcp_send_ack(sk);
tcp_time_wait(sk, TCP_TIME_WAIT, 0);
break;
default:
/* Only TCP_LISTEN and TCP_CLOSE are left, in these
* cases we should never reach this piece of code.
*/
printk(KERN_ERR "%s: Impossible, sk->sk_state=%d\n",
__func__, sk->sk_state);
break;
}
/* It _is_ possible, that we have something out-of-order _after_ FIN.
* Probably, we should reset in this case. For now drop them.
*/
__skb_queue_purge(&tp->out_of_order_queue);
if (tcp_is_sack(tp))
tcp_sack_reset(&tp->rx_opt);
sk_mem_reclaim(sk);
if (!sock_flag(sk, SOCK_DEAD)) {
sk->sk_state_change(sk);
/* Do not send POLL_HUP for half duplex close. */
if (sk->sk_shutdown == SHUTDOWN_MASK ||
sk->sk_state == TCP_CLOSE)
sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_HUP);
else
sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);
}
}
sock_fasync()
Implements the update operations that add entries to, or remove entries from, a socket's asynchronous-notification queue. Because it is called from process context while the list is also used from softirq context, locking is required when the list is accessed: the socket lock is taken on the socket, and sk_callback_lock is taken on the transport control block.
/*
* Update the socket async list
*
* Fasync_list locking strategy.
*
* 1. fasync_list is modified only under process context socket lock
* i.e. under semaphore.
* 2. fasync_list is used under read_lock(&sk->sk_callback_lock)
* or under socket lock.
* 3. fasync_list can be used from softirq context, so that
* modification under socket lock have to be enhanced with
* write_lock_bh(&sk->sk_callback_lock).
* --ANK (990710)
*/
static int sock_fasync(int fd, struct file *filp, int on)
{
struct fasync_struct *fa, *fna = NULL, **prev;
struct socket *sock;
struct sock *sk;
if (on) {
fna = kmalloc(sizeof(struct fasync_struct), GFP_KERNEL);
if (fna == NULL)
return -ENOMEM;
}
sock = filp->private_data;
sk = sock->sk;
if (sk == NULL) {
kfree(fna);
return -EINVAL;
}
lock_sock(sk);
spin_lock(&filp->f_lock);
if (on)
filp->f_flags |= FASYNC;
else
filp->f_flags &= ~FASYNC;
spin_unlock(&filp->f_lock);
prev = &(sock->fasync_list);
for (fa = *prev; fa != NULL; prev = &fa->fa_next, fa = *prev)
if (fa->fa_file == filp)
break;
if (on) {
if (fa != NULL) {
write_lock_bh(&sk->sk_callback_lock);
fa->fa_fd = fd;
write_unlock_bh(&sk->sk_callback_lock);
kfree(fna);
goto out;
}
fna->fa_file = filp;
fna->fa_fd = fd;
fna->magic = FASYNC_MAGIC;
fna->fa_next = sock->fasync_list;
write_lock_bh(&sk->sk_callback_lock);
sock->fasync_list = fna;
write_unlock_bh(&sk->sk_callback_lock);
} else {
if (fa != NULL) {
write_lock_bh(&sk->sk_callback_lock);
*prev = fa->fa_next;
write_unlock_bh(&sk->sk_callback_lock);
kfree(fa);
}
}
out:
release_sock(sock->sk);
return 0;
}