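Most readers are probably somewhat familiar with I/O multiplexing in the Linux kernel: select, poll and epoll, each usually regarded as a step up from the one before. This article gives a brief walkthrough of how epoll is implemented in the kernel. The kernel version referenced throughout is 4.20.11. Note that the "e" in epoll stands for "event", not "evolution": the two have no necessary connection, and epoll is not simply a patched-up poll. Together with select, all three let a single thread emulate the handling of many tasks. Well-known software such as redis and nginx uses epoll; interested readers can look up the details on Google or Bing.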

Introduction

In Linux kernel 4.20.11, include/linux/syscalls.h contains the following declaration.

asmlinkage long sys_epoll_create(int size);

In user space it looks like this (see man epoll_create):

#include <sys/epoll.h>
int epoll_create(int size);
int epoll_create1(int flags);

As the man page says, the size argument has been ignored since Linux 2.6.8:

epoll_create() creates a new epoll(7) instance.  Since Linux 2.6.8,
the size argument is ignored, but must be greater than zero; see
NOTES.
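
The behaviour quoted from the man page can be checked from user space. The following is a minimal sketch; the helper name try_epoll_create is made up for this illustration and is not part of any API.

```c
#include <assert.h>
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 0 if epoll_create(size) succeeds,
 * otherwise the errno value reported by the failed call.
 */
int try_epoll_create(int size)
{
    int epfd = epoll_create(size);

    if (epfd < 0)
        return errno;
    close(epfd);   /* the instance is only needed for the check */
    return 0;
}
```

Under these assumptions, try_epoll_create(0) and try_epoll_create(-1) should report EINVAL, while any positive value (1 or 1024 alike) should succeed, since the value itself is ignored.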

User-space server-client epoll implementation

Here is a socket server-client (S-C) implementation built on epoll:

https://rtoax.blog.csdn.net/article/details/81047943

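Tracing epoll in the kernel

Because of server-client examples like the one above, many people assume that epoll belongs to the kernel's networking code. It does not: it belongs to the file system layer. Do not forget that epoll_create returns an fd (file descriptor). Now look at where the source file lives: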

\linux-4.20.11\fs\eventpoll.c
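
One way to see the file-system nature of an epoll instance from user space is to ask procfs what the descriptor points to. This is only a sketch: it assumes /proc is mounted, and the helper name describe_epoll_fd is invented for the example. On typical Linux systems the link target is expected to mention "eventpoll", because the instance is backed by an anonymous inode.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Hypothetical helper: creates an epoll instance and copies the target of
 * /proc/self/fd/<epfd> into buf. Returns 0 on success, -1 on failure.
 */
int describe_epoll_fd(char *buf, size_t buflen)
{
    char path[64];
    ssize_t n;
    int epfd;

    if (buflen == 0)
        return -1;
    epfd = epoll_create1(0);
    if (epfd < 0)
        return -1;

    snprintf(path, sizeof(path), "/proc/self/fd/%d", epfd);
    n = readlink(path, buf, buflen - 1);   /* readlink does not NUL-terminate */
    close(epfd);
    if (n < 0)
        return -1;
    buf[n] = '\0';
    return 0;
}
```

Calling it with a 128-byte buffer and printing the result is enough to inspect the descriptor; no socket is involved at any point.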

For the user-space epoll_create, the corresponding system calls in that file are

SYSCALL_DEFINE1(epoll_create1, int, flags)
{
  return do_epoll_create(flags);
}
SYSCALL_DEFINE1(epoll_create, int, size)
{
  if (size <= 0)
    return -EINVAL;
  return do_epoll_create(0);
}

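As expected, the size argument is not used beyond the check that it is greater than zero.

We have now found the function that the epoll_create system calls hand off to; its prototype is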

/*
 * Open an eventpoll file descriptor.
 */
static int do_epoll_create(int flags);
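
Since epoll_create1 passes its flags straight through, their effect is visible on the returned descriptor. The sketch below relies on the documented behaviour of epoll_create1(2), where EPOLL_CLOEXEC is the only valid flag and anything else yields EINVAL. The helper name epoll_fd_is_cloexec is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Hypothetical helper: creates an epoll instance with the given flags.
 * Returns 1 if the new descriptor has FD_CLOEXEC set, 0 if it does not,
 * and a negative errno value if creation or the query failed.
 */
int epoll_fd_is_cloexec(int flags)
{
    int fdflags, saved_errno;
    int epfd = epoll_create1(flags);

    if (epfd < 0)
        return -errno;

    fdflags = fcntl(epfd, F_GETFD);   /* an epoll fd answers fcntl like any other fd */
    saved_errno = errno;
    close(epfd);
    if (fdflags < 0)
        return -saved_errno;
    return (fdflags & FD_CLOEXEC) ? 1 : 0;
}
```

With EPOLL_CLOEXEC the result should be 1, with 0 it should be 0, and an unsupported flag value such as 1 should come back as -EINVAL.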

epoll_create in the kernel

Naturally, we start with the structures (only the key declarations are kept, for brevity).

struct file;
struct eventpoll;
struct epitem;
struct epoll_event;
struct eppoll_entry;
struct ep_pqueue;
struct epoll_filefd;

The simplified flow is shown in the figure below.

epoll_ctl in the kernel

The system call prototype is

SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
    struct epoll_event __user *, event);

Now look at this piece of code

  switch (op) {
  case EPOLL_CTL_ADD:
    if (!epi) {
      epds.events |= EPOLLERR | EPOLLHUP;
      error = ep_insert(ep, &epds, tf.file, fd, full_check);
    } else
      error = -EEXIST;
    if (full_check)
      clear_tfile_check_list();
    break;
  case EPOLL_CTL_DEL:
    if (epi)
      error = ep_remove(ep, epi);
    else
      error = -ENOENT;
    break;
  case EPOLL_CTL_MOD:
    if (epi) {
      if (!(epi->event.events & EPOLLEXCLUSIVE)) {
        epds.events |= EPOLLERR | EPOLLHUP;
        error = ep_modify(ep, epi, &epds);
      }
    } else
      error = -ENOENT;
    break;
  }
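
The error paths of this switch are observable from user space. The following sketch exercises them on the read end of a pipe; the function names ctl_errno and epoll_ctl_error_demo are made up for the illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: runs epoll_ctl and returns 0 or the errno value. */
int ctl_errno(int epfd, int op, int fd, uint32_t events)
{
    struct epoll_event ev = { .events = events, .data.fd = fd };

    return epoll_ctl(epfd, op, fd, &ev) == 0 ? 0 : errno;
}

/*
 * Runs MOD, ADD, ADD, DEL, DEL on the same descriptor and stores the
 * outcome of each call in results[0..4]. Returns 0 if setup succeeded.
 */
int epoll_ctl_error_demo(int results[5])
{
    int pipefd[2];
    int epfd = epoll_create1(0);

    if (epfd < 0)
        return -1;
    if (pipe(pipefd) < 0) {
        close(epfd);
        return -1;
    }

    results[0] = ctl_errno(epfd, EPOLL_CTL_MOD, pipefd[0], EPOLLIN); /* not registered yet */
    results[1] = ctl_errno(epfd, EPOLL_CTL_ADD, pipefd[0], EPOLLIN); /* first ADD */
    results[2] = ctl_errno(epfd, EPOLL_CTL_ADD, pipefd[0], EPOLLIN); /* duplicate ADD */
    results[3] = ctl_errno(epfd, EPOLL_CTL_DEL, pipefd[0], 0);       /* remove it */
    results[4] = ctl_errno(epfd, EPOLL_CTL_DEL, pipefd[0], 0);       /* already gone */

    close(pipefd[0]);
    close(pipefd[1]);
    close(epfd);
    return 0;
}
```

Matching the branches above, the expected sequence is ENOENT, 0, EEXIST, 0, ENOENT.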

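This code shows how the ADD, DEL and MOD operations are dispatched.

First, take a look at the key global variables that epoll_ctl relies on.

Then, in ep_insert, the slab allocation starts. Note this statement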

if (!(epi = kmem_cache_alloc(epi_cache, GFP_KERNEL)))
    return -ENOMEM;

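kmem_cache_alloc involves the slab allocator, which is not detailed here; its job is to allocate one fixed-size object from the cache epi_cache. The object is of course a struct epitem *epi: one epoll item, which corresponds to one monitored fd (file descriptor). Next, a series of list heads is initialised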

/* Item initialization follow here ... */
  INIT_LIST_HEAD(&epi->rdllink);
  INIT_LIST_HEAD(&epi->fllink);
  INIT_LIST_HEAD(&epi->pwqlist);

Set the file and fd, which together serve as the key for the red-black tree node.

/* Setup the structure that is used as key for the RB tree */
static inline void ep_set_ffd(struct epoll_filefd *ffd,
            struct file *file, int fd)
{
  ffd->file = file;
  ffd->fd = fd;
}
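
Because the key is the pair (file, fd) rather than the file alone, a duplicated descriptor counts as a different entry even though it refers to the same open file. The epoll(7) man page describes this behaviour; the sketch below demonstrates it with dup(2). The function name epoll_dup_key_demo is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Adds a pipe's read end, then a dup() of it, then the original again.
 * results[0..2] receive 0 or the errno value of each EPOLL_CTL_ADD.
 * Returns 0 if setup succeeded, -1 otherwise.
 */
int epoll_dup_key_demo(int results[3])
{
    int pipefd[2], dupfd, i;
    int fds[3];
    int epfd = epoll_create1(0);

    if (epfd < 0)
        return -1;
    if (pipe(pipefd) < 0) {
        close(epfd);
        return -1;
    }
    dupfd = dup(pipefd[0]);          /* same struct file, different fd number */
    if (dupfd < 0) {
        close(pipefd[0]);
        close(pipefd[1]);
        close(epfd);
        return -1;
    }

    fds[0] = pipefd[0];              /* new key: accepted */
    fds[1] = dupfd;                  /* same file, new fd: also a new key */
    fds[2] = pipefd[0];              /* identical key: rejected */
    for (i = 0; i < 3; i++) {
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = fds[i] };

        results[i] = epoll_ctl(epfd, EPOLL_CTL_ADD, fds[i], &ev) == 0 ? 0 : errno;
    }

    close(dupfd);
    close(pipefd[0]);
    close(pipefd[1]);
    close(epfd);
    return 0;
}
```

The first two additions are expected to succeed and the third to fail with EEXIST.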

Insert the fllink node into the doubly linked list whose head is tfile's f_ep_links.

list_add_tail_rcu(&epi->fllink, &tfile->f_ep_links);

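Insert the item into the red-black tree (a red-black tree is a self-balancing binary search tree used for fast lookup: finding a key takes O(log n) comparisons, so even with many thousands of registered descriptors a lookup visits at most a few dozen nodes)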

/*
   * Add the current item to the RB tree. All RB tree operations are
   * protected by "mtx", and ep_insert() is called with "mtx" held.
   */
  ep_rbtree_insert(ep, epi);

The reverse operation, remove

/*
 * Removes a "struct epitem" from the eventpoll RB tree and deallocates
 * all the associated resources. Must be called with "mtx" held.
 */
static int ep_remove(struct eventpoll *ep, struct epitem *epi)

Naturally, the item is removed from the lists

list_del_rcu(&epi->fllink);
list_del_init(&epi->rdllink);

and its node is erased from the red-black tree

rb_erase_cached(&epi->rbn, &ep->rbr);

Releasing the item goes through RCU (Read-Copy-Update), so the memory is freed only after concurrent readers are done with it

call_rcu(&epi->rcu, epi_rcu_free);

With ADD and DEL covered, MOD is easy to explain: it is just a lookup plus a modification.

/*
 * Modify the interest event mask by dropping an event if the new mask
 * has a match in the current file status. Must be called with "mtx" held.
 */
static int ep_modify(struct eventpoll *ep, struct epitem *epi,
         const struct epoll_event *event)

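Then, if the file is already ready under the new event mask and the item is not yet on the ready list, it is appended to the tail of the ready list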

list_add_tail(&epi->rdllink, &ep->rdllist);
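
The effect of this step can be observed from user space: changing the event mask of a descriptor that is already readable makes it show up in the next epoll_wait without any new write. This is a minimal sketch; the function name epoll_mod_demo is invented for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Registers the read end of a pipe that already holds one byte, first with
 * an empty event mask and then, via EPOLL_CTL_MOD, with EPOLLIN.
 * results[0] and results[1] receive the epoll_wait return values before
 * and after the modification. Returns 0 on success, -1 if a step failed.
 */
int epoll_mod_demo(int results[2])
{
    int pipefd[2];
    int rc = -1;
    int epfd = epoll_create1(0);

    if (epfd < 0)
        return -1;
    if (pipe(pipefd) < 0) {
        close(epfd);
        return -1;
    }
    struct epoll_event ev = { .events = 0, .data.fd = pipefd[0] };
    struct epoll_event out;

    if (write(pipefd[1], "x", 1) != 1)                 /* data is pending from now on */
        goto done;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) < 0)
        goto done;
    results[0] = epoll_wait(epfd, &out, 1, 0);         /* EPOLLIN is not requested yet */

    ev.events = EPOLLIN;
    if (epoll_ctl(epfd, EPOLL_CTL_MOD, pipefd[0], &ev) < 0)
        goto done;
    results[1] = epoll_wait(epfd, &out, 1, 0);         /* the pending byte is reported */
    rc = 0;
done:
    close(pipefd[0]);
    close(pipefd[1]);
    close(epfd);
    return rc;
}
```

The expected outcome is 0 events before the modification and 1 event after it.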

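The figure below shows the structures involved in epoll create and ctl.

epoll_wait in the kernel

There are two wait variants; the user-space interfaces are epoll_wait and epoll_pwait. They differ only in that epoll_pwait also takes a signal mask, which is not covered here. We use epoll_wait as the example.

The epoll_wait system call is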

SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
    int, maxevents, int, timeout)
{
  return do_epoll_wait(epfd, events, maxevents, timeout);
}

The prototype of do_epoll_wait is

/*
 * Implement the event wait interface for the eventpoll file. It is the kernel
 * part of the user space epoll_wait(2).
 */
static int do_epoll_wait(int epfd, struct epoll_event __user *events,
       int maxevents, int timeout);

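We mainly care about what the function does. Note that a wait_queue is used here. It can be loosely understood like this: one side blocks in read; when the other side writes, the blocked read is woken up and reads the buffer. This is only an analogy, and the real details are left to the reader. Its data type is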

wait_queue_entry_t wait;

For wait, we trace only the key function calls. First comes this check for whether any event has occurred

/* Is it worth to try to dig for events ? */
  eavail = ep_events_available(ep);

and then this part, for when events really have occurred

/*
   * Try to transfer events to user space. In case we get 0 events and
   * there's still timeout left over, we go trying again in search of
   * more luck.
   */
  if (!res && eavail &&
      !(res = ep_send_events(ep, events, maxevents)) && !timed_out)
    goto fetch_events;
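
Seen from user space, the availability check and the timeout behave as in the following sketch: with nothing ready and a zero timeout the call returns 0 at once, and after a write the same call delivers the event. The function name epoll_wait_demo is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Waits on an empty pipe, writes one byte, then waits again.
 * results[0] and results[1] receive the two epoll_wait return values and
 * *events_seen the event mask delivered by the second call.
 * Returns 0 on success, -1 if a step failed.
 */
int epoll_wait_demo(int results[2], uint32_t *events_seen)
{
    int pipefd[2];
    int rc = -1;
    int epfd = epoll_create1(0);

    if (epfd < 0)
        return -1;
    if (pipe(pipefd) < 0) {
        close(epfd);
        return -1;
    }
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = pipefd[0] };
    struct epoll_event out = { 0 };

    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) < 0)
        goto done;
    results[0] = epoll_wait(epfd, &out, 1, 0);     /* nothing ready, returns immediately */

    if (write(pipefd[1], "x", 1) != 1)
        goto done;
    results[1] = epoll_wait(epfd, &out, 1, 100);   /* up to 100 ms, but data is already there */
    *events_seen = out.events;
    rc = 0;
done:
    close(pipefd[0]);
    close(pipefd[1]);
    close(epfd);
    return rc;
}
```

The first call is expected to return 0 and the second 1, with EPOLLIN set in the delivered mask.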

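In passing: the macro current used in this function refers to the task that is executing this code in kernel mode, that is, its task_struct (see material on process management for details).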

Next we look at ep_send_events, which does nothing more than scan the ready list

static int ep_send_events(struct eventpoll *ep,
        struct epoll_event __user *events, int maxevents)
{
  struct ep_send_events_data esed;

  esed.maxevents = maxevents;
  esed.events = events;

  ep_scan_ready_list(ep, ep_send_events_proc, &esed, 0, false);
  return esed.res;
}

Inside ep_scan_ready_list, a list is initialised and all ready nodes are spliced onto it

LIST_HEAD(txlist);
list_splice_init(&ep->rdllist, &txlist);
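
To make the splice step concrete, here is a small user-space model of a circular doubly linked list with a splice-and-reinitialise operation. It is a sketch in the spirit of the kernel's list helpers, not the kernel code itself: all model_* names are invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Model of a circular doubly linked list head/node. */
struct model_list {
    struct model_list *next, *prev;
};

void model_list_init(struct model_list *h)
{
    h->next = h;
    h->prev = h;
}

int model_list_empty(const struct model_list *h)
{
    return h->next == h;
}

void model_list_add_tail(struct model_list *node, struct model_list *head)
{
    node->prev = head->prev;
    node->next = head;
    head->prev->next = node;
    head->prev = node;
}

/* Moves every node of 'from' to the front of 'to', then empties 'from'. */
void model_list_splice_init(struct model_list *from, struct model_list *to)
{
    struct model_list *first, *last;

    if (model_list_empty(from))
        return;
    first = from->next;
    last = from->prev;

    first->prev = to;
    last->next = to->next;
    to->next->prev = last;
    to->next = first;
    model_list_init(from);   /* the source list is usable again right away */
}

size_t model_list_length(const struct model_list *h)
{
    size_t n = 0;
    const struct model_list *p;

    for (p = h->next; p != h; p = p->next)
        n++;
    return n;
}
```

The point of the design is that the whole ready list changes hands in constant time: only a handful of pointers are rewritten, no matter how many nodes are ready, and the original list is immediately empty and ready to collect new events.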

Then the callback (proc) function is called

/*
   * Now call the callback function.
   */
  res = (*sproc)(ep, &txlist, priv);

This callback removes the ready nodes from the list. Its prototype is

static __poll_t ep_send_events_proc(struct eventpoll *ep, struct list_head *head,
             void *priv);
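
What happens to an item after it has been taken off the ready list depends on the trigger mode, and the difference is visible from user space. The sketch below assumes the usual semantics documented in epoll(7): a level-triggered item is reported again as long as data remains unread, while an edge-triggered item (EPOLLET) is reported once per change. The function name count_reports is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Registers a pipe read end that holds one unread byte with the given
 * event mask and calls epoll_wait twice without reading the byte.
 * Returns how many of the two calls reported an event, or -1 on failure.
 */
int count_reports(uint32_t events)
{
    int pipefd[2];
    int first, second;
    int rc = -1;
    int epfd = epoll_create1(0);

    if (epfd < 0)
        return -1;
    if (pipe(pipefd) < 0) {
        close(epfd);
        return -1;
    }
    struct epoll_event ev = { .events = events, .data.fd = pipefd[0] };
    struct epoll_event out;

    if (write(pipefd[1], "x", 1) != 1)
        goto done;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) < 0)
        goto done;

    first = epoll_wait(epfd, &out, 1, 0);    /* the byte is pending */
    second = epoll_wait(epfd, &out, 1, 0);   /* still unread, nothing new written */
    if (first >= 0 && second >= 0)
        rc = first + second;
done:
    close(pipefd[0]);
    close(pipefd[1]);
    close(epfd);
    return rc;
}
```

With EPOLLIN alone both calls are expected to report the event (result 2); with EPOLLIN | EPOLLET only the first one should (result 1).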

Then comes this piece of code, which runs as epoll_wait is about to return

if (!list_empty(&ep->rdllist)) {
    /*
     * Wake up (if active) both the eventpoll wait list and
     * the ->poll() wait list (delayed after we release the lock).
     */
    if (waitqueue_active(&ep->wq))
      wake_up_locked(&ep->wq);
    if (waitqueue_active(&ep->poll_wait))
      pwake++;
  }
  spin_unlock_irq(&ep->wq.lock);

  if (!ep_locked)
    mutex_unlock(&ep->mtx);

  /* We have to call this outside the lock */
  if (pwake)
    ep_poll_safewake(&ep->poll_wait);

This may involve some knowledge of interrupts, which this article does not cover either.

static void ep_poll_safewake(wait_queue_head_t *wq)
{
  int this_cpu = get_cpu();

  ep_call_nested(&poll_safewake_ncalls, EP_MAX_NESTS,
           ep_poll_wakeup_proc, NULL, wq, (void *) (long) this_cpu);

  put_cpu();
}
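
The poll_wait queue matters when an epoll descriptor is itself monitored, for example by another epoll instance. The following sketch nests one instance inside another and checks that activity on the inner one becomes visible on the outer one. The function name epoll_nested_demo is made up for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Registers a pipe read end with an inner epoll instance and the inner
 * instance with an outer one. results[0] and results[1] receive the return
 * values of epoll_wait on the outer instance before and after a write.
 * *inner_reported is set to 1 if the outer event refers to the inner fd.
 * Returns 0 on success, -1 if a step failed.
 */
int epoll_nested_demo(int results[2], int *inner_reported)
{
    int pipefd[2] = { -1, -1 };
    int rc = -1;
    int inner = epoll_create1(0);
    int outer = epoll_create1(0);
    struct epoll_event ev, out = { 0 };

    if (inner < 0 || outer < 0 || pipe(pipefd) < 0)
        goto done;

    ev.events = EPOLLIN;
    ev.data.fd = pipefd[0];
    if (epoll_ctl(inner, EPOLL_CTL_ADD, pipefd[0], &ev) < 0)
        goto done;
    ev.data.fd = inner;                       /* an epoll fd can be watched like any other */
    if (epoll_ctl(outer, EPOLL_CTL_ADD, inner, &ev) < 0)
        goto done;

    results[0] = epoll_wait(outer, &out, 1, 0);     /* inner has nothing ready */
    if (write(pipefd[1], "x", 1) != 1)
        goto done;
    results[1] = epoll_wait(outer, &out, 1, 100);   /* inner became readable */
    *inner_reported = (results[1] == 1 && out.data.fd == inner);
    rc = 0;
done:
    if (pipefd[0] >= 0) close(pipefd[0]);
    if (pipefd[1] >= 0) close(pipefd[1]);
    if (inner >= 0) close(inner);
    if (outer >= 0) close(outer);
    return rc;
}
```

The outer instance is expected to report nothing before the write and exactly one event, for the inner descriptor, after it.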

This article only gives a brief explanation of how epoll is implemented. Interested readers can continue with the source code.
