Redis's epoll Model

Recommended earlier related reading: Redis High Performance and epoll

In this article we look at the source code to get a simple understanding of how Redis uses epoll and of how epoll itself is implemented. A deliberately shallow dive~

If this article gets the following three things across, it has done its job:

1. epoll is a facility implemented by Linux, and it has only three core functions

2. epoll is efficient because it is built on a red-black tree, a doubly linked list, and an event callback mechanism

3. Redis's I/O multiplexing is implemented on Linux with epoll

epoll is a multiplexer provided by the Linux kernel. As usual, let's ask the Linux man page:

EPOLL(7) Linux Programmer’s Manual EPOLL(7)

NAME
epoll - I/O event notification facility

SYNOPSIS
#include <sys/epoll.h>

DESCRIPTION
The epoll API performs a similar task to poll(2): monitoring multiple file
descriptors to see if I/O is possible on any of them. The epoll API can
be used either as an edge-triggered or a level-triggered interface and
scales well to large numbers of watched file descriptors. The following
system calls are provided to create and manage an epoll instance:

   *  epoll_create(2) creates an epoll instance and returns a file descriptor
      referring to that instance.  (The more recent epoll_create1(2)  extends
      the functionality of epoll_create(2).)

   *  Interest   in  particular  file  descriptors  is  then  registered  via
      epoll_ctl(2).  The set of file descriptors currently registered  on  an
      epoll instance is sometimes called an epoll set.
   *  epoll_wait(2)  waits  for I/O events, blocking the calling thread if no
      events are currently available.

Core functions
The man page tells us epoll is defined in sys/epoll.h, and that there are three core functions (browse the kernel source online at
elixir.bootlin.com/linux/v4.19… )

epoll_create
int epoll_create(int size);

Core functionality:

Creates an epoll file descriptor
Creates the eventpoll, which contains the red-black-tree cache and the doubly linked list
The size parameter does not cap the number of file descriptors epoll can monitor; it is only a hint to the kernel for sizing the initial internal data structures. Since Linux 2.6.8 the parameter has been ignored, but it must still be greater than 0.

Calling epoll_create occupies an fd of its own; on Linux you can see it under /proc/$$/fd/. When you are done with the instance, release it with close().
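To make that concrete, here is a minimal sketch (not from the Redis source; the sleep is only there so you have time to inspect /proc) that creates an epoll instance and then releases it:

#include <stdio.h>
#include <unistd.h>
#include <sys/epoll.h>

int main(void) {
    /* Since Linux 2.6.8 the argument is only a hint; it just has to be > 0. */
    int epfd = epoll_create(1024);
    if (epfd == -1) {
        perror("epoll_create");
        return 1;
    }
    printf("epoll instance holds fd %d in pid %d\n", epfd, (int)getpid());
    sleep(30); /* meanwhile, `ls -l /proc/<pid>/fd` shows anon_inode:[eventpoll] */
    close(epfd); /* release the fd when done */
    return 0;
}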

A snippet of struct eventpoll:

struct eventpoll {
    /*
     * This mutex is used to ensure that files are not removed
     * while epoll is using them. This is held during the event
     * collection loop, the file cleanup path, the epoll file exit
     * code and the ctl operations.
     */
    struct mutex mtx;

    /* Wait queue used by sys_epoll_wait() */
    wait_queue_head_t wq;

    /* Wait queue used by file->poll() */
    wait_queue_head_t poll_wait;

    /* List of ready file descriptors */
    struct list_head rdllist; // ready list: a doubly linked list

    /* RB tree root used to store monitored fd structs */
    struct rb_root_cached rbr; // red-black tree holding the live, monitored fds

    /*
     * This is a single linked list that chains all the "struct epitem" that
     * happened while transferring ready events to userspace w/out
     * holding ->wq.lock.
     */
    struct epitem *ovflist;

    /* wakeup_source used when ep_scan_ready_list is running */
    struct wakeup_source *ws;

    /* The user that created the eventpoll descriptor */
    struct user_struct *user;

    struct file *file;

    /* used to optimize loop detection check */
    int visited;
    struct list_head visited_list_link;

#ifdef CONFIG_NET_RX_BUSY_POLL
    /* used to track busy poll napi_id */
    unsigned int napi_id;
#endif
};
epoll_ctl
int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);

Core functionality:

Performs the binding operation op on the given descriptor fd
Writes the fd into the red-black tree and registers a callback with the kernel
op is the operation type; the three macros EPOLL_CTL_ADD, EPOLL_CTL_DEL and EPOLL_CTL_MOD express adding, removing and modifying the watch on fd.
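As a short sketch (sock_fd is a hypothetical, already-open socket; epfd comes from epoll_create above), registering read interest looks like this:

#include <stdio.h>
#include <sys/epoll.h>

/* Register read interest for sock_fd on the epoll instance epfd.
 * Returns 0 on success, -1 on error. */
static int watch_readable(int epfd, int sock_fd) {
    struct epoll_event ev = {0};
    ev.events = EPOLLIN;   /* notify when sock_fd becomes readable */
    ev.data.fd = sock_fd;  /* the kernel hands this value back via epoll_wait */
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, sock_fd, &ev) == -1) {
        perror("epoll_ctl: EPOLL_CTL_ADD");
        return -1;
    }
    return 0;
}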

epoll_wait
int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);

Core functionality:

Fetches the I/O events that are ready on epfd
The events parameter receives the ready events, i.e. the set of events the caller is interested in. maxevents says how many entries events can hold and must be greater than 0; timeout is the timeout in milliseconds. epoll_wait blocks until some file descriptor fires an event, the call is interrupted by a signal handler, or the timeout expires. The return value is the number of fds that need handling.
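Putting the three calls together, here is a minimal sketch of a user-space wait loop (MAX_EVENTS and handle_io are illustrative names, not from any real source):

#include <stdio.h>
#include <sys/epoll.h>

#define MAX_EVENTS 64

/* Placeholder handler: a real server would read()/write() on fd here. */
static void handle_io(int fd) {
    printf("fd %d is ready\n", fd);
}

/* Block on epoll_wait and dispatch every ready fd. epfd is an epoll
 * instance on which fds were registered as in the snippet above. */
static void event_loop(int epfd) {
    struct epoll_event events[MAX_EVENTS];
    for (;;) {
        int n = epoll_wait(epfd, events, MAX_EVENTS, 1000); /* 1 s timeout */
        if (n == -1) {
            perror("epoll_wait"); /* e.g. interrupted by a signal handler */
            break;
        }
        for (int i = 0; i < n; i++)
            handle_io(events[i].data.fd);
    }
}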

How it works
Build the cache (a red-black tree) and the pending-read list (a doubly linked list)
Bind events on the fds to be monitored (everything is an fd; see NIO 看破也說破(一)—— Linux/IO 基礎). When an event occurs, a callback moves the fd onto the pending-read list
Block and fetch the pending-read list
Execution flow
(figure: the epoll execution flow)
Advantages
The red-black tree epoll builds holds every monitored fd, with no size cap, and insert/delete/search are O(log N)
Callback-based: the kernel itself triggers the events of interest
The ready list is a doubly linked list, so taking events from it is O(1)
Every fd handed back to the application has real I/O pending; compared with select and poll, which must repeatedly poll and test every descriptor, this avoids useless memory copies
Into the Redis code
The full source is large, so we only look at the modules relevant to this article:

Event handling: ae.c / ae_epoll.c

Networking: anet.c and networking.c

Server side: server.c

(figure: how these Redis modules fit together)
Creating the event manager
initServer() at server.c L2702 is the startup entry point of the Redis server.

It first creates the aeEventLoop object: at L2743 it calls aeCreateEventLoop(), which initializes the table of registered (not yet fired) file events and the table of fired file events. The events pointer points at the former, the fired pointer at the latter.

aeEventLoop *aeCreateEventLoop(int setsize) {
    aeEventLoop *eventLoop;
    int i;

    if ((eventLoop = zmalloc(sizeof(*eventLoop))) == NULL) goto err;
    eventLoop->events = zmalloc(sizeof(aeFileEvent)*setsize);
    eventLoop->fired = zmalloc(sizeof(aeFiredEvent)*setsize);
    if (eventLoop->events == NULL || eventLoop->fired == NULL) goto err;
    eventLoop->setsize = setsize;
    eventLoop->lastTime = time(NULL);
    eventLoop->timeEventHead = NULL;
    eventLoop->timeEventNextId = 0;
    eventLoop->stop = 0;
    eventLoop->maxfd = -1;
    eventLoop->beforesleep = NULL;
    eventLoop->aftersleep = NULL;
    eventLoop->flags = 0;
    if (aeApiCreate(eventLoop) == -1) goto err;
    /* Events with mask == AE_NONE are not set. So let's initialize the
     * vector with it. */
    for (i = 0; i < setsize; i++)
        eventLoop->events[i].mask = AE_NONE;
    return eventLoop;

err:
    if (eventLoop) {
        zfree(eventLoop->events);
        zfree(eventLoop->fired);
        zfree(eventLoop);
    }
    return NULL;
}
aeApiCreate, called at ae_epoll.c L39, first creates the aeApiState object and initializes the epoll ready-event table; it then calls epoll_create to create the epoll instance, and finally stores the aeApiState in the event loop's apidata field.

static int aeApiCreate(aeEventLoop *eventLoop) {
    aeApiState *state = zmalloc(sizeof(aeApiState));

    if (!state) return -1;
    state->events = zmalloc(sizeof(struct epoll_event)*eventLoop->setsize);
    if (!state->events) {
        zfree(state);
        return -1;
    }
    state->epfd = epoll_create(1024); /* 1024 is just a hint for the kernel */
    if (state->epfd == -1) {
        zfree(state->events);
        zfree(state);
        return -1;
    }
    eventLoop->apidata = state;
    return 0;
}
Binding events
aeFileEvent is the file event structure: every concrete event carries a read handler and a write handler. Redis calls aeCreateFileEvent to register the matching file event for the read and write events of each socket.

/* File event structure */
typedef struct aeFileEvent {
    int mask; /* one of AE_(READABLE|WRITABLE|BARRIER) */
    aeFileProc *rfileProc; // read handler
    aeFileProc *wfileProc; // write handler
    void *clientData;
} aeFileEvent;
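Registration goes through ae.c's aeCreateFileEvent, which fills this structure and forwards to the multiplexer backend. The sketch below paraphrases it; consult ae.c for the exact version:

int aeCreateFileEvent(aeEventLoop *eventLoop, int fd, int mask,
        aeFileProc *proc, void *clientData)
{
    if (fd >= eventLoop->setsize) {
        errno = ERANGE;
        return AE_ERR;
    }
    aeFileEvent *fe = &eventLoop->events[fd];

    /* Hand the fd to the backend (epoll on Linux) first. */
    if (aeApiAddEvent(eventLoop, fd, mask) == -1)
        return AE_ERR;
    fe->mask |= mask;
    if (mask & AE_READABLE) fe->rfileProc = proc;
    if (mask & AE_WRITABLE) fe->wfileProc = proc;
    fe->clientData = clientData;
    if (fd > eventLoop->maxfd)
        eventLoop->maxfd = fd;
    return AE_OK;
}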
Redis calls aeCreateFileEvent at server.c L2848 to create the file event, which in turn executes aeApiAddEvent at ae_epoll.c L73:

static int aeApiAddEvent(aeEventLoop *eventLoop, int fd, int mask) {
    aeApiState *state = eventLoop->apidata;
    struct epoll_event ee = {0}; /* avoid valgrind warning */
    /* If the fd was already monitored for some event, we need a MOD
     * operation. Otherwise we need an ADD operation. */
    int op = eventLoop->events[fd].mask == AE_NONE ?
            EPOLL_CTL_ADD : EPOLL_CTL_MOD;

    ee.events = 0;
    mask |= eventLoop->events[fd].mask; /* Merge old events */
    if (mask & AE_READABLE) ee.events |= EPOLLIN;
    if (mask & AE_WRITABLE) ee.events |= EPOLLOUT;
    ee.data.fd = fd;
    if (epoll_ctl(state->epfd,op,fd,&ee) == -1) return -1;
    return 0;
}
aeApiAddEvent calls the system's epoll_ctl to register the event.
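For the reverse direction, ae_epoll.c also implements removal. The sketch below paraphrases its aeApiDelEvent: while some events remain watched it downgrades the registration with EPOLL_CTL_MOD, otherwise it removes the fd with EPOLL_CTL_DEL (again, see the source for the exact version):

static void aeApiDelEvent(aeEventLoop *eventLoop, int fd, int delmask) {
    aeApiState *state = eventLoop->apidata;
    struct epoll_event ee = {0}; /* avoid valgrind warning */
    int mask = eventLoop->events[fd].mask & (~delmask);

    ee.events = 0;
    if (mask & AE_READABLE) ee.events |= EPOLLIN;
    if (mask & AE_WRITABLE) ee.events |= EPOLLOUT;
    ee.data.fd = fd;
    if (mask != AE_NONE) {
        /* Some events are still watched: downgrade with MOD. */
        epoll_ctl(state->epfd,EPOLL_CTL_MOD,fd,&ee);
    } else {
        /* Nothing left to watch: remove the fd entirely. */
        epoll_ctl(state->epfd,EPOLL_CTL_DEL,fd,&ee);
    }
}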

Processing events
The third line from the bottom of server.c calls aeMain:

void aeMain(aeEventLoop *eventLoop) {
    eventLoop->stop = 0;
    while (!eventLoop->stop) {
        if (eventLoop->beforesleep != NULL)
            eventLoop->beforesleep(eventLoop);
        aeProcessEvents(eventLoop, AE_ALL_EVENTS|AE_CALL_AFTER_SLEEP);
    }
}
aeProcessEvents handles both time events and file events; at ae.c L433 it calls aeApiPoll, whose concrete implementation is at ae_epoll.c L108:

static int aeApiPoll(aeEventLoop *eventLoop, struct timeval *tvp) {
    aeApiState *state = eventLoop->apidata;
    int retval, numevents = 0;

    retval = epoll_wait(state->epfd,state->events,eventLoop->setsize,
            tvp ? (tvp->tv_sec*1000 + tvp->tv_usec/1000) : -1);
    if (retval > 0) {
        int j;

        numevents = retval;
        for (j = 0; j < numevents; j++) {
            int mask = 0;
            struct epoll_event *e = state->events+j;

            if (e->events & EPOLLIN) mask |= AE_READABLE;
            if (e->events & EPOLLOUT) mask |= AE_WRITABLE;
            if (e->events & EPOLLERR) mask |= AE_WRITABLE|AE_READABLE;
            if (e->events & EPOLLHUP) mask |= AE_WRITABLE|AE_READABLE;
            eventLoop->fired[j].fd = e->data.fd;
            eventLoop->fired[j].mask = mask;
        }
    }
    return numevents;
}
It calls epoll_wait and blocks waiting for epoll events to become ready; the timeout is the one computed earlier from the nearest pending time event. The ready epoll events are then translated into the fired table of ready events. aeApiPoll is exactly the I/O multiplexing program described above.
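To close the loop, aeProcessEvents then walks the fired table and invokes the handlers registered earlier via aeCreateFileEvent. A simplified sketch of that dispatch (omitting details such as AE_BARRIER handling; see ae.c for the real loop):

/* Simplified from ae.c's aeProcessEvents: for each fired event, look up
 * the registered aeFileEvent and call its read/write handlers. */
for (j = 0; j < numevents; j++) {
    aeFileEvent *fe = &eventLoop->events[eventLoop->fired[j].fd];
    int mask = eventLoop->fired[j].mask;
    int fd = eventLoop->fired[j].fd;

    if (fe->mask & mask & AE_READABLE)
        fe->rfileProc(eventLoop, fd, fe->clientData, mask);
    if (fe->mask & mask & AE_WRITABLE)
        fe->wfileProc(eventLoop, fd, fe->clientData, mask);
}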

Conclusion
epoll_create creates the eventpoll: the red-black tree plus the ready list
epoll_ctl binds events; when an event fires, the fd is put on the ready list
epoll_wait reads the ready list
