Twemproxy Source Code Walkthrough (5): Event Handling

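Overview

Twemproxy's I/O multiplexing is designed to be cross-platform: each platform uses its own mechanism, for example epoll on Linux and kqueue on FreeBSD. The implementations all live in the event directory, and every one of them exposes the same interface (event/nc_event.h):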

struct event_base *event_base_create(int size, event_cb_t cb);
void event_base_destroy(struct event_base *evb);

int event_add_in(struct event_base *evb, struct conn *c);
int event_del_in(struct event_base *evb, struct conn *c);
int event_add_out(struct event_base *evb, struct conn *c);
int event_del_out(struct event_base *evb, struct conn *c);
int event_add_conn(struct event_base *evb, struct conn *c);
int event_del_conn(struct event_base *evb, struct conn *c);
int event_wait(struct event_base *evb, int timeout);
void event_loop_stats(event_stats_cb_t cb, void *arg);

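This way, callers on different platforms do not need to care about the differences between the underlying I/O multiplexing mechanisms.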
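To make the idea of one interface over several backends concrete, here is a minimal sketch of an event base built on poll(). Every demo_ name is hypothetical and the layout is an assumption for illustration; it is not twemproxy's struct event_base, and twemproxy's own backends use epoll, kqueue and so on rather than poll().

```c
#include <assert.h>
#include <poll.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>   /* pipe(), write() for the usage example */

#define DEMO_EVENT_READ  0x1u
#define DEMO_EVENT_WRITE 0x2u

/* One callback receives every event, together with a user argument. */
typedef int (*demo_event_cb_t)(void *arg, uint32_t events);

/* Hypothetical event base backed by poll(). */
struct demo_event_base {
    struct pollfd   *fds;   /* registered descriptors */
    void           **args;  /* user argument per descriptor */
    int              nfds;  /* number of registered descriptors */
    int              size;  /* capacity */
    demo_event_cb_t  cb;    /* callback shared by all descriptors */
};

struct demo_event_base *
demo_event_base_create(int size, demo_event_cb_t cb)
{
    struct demo_event_base *evb;

    assert(size > 0 && cb != NULL);
    evb = calloc(1, sizeof(*evb));
    if (evb == NULL) {
        return NULL;
    }
    evb->fds = calloc((size_t)size, sizeof(*evb->fds));
    evb->args = calloc((size_t)size, sizeof(*evb->args));
    if (evb->fds == NULL || evb->args == NULL) {
        free(evb->fds);
        free(evb->args);
        free(evb);
        return NULL;
    }
    evb->size = size;
    evb->cb = cb;
    return evb;
}

void
demo_event_base_destroy(struct demo_event_base *evb)
{
    if (evb == NULL) {
        return;
    }
    free(evb->fds);
    free(evb->args);
    free(evb);
}

/* Register fd for read events; returns 0 on success, -1 if the base is full. */
int
demo_event_add_in(struct demo_event_base *evb, int fd, void *arg)
{
    if (evb->nfds == evb->size) {
        return -1;
    }
    evb->fds[evb->nfds].fd = fd;
    evb->fds[evb->nfds].events = POLLIN;
    evb->fds[evb->nfds].revents = 0;
    evb->args[evb->nfds] = arg;
    evb->nfds++;
    return 0;
}

/* Wait up to timeout msec and run the callback for each ready descriptor. */
int
demo_event_wait(struct demo_event_base *evb, int timeout)
{
    int i, n;

    n = poll(evb->fds, (nfds_t)evb->nfds, timeout);
    if (n <= 0) {
        return n;   /* 0 on timeout, -1 on error */
    }
    for (i = 0; i < evb->nfds; i++) {
        uint32_t events = 0;

        if (evb->fds[i].revents & POLLIN) {
            events |= DEMO_EVENT_READ;
        }
        if (evb->fds[i].revents & POLLOUT) {
            events |= DEMO_EVENT_WRITE;
        }
        if (events != 0) {
            evb->cb(evb->args[i], events);
        }
    }
    return n;
}

/* Example callback: counts read events in the int that arg points to. */
int
demo_count_cb(void *arg, uint32_t events)
{
    if (events & DEMO_EVENT_READ) {
        (*(int *)arg)++;
    }
    return 0;
}
```

A caller would create the base with demo_count_cb, register the read end of a pipe with demo_event_add_in, write one byte to the other end and call demo_event_wait; the callback then runs once. Swapping poll() for another mechanism would only change the bodies of these functions, not their callers.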

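Before going through network communication and the event-handling flow, it helps to know the format of the NC (nutcracker) configuration file and its options. A simple configuration looks like this: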

beta:
  listen: 127.0.0.1:22122
  hash: fnv1a_64
  hash_tag: "{}"
  distribution: ketama
  auto_eject_hosts: false
  timeout: 400 
  redis: true
  servers:
   - 127.0.0.1:6380:1 server1
   - 127.0.0.1:6381:1 server2
   - 127.0.0.1:6382:1 server3
   - 127.0.0.1:6383:1 server4

gamma:
  listen: 127.0.0.1:22123
  hash: fnv1a_64
  distribution: ketama
  timeout: 400 
  backlog: 1024
  preconnect: true
  auto_eject_hosts: true
  client_connections: 10000
  server_connections: 10000
  server_retry_timeout: 2000
  server_failure_limit: 3
  servers:
   - 127.0.0.1:11212:1
   - 127.0.0.1:11213:1

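Most of the parameters are self-explanatory:

listen: the IP:Port to listen on

hash: the hash function applied to the key of each command

distribution: the algorithm used to distribute keys (and hence commands) across the backend servers

timeout: how long to wait for a reply (in milliseconds)

redis: whether the backend servers are redis or memcached

backlog: the backlog argument of the socket API listen(), i.e. the maximum length of the queue of pending connections

preconnect: whether to connect to the backend servers (redis/memcached) ahead of time, i.e. right after startup

auto_eject_hosts: whether to automatically eject backend servers that do not respond

client_connections: maximum number of client connections

server_connections: maximum number of connections to each backend server

server_retry_timeout: how long to wait before retrying an ejected server (in milliseconds)

server_failure_limit: number of consecutive failures after which a server is ejected

servers: backend server information in the form IP:Port:Weight, optionally followed by a name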
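As an illustration of the servers entry format, the sketch below splits a line such as "127.0.0.1:6380:1 server1" into its fields with sscanf. The struct and function names are hypothetical, the rejection of a zero weight is an assumption of this sketch, and it only handles plain host names or IPv4 addresses; twemproxy's own configuration parser is more thorough.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical parsed form of one "servers" entry. */
struct demo_server_conf {
    char     host[64];
    unsigned port;
    unsigned weight;
    char     name[64];   /* optional; empty string when absent */
};

/* Parse "host:port:weight [name]"; returns 0 on success, -1 on error. */
int
demo_parse_server(const char *line, struct demo_server_conf *out)
{
    int n;

    assert(line != NULL && out != NULL);
    memset(out, 0, sizeof(*out));

    /* the name is optional, so 3 or 4 converted fields are acceptable */
    n = sscanf(line, "%63[^:]:%u:%u %63s",
               out->host, &out->port, &out->weight, out->name);
    if (n < 3) {
        return -1;
    }
    if (out->port == 0 || out->port > 65535 || out->weight == 0) {
        return -1;   /* assumption: zero port or weight is treated as invalid */
    }
    return 0;
}
```

With the sample configuration, the beta entries parse into host 127.0.0.1, ports 6380 to 6383, weight 1 and names server1 to server4, while the gamma entries parse with an empty name.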

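Data Structures

(1) Connection management: the connection pool (free_connq)

To maintain the connections between clients and the proxy and between the proxy and the servers, NC uses a doubly linked tail queue (TAILQ) to manage the connection queue of each client or server, which indirectly gives LRU behavior as well.

conn is one of the most important structures in twemproxy. A connection from a client to twemproxy, a connection from twemproxy to a backend server, and the TCP port the proxy itself listens on are all abstracted as a conn. A conn therefore comes in several kinds:

The TCP socket on which the proxy itself listens is a "proxy";

A connection from a client to the proxy is a "client": the proxy listens for connection requests from clients, and the file descriptor returned by the accept call is the client connection;

A connection from the proxy to a backend redis/memcached server is a "server".

Three fields of conn record this:

client: marks whether this conn is a client or a server.

proxy: marks whether this conn is a proxy.

owner: identifies the owner of this conn.

Across twemproxy there are three kinds of conn: client, server and proxy. When a conn is a proxy or a client, its owner is the server_pool; when it is a server, its owner is the server.

A proxy connection is not really a connection: it only listens for incoming client connections. When a client connects, a client connection is established once the three-way handshake completes, and the proxy keeps listening.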
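The classification above can be sketched in a few lines. Only the two type bits are reproduced, the demo_ names are hypothetical, and the sketch assumes that a proxy conn has client == 0 and proxy == 1, which the text implies but does not state outright.

```c
#include <assert.h>
#include <stddef.h>

/* Only the two type bits of a conn are modeled here. */
struct demo_conn_bits {
    unsigned client:1;   /* client? or server? */
    unsigned proxy:1;    /* proxy? */
};

enum demo_owner {
    DEMO_OWNER_SERVER_POOL,   /* proxy and client conns */
    DEMO_OWNER_SERVER         /* server conns */
};

/* Returns 'c' for client, 'p' for proxy, 's' for server. */
char
demo_conn_type(const struct demo_conn_bits *conn)
{
    assert(conn != NULL);
    if (conn->client) {
        return 'c';
    }
    return conn->proxy ? 'p' : 's';
}

/* A server conn is owned by a server; everything else by a server_pool. */
enum demo_owner
demo_conn_owner(const struct demo_conn_bits *conn)
{
    return demo_conn_type(conn) == 's' ? DEMO_OWNER_SERVER
                                       : DEMO_OWNER_SERVER_POOL;
}
```

Knowing the kind of a conn tells the caller how to interpret its void *owner pointer: as a struct server_pool * for 'c' and 'p', and as a struct server * for 's'.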

The struct conn structure represents a connection (nc_connection.h):

struct conn {
    TAILQ_ENTRY(conn)   conn_tqe;        /* link in server_pool / server / free q */
    void                *owner;          /* connection owner - server_pool / server */

    int                 sd;              /* socket descriptor */
    int                 family;          /* socket address family */
    socklen_t           addrlen;         /* socket length */
    struct sockaddr     *addr;           /* socket address (ref in server or server_pool) */

    struct msg_tqh      imsg_q;          /* incoming request Q */
    struct msg_tqh      omsg_q;          /* outstanding request Q */
    struct msg          *rmsg;           /* current message being rcvd */
    struct msg          *smsg;           /* current message being sent */

    conn_recv_t         recv;            /* recv (read) handler */
    conn_recv_next_t    recv_next;       /* recv next message handler */
    conn_recv_done_t    recv_done;       /* read done handler */
    conn_send_t         send;            /* send (write) handler */
    conn_send_next_t    send_next;       /* write next message handler */
    conn_send_done_t    send_done;       /* write done handler */
    conn_close_t        close;           /* close handler */
    conn_active_t       active;          /* active? handler */
    conn_post_connect_t post_connect;    /* post connect handler */
    conn_swallow_msg_t  swallow_msg;     /* react on messages to be swallowed */

    conn_ref_t          ref;             /* connection reference handler */
    conn_unref_t        unref;           /* connection unreference handler */

    conn_msgq_t         enqueue_inq;     /* connection inq msg enqueue handler */
    conn_msgq_t         dequeue_inq;     /* connection inq msg dequeue handler */
    conn_msgq_t         enqueue_outq;    /* connection outq msg enqueue handler */
    conn_msgq_t         dequeue_outq;    /* connection outq msg dequeue handler */

    size_t              recv_bytes;      /* received (read) bytes */
    size_t              send_bytes;      /* sent (written) bytes */

    uint32_t            events;          /* connection io events */
    err_t               err;             /* connection errno */
    unsigned            recv_active:1;   /* recv active? */
    unsigned            recv_ready:1;    /* recv ready? */
    unsigned            send_active:1;   /* send active? */
    unsigned            send_ready:1;    /* send ready? */

    unsigned            client:1;        /* client? or server? */
    unsigned            proxy:1;         /* proxy? */
    unsigned            connecting:1;    /* connecting? */
    unsigned            connected:1;     /* connected? */
    unsigned            eof:1;           /* eof? aka passive close? */
    unsigned            done:1;          /* done? aka close? */
    unsigned            redis:1;         /* redis? */
    unsigned            authenticated:1; /* authenticated? */
};

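The main parts are:

- Connections are kept in a doubly linked tail queue, so each conn stores links to its previous (tqe_prev) and next (tqe_next) elements; this is TAILQ_ENTRY conn_tqe;

- Socket-related fields: the descriptor, address family and address (sd/family/addrlen/addr);

- The messages being sent and received, and their queues;

- The various handler callbacks;

- Statistics: bytes received and sent;

- The events of interest (events);

- Various flags and state bits.

Whenever a new connection is needed (to a client or to a server), a free conn is taken from the connection pool.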
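A free list of this kind can be sketched with the TAILQ macros from sys/queue.h. The demo_ names and the reduced struct are hypothetical; the sketch only shows the reuse pattern (take from the free queue if possible, otherwise allocate, and return the conn to the queue instead of freeing it), not twemproxy's actual functions.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/queue.h>

/* Reduced, hypothetical conn: just the queue link and the descriptor. */
struct demo_conn {
    TAILQ_ENTRY(demo_conn) conn_tqe;   /* link in the free q */
    int                    sd;         /* socket descriptor */
};

TAILQ_HEAD(demo_conn_tqh, demo_conn);

/* The pool: a queue of idle conns plus a counter. */
struct demo_conn_tqh demo_free_connq = TAILQ_HEAD_INITIALIZER(demo_free_connq);
unsigned demo_nfree_connq = 0;

/* Take a conn from the pool, or allocate one when the pool is empty. */
struct demo_conn *
demo_conn_get(void)
{
    struct demo_conn *conn;

    if (!TAILQ_EMPTY(&demo_free_connq)) {
        conn = TAILQ_FIRST(&demo_free_connq);
        TAILQ_REMOVE(&demo_free_connq, conn, conn_tqe);
        demo_nfree_connq--;
    } else {
        conn = malloc(sizeof(*conn));
        if (conn == NULL) {
            return NULL;
        }
    }
    conn->sd = -1;   /* reset state before handing the conn out */
    return conn;
}

/* Return a conn to the pool instead of freeing it. */
void
demo_conn_put(struct demo_conn *conn)
{
    assert(conn != NULL);
    TAILQ_INSERT_HEAD(&demo_free_connq, conn, conn_tqe);
    demo_nfree_connq++;
}
```

Inserting at the head means the most recently released conn is reused first, which keeps recently touched memory in use.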

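(2) Server side

The definition of the runtime context, struct context *ctx, contains this member:

struct array pool; /* server_pool[] */

That is, a ctx holds an array of server_pools, and each server_pool is, as the name suggests, a pool (array) of servers.

A server_pool corresponds to one block of the configuration: in the configuration above, beta and gamma are each a server_pool. A server corresponds to one entry in the servers section of a server_pool; beta above, for example, has four servers.

The relationship between server_pool and server is described in the source roughly as follows (nc_server.h):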

/*
 * server_pool is a collection of servers and their continuum. Each
 * server_pool is the owner of a single proxy connection and one or
 * more client connections. server_pool itself is owned by the current
 * context.
 *
 * Each server is the owner of one or more server connections. server
 * itself is owned by the server_pool.
 *
 *  +-------------+
 *  |             |<---------------------+
 *  |             |<------------+        |
 *  |             |     +-------+--+-----+----+--------------+
 *  |   pool 0    |+--->|          |          |              |
 *  |             |     | server 0 | server 1 | ...     ...  |
 *  |             |     |          |          |              |--+
 *  |             |     +----------+----------+--------------+  |
 *  +-------------+                                             //
 *  |             |
 *  |             |
 *  |             |
 *  |   pool 1    |
 *  |             |
 *  |             |
 *  |             |
 *  +-------------+
 *  |             |
 *  |             |
 *  .             .
 *  .    ...      .
 *  .             .
 *  |             |
 *  |             |
 *  +-------------+
 *            |
 *            |
 *            //
 */

Both are defined as follows (nc_server.h):

struct server {
    uint32_t           idx;           /* server index */
    struct server_pool *owner;        /* owner pool */

    struct string      pname;         /* hostname:port:weight (ref in conf_server) */
    struct string      name;          /* hostname:port or [name] (ref in conf_server) */
    struct string      addrstr;       /* hostname (ref in conf_server) */
    uint16_t           port;          /* port */
    uint32_t           weight;        /* weight */
    struct sockinfo    info;          /* server socket info */

    uint32_t           ns_conn_q;     /* # server connection */
    struct conn_tqh    s_conn_q;      /* server connection q */

    int64_t            next_retry;    /* next retry time in usec */
    uint32_t           failure_count; /* # consecutive failures */
};

struct server_pool {
    uint32_t           idx;                  /* pool index */
    struct context     *ctx;                 /* owner context */

    struct conn        *p_conn;              /* proxy connection (listener) */
    uint32_t           nc_conn_q;            /* # client connection */
    struct conn_tqh    c_conn_q;             /* client connection q */

    struct array       server;               /* server[] */
    uint32_t           ncontinuum;           /* # continuum points */
    uint32_t           nserver_continuum;    /* # servers - live and dead on continuum (const) */
    struct continuum   *continuum;           /* continuum */
    uint32_t           nlive_server;         /* # live server */
    int64_t            next_rebuild;         /* next distribution rebuild time in usec */

    struct string      name;                 /* pool name (ref in conf_pool) */
    struct string      addrstr;              /* pool address - hostname:port (ref in conf_pool) */
    uint16_t           port;                 /* port */
    struct sockinfo    info;                 /* listen socket info */
    mode_t             perm;                 /* socket permission */
    int                dist_type;            /* distribution type (dist_type_t) */
    int                key_hash_type;        /* key hash type (hash_type_t) */
    hash_t             key_hash;             /* key hasher */
    struct string      hash_tag;             /* key hash tag (ref in conf_pool) */
    int                timeout;              /* timeout in msec */
    int                backlog;              /* listen backlog */
    int                redis_db;             /* redis database to connect to */
    uint32_t           client_connections;   /* maximum # client connection */
    uint32_t           server_connections;   /* maximum # server connection */
    int64_t            server_retry_timeout; /* server retry timeout in usec */
    uint32_t           server_failure_limit; /* server failure limit */
    struct string      redis_auth;           /* redis_auth password (matches requirepass on redis) */
    unsigned           require_auth;         /* require_auth? */
    unsigned           auto_eject_hosts:1;   /* auto_eject_hosts? */
    unsigned           preconnect:1;         /* preconnect? */
    unsigned           redis:1;              /* redis? */
    unsigned           tcpkeepalive:1;       /* tcpkeepalive? */
};

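A server_pool stores the connections to clients and a server stores the connections to the backend; both use struct conn_tqh, which is backed by a doubly linked tail queue (TAILQ).

During initialization, the servers entries of the configuration file are turned into server variables and stored in the server_pool array of ctx:

    /* initialize server pool from configuration */
    status = server_pool_init(&ctx->pool, &ctx->cf->pool, ctx);

Then ctx is set as the owner of each server_pool:

    /* set ctx as the server pool owner */
    status = array_each(server_pool, server_pool_each_set_owner, ctx);

Then the maximum number of server connections that the ctx's server_pools can open is computed:

    /* compute max server connections */
    ctx->max_nsconn = 0;
    status = array_each(server_pool, server_pool_each_calc_connections, ctx);

The configuration has a server_connections parameter that caps the number of connections that can be opened to each server in a server_pool (all servers in the pool share the same server_connections value). The number of server-side connections a server_pool can open is therefore server_connections * (number of servers in the pool) + 1, where the 1 stands for the pool's listening socket that accepts client connections. The code is as follows (nc_server.c):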

static rstatus_t
server_pool_each_calc_connections(void *elem, void *data)
{
    struct server_pool *sp = elem;
    struct context *ctx = data;

    ctx->max_nsconn += sp->server_connections * array_n(&sp->server);
    ctx->max_nsconn += 1; /* pool listening socket */

    return NC_OK;
}

So the number of server-side connections the whole ctx can open is the sum over all of its server_pools.
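As a worked example, the sketch below applies the same arithmetic to the two pools of the sample configuration. The demo_ names are hypothetical, and because beta does not set server_connections, the value 1 used for it here is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced view of a pool: only what the sum needs. */
struct demo_pool {
    uint32_t server_connections;   /* per-server connection limit */
    uint32_t nserver;              /* number of servers in the pool */
};

/* Sum of (server_connections * nserver + 1) over all pools. */
uint32_t
demo_calc_max_nsconn(const struct demo_pool *pools, size_t npool)
{
    uint32_t max_nsconn = 0;
    size_t i;

    assert(pools != NULL || npool == 0);
    for (i = 0; i < npool; i++) {
        max_nsconn += pools[i].server_connections * pools[i].nserver;
        max_nsconn += 1;   /* pool listening socket */
    }
    return max_nsconn;
}
```

With beta as { 1, 4 } and gamma as { 10000, 2 }, beta contributes 1 * 4 + 1 = 5 and gamma contributes 10000 * 2 + 1 = 20001, for a total of 20006.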

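Next, every server in each server_pool of ctx->pool is updated. How requests are distributed depends on the distribution parameter in the configuration file; there are three options, KETAMA, MODULA and RANDOM. For the logic of each, see the implementation files hashkit/nc_ketama.c, hashkit/nc_modula.c and hashkit/nc_random.c.

If preconnect is true in the configuration, connections are established at initialization time to every server of every server_pool in ctx->pool.

First server_conn is called to get a free connection from the connection pool; then server_connect establishes the TCP connection to the server; finally the conn structure for that connection, which holds the connection's socket descriptor, is passed to the event layer (event_add_conn).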

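(3) Client side

Clients also have a maximum connection count, which is derived from the server-side maximum: maximum client connections = maximum number of open files the system allows the process - maximum server connections - reserved file descriptors (nc_core.c).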

    status = getrlimit(RLIMIT_NOFILE, &limit);
    if (status < 0) {
        log_error("getrlimit failed: %s", strerror(errno));
        return NC_ERROR;
    }   

    ctx->max_nfd = (uint32_t)limit.rlim_cur;
    ctx->max_ncconn = ctx->max_nfd - ctx->max_nsconn - RESERVED_FDS;

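(4) Request messages

(5) Response messages

Request Handling Flow

NC does not use libevent; it implements its own networking layer, a Reactor pattern built on a single thread, non-blocking I/O and I/O multiplexing, which ties each event to the connection (conn) on which it occurs.
In summary, there are five kinds of events:
(1). The proxy listens for client connections;

Each server_pool has one proxy, which listens for connection requests from clients (note: connection requests, not data requests). Initializing the proxy at startup mainly means that, for every server_pool in ctx->pool, a free connection (conn) is taken from the connection pool and added to evb to start listening: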

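    status = event_add_conn(ctx->evb, p);
    status = event_del_out(ctx->evb, p);
    Note: event_add_conn registers a new connection for monitoring; event_del_out then drops the write event, since a listening socket only needs read events.

The handler callbacks of this conn differ from those used for data requests, for example:

    conn->recv = proxy_recv;
    conn->close = proxy_close;
    conn->ref = proxy_ref;
    conn->unref = proxy_unref;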

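(2). The client side receives requests;

(3). The server side sends requests;
(4). The server side receives responses;
(5). The client side sends responses;
Each NC corresponds to an instance structure, which holds the event_base for that instance. Every event on every connection registered with the event_base uses the same callback, core_core (nc_core.c), which then looks up the function pointer for the connection on which the event occurred (the function pointers are set when the connection is initialized) and calls it.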
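The single-callback dispatch can be sketched as follows. All demo_ names, the event bit values and the handler signatures are hypothetical, and the ordering (error first, then read, then write) is an assumption of this sketch rather than something the text above states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_EVENT_READ  0x1u
#define DEMO_EVENT_WRITE 0x2u
#define DEMO_EVENT_ERR   0x4u

struct demo_ev_conn;
typedef int (*demo_conn_io_t)(struct demo_ev_conn *conn);

/* Hypothetical conn: per-connection handlers plus counters for the demo. */
struct demo_ev_conn {
    demo_conn_io_t recv;    /* read handler */
    demo_conn_io_t send;    /* write handler */
    demo_conn_io_t close;   /* close handler */
    int            nrecv;
    int            nsend;
    int            nclose;
};

/* Trivial handlers that only count their invocations. */
int demo_recv(struct demo_ev_conn *conn)  { conn->nrecv++;  return 0; }
int demo_send(struct demo_ev_conn *conn)  { conn->nsend++;  return 0; }
int demo_close(struct demo_ev_conn *conn) { conn->nclose++; return 0; }

/*
 * The one callback registered with the event base. It does no I/O itself;
 * it only routes the event to the handlers stored in the conn.
 */
int
demo_core(void *arg, uint32_t events)
{
    struct demo_ev_conn *conn = arg;

    assert(conn != NULL);

    /* assumption: an error takes precedence over read and write */
    if (events & DEMO_EVENT_ERR) {
        conn->close(conn);
        return -1;
    }
    /* assumption: read is handled before write */
    if ((events & DEMO_EVENT_READ) && conn->recv(conn) != 0) {
        conn->close(conn);
        return -1;
    }
    if ((events & DEMO_EVENT_WRITE) && conn->send(conn) != 0) {
        conn->close(conn);
        return -1;
    }
    return 0;
}
```

Because the handlers live in the conn, a listening conn, a client conn and a server conn can all share demo_core while behaving differently: each kind simply installs different function pointers when it is initialized.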
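The processing goes as follows:
Each client and server connection has its own in_q and out_q; to tell them apart, call them c_inq/c_outq and s_inq/s_outq. A request first arrives at c_inq, which triggers the recv function on the client<=>proxy (nc) connection. After the client side receives it and it goes through parse, filter and so on, the request is placed from c_inq into c_outq (if it needs a response) and also into the s_inq of the selected server, and the events of that server's connection are modified to add the event_out event.
The flow is roughly: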

core_core->core_recv->msg_recv->req_recv_next->req_recv_done->req_filter->req_forward
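The hand-off at the end of this chain, where one request ends up on two queues at once, can be sketched with TAILQ. The demo_ names are hypothetical, the message carries one queue link per queue it can sit on, and the want_out flag merely stands in for registering the write event on the server connection.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical message with one link per queue it can sit on. */
struct demo_msg {
    TAILQ_ENTRY(demo_msg) c_tqe;     /* link in a client q */
    TAILQ_ENTRY(demo_msg) s_tqe;     /* link in a server q */
    bool                  noreply;   /* request expects no response? */
};

TAILQ_HEAD(demo_msg_tqh, demo_msg);

/* Hypothetical conn reduced to its two message queues. */
struct demo_q_conn {
    struct demo_msg_tqh imsg_q;     /* incoming request q */
    struct demo_msg_tqh omsg_q;     /* outstanding request q */
    bool                want_out;   /* stand-in for the write event */
};

void
demo_q_conn_init(struct demo_q_conn *conn)
{
    TAILQ_INIT(&conn->imsg_q);
    TAILQ_INIT(&conn->omsg_q);
    conn->want_out = false;
}

/* Forward one parsed request from the client conn to the server conn. */
void
demo_req_forward(struct demo_q_conn *c_conn, struct demo_q_conn *s_conn,
                 struct demo_msg *msg)
{
    assert(c_conn != NULL && s_conn != NULL && msg != NULL);

    /* remember the request on the client side only if a reply is due */
    if (!msg->noreply) {
        TAILQ_INSERT_TAIL(&c_conn->omsg_q, msg, c_tqe);
    }
    /* first pending request: ask to be woken when the socket is writable */
    if (TAILQ_EMPTY(&s_conn->imsg_q)) {
        s_conn->want_out = true;
    }
    TAILQ_INSERT_TAIL(&s_conn->imsg_q, msg, s_tqe);
}
```

Keeping two separate links in the message is what lets the same object sit on the client's outstanding queue and the server's incoming queue without copying it.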

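On the next run of the event loop, the write (OUT) event registered on that server connection fires and calls the send function set up at initialization. That function sends the requests in s_inq one by one (msg_send_chain) to the backend server (Memcached/Redis) for processing.
The flow is roughly: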

core_core->core_send->msg_send->req_send_next->req_send_done

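Once processing is done, the backend returns the response to the server connection, the server side passes the response to the client side, and the client side sends it to the client.
The flow is not listed in detail; it roughly involves these functions: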

core_core->core_recv->msg_recv->rsp_recv_next->rsp_recv_done->rsp_filter->rsp_forward
core_core->core_send->msg_send->rsp_send_next->rsp_send_done

The source contains a diagram that describes the whole process clearly (nc_message.c):

 * Note that in the above discussion, the terminology send is used
 * synonymously with write or OUT event. Similarly recv is used synonymously
 * with read or IN event
 *
 *             Client+             Proxy           Server+
 *                              (nutcracker)
 *                                   .
 *       msg_recv {read event}       .       msg_recv {read event}
 *         +                         .                         +
 *         |                         .                         |
 *         \                         .                         /
 *         req_recv_next             .             rsp_recv_next
 *           +                       .                       +
 *           |                       .                       |       Rsp
 *           req_recv_done           .           rsp_recv_done      <===
 *             +                     .                     +
 *             |                     .                     |
 *    Req      \                     .                     /
 *    ===>     req_filter*           .           *rsp_filter
 *               +                   .                   +
 *               |                   .                   |
 *               \                   .                   /
 *               req_forward-//  (a) . (c)  \\-rsp_forward
 *                                   .
 *                                   .
 *       msg_send {write event}      .      msg_send {write event}
 *         +                         .                         +
 *         |                         .                         |
 *    Rsp' \                         .                         /     Req'
 *   <===  rsp_send_next             .             req_send_next     ===>
 *           +                       .                       +
 *           |                       .                       |
 *           \                       .                       /
 *           rsp_send_done-//    (d) . (b)    //-req_send_done
 *
 *
 * (a) -> (b) -> (c) -> (d) is the normal flow of transaction consisting
 * of a single request response, where (a) and (b) handle request from
 * client, while (c) and (d) handle the corresponding response from the
 * server.

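Distribution Strategies

twemproxy supports three strategies:

ketama: an implementation of consistent hashing

modula: maps a key to a server by taking the key's hash modulo the number of servers

random: picks a server at random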
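The three strategies can be contrasted in a short sketch. Everything here is illustrative: the demo_ names are hypothetical, the hash is the textbook 64-bit FNV-1a and may not match the output of twemproxy's fnv1a_64, weights are ignored, and the ring lookup only shows the consistent-hashing idea of finding the first point at or after the key's hash; see the hashkit files for the real logic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Textbook 64-bit FNV-1a. */
uint64_t
demo_fnv1a_64(const char *key, size_t len)
{
    uint64_t h = 0xcbf29ce484222325ULL;   /* FNV offset basis */
    size_t i;

    for (i = 0; i < len; i++) {
        h ^= (unsigned char)key[i];
        h *= 0x100000001b3ULL;            /* FNV prime */
    }
    return h;
}

/* modula: hash modulo the number of servers. */
uint32_t
demo_modula(uint64_t hash, uint32_t nserver)
{
    assert(nserver > 0);
    return (uint32_t)(hash % nserver);
}

/* random: any server, independent of the key. */
uint32_t
demo_random(uint32_t nserver)
{
    assert(nserver > 0);
    return (uint32_t)rand() % nserver;
}

/* One point on the consistent-hashing ring. */
struct demo_point {
    uint32_t value;   /* position on the ring */
    uint32_t index;   /* server that owns this point */
};

/*
 * ketama-style lookup: binary search for the first point whose value is
 * >= hash, wrapping to the first point. The ring must be sorted by value.
 */
uint32_t
demo_ketama(const struct demo_point *ring, uint32_t npoint, uint32_t hash)
{
    uint32_t lo = 0, hi = npoint;

    assert(ring != NULL && npoint > 0);
    while (lo < hi) {
        uint32_t mid = lo + (hi - lo) / 2;

        if (ring[mid].value < hash) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    if (lo == npoint) {
        lo = 0;   /* past the last point: wrap around the ring */
    }
    return ring[lo].index;
}
```

The practical difference shows up when a server is added or removed: with modula almost every key moves because the divisor changes, whereas on a ring only the keys between the affected point and its predecessor move.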

Zero-copy Implementation

(To be continued)
