Memcached Source Code Analysis (Threading Model)

http://www.iteye.com/topic/344172

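Most of the memcached analysis available online focuses on memory management, so here is a brief look at memcached's threading model.

If anything below is wrong, please point it out. If you are not familiar with memcached and libevent, please google them first.

Let's start with how threads are set up when memcached starts.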

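memcached implements multithreading by instantiating libevent several times: there is one main thread and n worker threads.
Both the main thread and the worker threads manage network events through libevent; in fact, each thread is a separate libevent instance.

The main thread listens for connection requests from clients and accepts the connections.
The worker threads handle read, write and other events on connections that are already established.

[Figure: rough overview of the main thread and the worker threads]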

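First, the main data structures (thread.c):

C code

/* An item in the connection queue. */
typedef struct conn_queue_item CQ_ITEM;
struct conn_queue_item {
    int     sfd;              /* descriptor of the accepted connection */
    int     init_state;       /* initial state for the new conn */
    int     event_flags;      /* libevent flags to register with */
    int     read_buffer_size; /* size of the read buffer to allocate */
    int     is_udp;           /* non-zero for a UDP socket */
    CQ_ITEM *next;            /* next item in the queue */
};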

A CQ_ITEM is essentially a wrapper around the fd of an established connection that the main thread gets back from accept.

C code

/* A connection queue. */
typedef struct conn_queue CQ;
struct conn_queue {
    CQ_ITEM *head;
    CQ_ITEM *tail;
    pthread_mutex_t lock;
    pthread_cond_t  cond;
};

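CQ is a singly linked list that manages CQ_ITEMs, with a mutex and a condition variable next to the head and tail pointers.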
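
To make the linked-list queue a bit more concrete, here is a minimal sketch of a mutex-protected singly linked queue in the same spirit. The names demo_item, demo_queue, demo_queue_init, demo_queue_push and demo_queue_pop are made up for this illustration. This is not memcached's own queue code, and the condition variable is left out.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins for CQ_ITEM and CQ; not memcached's own code. */
typedef struct demo_item {
    int sfd;                  /* descriptor carried by this node */
    struct demo_item *next;   /* next node, or NULL at the tail */
} demo_item;

typedef struct {
    demo_item *head;          /* oldest item, removed first */
    demo_item *tail;          /* newest item, appended last */
    pthread_mutex_t lock;     /* guards head and tail */
} demo_queue;

void demo_queue_init(demo_queue *q) {
    assert(q != NULL);
    pthread_mutex_init(&q->lock, NULL);
    q->head = NULL;
    q->tail = NULL;
}

/* Append an item at the tail (producer side). */
void demo_queue_push(demo_queue *q, demo_item *item) {
    item->next = NULL;
    pthread_mutex_lock(&q->lock);
    if (q->tail == NULL)
        q->head = item;           /* queue was empty */
    else
        q->tail->next = item;
    q->tail = item;
    pthread_mutex_unlock(&q->lock);
}

/* Remove and return the head, or NULL if the queue is empty (consumer side). */
demo_item *demo_queue_pop(demo_queue *q) {
    pthread_mutex_lock(&q->lock);
    demo_item *item = q->head;
    if (item != NULL) {
        q->head = item->next;
        if (q->head == NULL)
            q->tail = NULL;       /* queue became empty */
    }
    pthread_mutex_unlock(&q->lock);
    return item;
}
```

One thread would call demo_queue_push and another demo_queue_pop; items come out in the order they went in, and the lock keeps head and tail consistent between the two threads.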

C code

typedef struct {
    pthread_t thread_id;        /* unique ID of this thread */
    struct event_base *base;    /* libevent handle this thread uses */
    struct event notify_event;  /* listen event for notify pipe */
    int notify_receive_fd;      /* receiving end of notify pipe */
    int notify_send_fd;         /* sending end of notify pipe */
    CQ  new_conn_queue;         /* queue of new connections to handle */
} LIBEVENT_THREAD;

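This is memcached's wrapper for a thread. Each thread holds a CQ queue, a notification pipe, and a libevent instance (event_base).

Another structure, and the most important one, is conn, which wraps each network connection.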

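C code

typedef struct {
    int sfd;             /* socket descriptor of this connection */
    int state;           /* current state, one of conn_states */
    struct event event;  /* libevent event for this connection */
    short which;         /* which events were just triggered */
    char *rbuf;          /* read buffer */
    ... /* many status flags and read/write buffer fields omitted here */
} conn;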

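memcached handles events mainly by setting and switching the state of each connection (the core function is drive_machine).

Now for the thread initialization flow.

In the main function of memcached.c, the libevent instance of the main thread is initialized first: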

C code

/* initialize main thread libevent instance */
main_base = event_init();

Then all the worker threads are initialized and started; the details of the startup are described later.

C code

/* start up worker threads if MT mode */
thread_init(settings.num_threads, main_base);

Next the main thread calls the following (only the TCP case is analyzed here; memcached currently supports UDP as well):

C code

server_socket(settings.port, 0)

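This function wraps a series of steps: creating the listening socket, binding the address, setting non-blocking mode, and registering a libevent read event for the listening socket.

Then the main thread calls: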

C code

/* enter the event loop */
event_base_loop(main_base, 0);

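At this point the main thread starts accepting external connection requests through libevent, and the startup is complete.

Now let's see how thread_init starts all the worker threads. Here is the core of thread_init: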

C code

void thread_init(int nthreads, struct event_base *main_base) {
    /* ... omitted ... */

    threads = malloc(sizeof(LIBEVENT_THREAD) * nthreads);
    if (! threads) {
        perror("Can't allocate thread descriptors");
        exit(1);
    }

    /* slot 0 is the main thread and reuses the already created main_base */
    threads[0].base = main_base;
    threads[0].thread_id = pthread_self();

    for (i = 0; i < nthreads; i++) {
        int fds[2];
        if (pipe(fds)) {
            perror("Can't create notify pipe");
            exit(1);
        }

        threads[i].notify_receive_fd = fds[0];
        threads[i].notify_send_fd = fds[1];

        setup_thread(&threads[i]);
    }

    /* Create threads after we've done all the libevent setup. */
    for (i = 1; i < nthreads; i++) {
        create_worker(worker_libevent, &threads[i]);
    }
}

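threads is declared like this:
static LIBEVENT_THREAD *threads;

thread_init first mallocs space for the threads. The first entry of threads is the main thread and the rest are worker threads.
It then creates a pipe for each thread, which the main thread uses to tell that thread that a new connection has arrived.

Now look at setup_thread: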

C code

static void setup_thread(LIBEVENT_THREAD *me) {
    if (! me->base) {
        me->base = event_init();
        if (! me->base) {
            fprintf(stderr, "Can't allocate event base\n");
            exit(1);
        }
    }

    /* Listen for notifications from other threads */
    event_set(&me->notify_event, me->notify_receive_fd,
              EV_READ | EV_PERSIST, thread_libevent_process, me);
    event_base_set(me->base, &me->notify_event);

    if (event_add(&me->notify_event, 0) == -1) {
        fprintf(stderr, "Can't monitor libevent notify pipe\n");
        exit(1);
    }

    cq_init(&me->new_conn_queue);
}

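setup_thread mainly creates the libevent instances of all the worker threads (the main thread's libevent instance was already created in main).

Because of the earlier threads[0].base = main_base;, the first thread (the main thread) does not execute event_init() here.
Next it registers a libevent read event on the read end of each worker thread's pipe, to wait for notifications from the main thread.
Finally it initializes each worker's CQ.

create_worker is what actually starts a thread: pthread_create runs worker_libevent, which calls event_base_loop to start that thread's libevent loop.

Keep in mind that, so far, each worker thread is only triggered when the read end of its own pipe becomes readable, and it then calls thread_libevent_process.

Here is that function: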

C code

static void thread_libevent_process(int fd, short which, void *arg) {
    LIBEVENT_THREAD *me = arg;
    CQ_ITEM *item;
    char buf[1];

    /* consume the one-byte notification written by the dispatching thread */
    if (read(fd, buf, 1) != 1)
        if (settings.verbose > 0)
            fprintf(stderr, "Can't read from libevent pipe\n");

    item = cq_peek(&me->new_conn_queue);

    if (NULL != item) {
        conn *c = conn_new(item->sfd, item->init_state, item->event_flags,
                           item->read_buffer_size, item->is_udp, me->base);
        /* ... omitted ... */
    }
}

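The fd argument is the descriptor of the read end of this thread's pipe.
The function first reads the one-byte notification out of the pipe. This is required: in level-triggered mode, an event that is not handled keeps being reported until it is handled.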
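
The level-triggered behaviour is easy to try out on a plain pipe. The sketch below is only an illustration: it uses poll() instead of libevent, and demo_is_readable and demo_notify_roundtrip are hypothetical helper names that do not exist in memcached.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Returns 1 if fd is readable right now, 0 if not, -1 on error. */
int demo_is_readable(int fd) {
    struct pollfd p;
    assert(fd >= 0);
    p.fd = fd;
    p.events = POLLIN;
    p.revents = 0;
    int n = poll(&p, 1, 0);    /* timeout 0: check without blocking */
    if (n < 0)
        return -1;
    return (n == 1 && (p.revents & POLLIN)) ? 1 : 0;
}

/* Walks through: notify, readable, still readable, consume, quiet.
 * Returns 0 if every step behaves as level-triggered readiness predicts. */
int demo_notify_roundtrip(void) {
    int fds[2];
    char buf[1];
    if (pipe(fds) != 0)
        return -1;
    int ok = 1;
    ok = ok && demo_is_readable(fds[0]) == 0;  /* nothing written yet */
    ok = ok && write(fds[1], "", 1) == 1;      /* one-byte notification */
    ok = ok && demo_is_readable(fds[0]) == 1;  /* reported readable */
    ok = ok && demo_is_readable(fds[0]) == 1;  /* reported again: byte not consumed */
    ok = ok && read(fds[0], buf, 1) == 1;      /* consume the byte */
    ok = ok && demo_is_readable(fds[0]) == 0;  /* quiet again */
    close(fds[0]);
    close(fds[1]);
    return ok ? 0 : 1;
}
```

As long as the byte stays in the pipe, every readiness check reports the descriptor as readable; only the read() makes the notification go away, which is why thread_libevent_process reads the byte before doing anything else.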

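cq_peek takes the CQ_ITEM at the head of this thread's CQ queue. The item was put there by the main thread, and item->sfd is the descriptor of an established connection. conn_new registers a libevent read event for that descriptor. me->base is the event_base of the current thread, which means that events on this descriptor are handled by the current worker thread. The most important part of conn_new is: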

C code

conn *conn_new(const int sfd, const int init_state, const int event_flags,
               const int read_buffer_size, const bool is_udp, struct event_base *base) {
    /* ... omitted ... */

    /* register the event on the event_base passed in by the caller */
    event_set(&c->event, sfd, event_flags, event_handler, (void *)c);
    event_base_set(base, &c->event);
    c->ev_flags = event_flags;

    if (event_add(&c->event, 0) == -1) {
        if (conn_add_to_freelist(c)) {
            conn_free(c);
        }
        perror("event_add");
        return NULL;
    }

    /* ... omitted ... */
}

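As you can see, an event (in practice EV_READ|EV_PERSIST) is registered for the new connection and handled by the current thread, because the event_base here belongs to this worker thread.
When the connection has readable data, event_handler is called back; event_handler mostly just calls drive_machine, the core function of memcached.

Next, how does the main thread notify worker threads about new connections? The main thread's libevent has a read event registered on the listening socket descriptor, so incoming connection requests are handled by the main thread. The callback is also event_handler, because the main thread sets up the libevent read event of the listening socket through conn_new as well.

Finally, the very core of memcached's network event handling: drive_machine.
Keep firmly in mind that drive_machine runs in a multithreaded environment: both the main thread and the workers execute it.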

C code

static void drive_machine(conn *c) {
    bool stop = false;
    int sfd, flags = 1;
    socklen_t addrlen;
    struct sockaddr_storage addr;
    int res;

    assert(c != NULL);

    while (!stop) {
        switch(c->state) {
        case conn_listening:
            addrlen = sizeof(addr);
            if ((sfd = accept(c->sfd, (struct sockaddr *)&addr, &addrlen)) == -1) {
                /* handling of the many error cases omitted */
                break;
            }
            if ((flags = fcntl(sfd, F_GETFL, 0)) < 0 ||
                fcntl(sfd, F_SETFL, flags | O_NONBLOCK) < 0) {
                perror("setting O_NONBLOCK");
                close(sfd);
                break;
            }
            dispatch_conn_new(sfd, conn_read, EV_READ | EV_PERSIST,
                              DATA_BUFFER_SIZE, false);
            break;

        case conn_read:
            if (try_read_command(c) != 0) {
                continue;
            }
            /* ... remaining code and states omitted ... */
        }
    }
}

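First of all, don't be misled by the while loop (most Java programmers will immediately think of an endless loop). Each break only leaves the switch; the loop itself ends as soon as a branch sets stop to true, which usually happens after one or a few cases (those assignments are in the parts omitted above). The while is there with edge-triggered mode in mind, where you have to keep reading until you get the EWOULDBLOCK error.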
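
The "keep reading until EWOULDBLOCK" pattern can be sketched on its own as follows. This is a generic illustration rather than code from memcached; demo_drain and demo_set_nonblocking are hypothetical helper names, and the 64-byte buffer is an arbitrary choice.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Puts fd into non-blocking mode, the same fcntl pattern drive_machine uses. */
int demo_set_nonblocking(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Reads everything currently available on a non-blocking descriptor.
 * Returns the number of bytes consumed, or -1 on a real error.
 * Stops at end-of-file or when read() reports EAGAIN/EWOULDBLOCK. */
long demo_drain(int fd) {
    char buf[64];
    long total = 0;
    assert(fd >= 0);
    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n > 0) {
            total += n;
            continue;            /* more data may be waiting */
        }
        if (n == 0)
            return total;        /* the other end closed */
        if (errno == EINTR)
            continue;            /* interrupted by a signal: retry */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return total;        /* nothing left for now */
        return -1;
    }
}
```

With edge-triggered notification a handler that stops after a single read() may never be told about the bytes it left behind, so the loop only returns once the descriptor reports that it is empty.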

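Back to the point. drive_machine decides what to do based on the state of the current connection. Since this core function is what ends up being called back once read/write events are registered with libevent, the state is stored in the conn structure at the time the event is registered. When libevent makes the callback it passes that conn structure along as the argument, which becomes the parameter of this function.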

The connection states in memcached are declared as an enum:

C code

enum conn_states {
    conn_listening,  /** the socket which listens for connections */
    conn_read,       /** reading in a command line */
    conn_write,      /** writing out a simple response */
    conn_nread,      /** reading in a fixed number of bytes */
    conn_swallow,    /** swallowing unnecessary bytes w/o storing */
    conn_closing,    /** closing this connection */
    conn_mwrite,     /** writing out many items sequentially */
};

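The case conn_listening: branch is handled by the main thread itself; worker threads never execute it.
We saw that after accept the main thread calls
dispatch_conn_new(sfd, conn_read, EV_READ | EV_PERSIST,DATA_BUFFER_SIZE, false);

This function is where the worker threads are notified. Let's look at it: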

C code

void dispatch_conn_new(int sfd, int init_state, int event_flags,
                       int read_buffer_size, int is_udp) {
    CQ_ITEM *item = cqi_new();
    int thread = (last_thread + 1) % settings.num_threads;

    last_thread = thread;

    item->sfd = sfd;
    item->init_state = init_state;
    item->event_flags = event_flags;
    item->read_buffer_size = read_buffer_size;
    item->is_udp = is_udp;

    cq_push(&threads[thread].new_conn_queue, item);

    MEMCACHED_CONN_DISPATCH(sfd, threads[thread].thread_id);
    if (write(threads[thread].notify_send_fd, "", 1) != 1) {
        perror("Writing to thread notify pipe");
    }
}

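It is clear that the main thread first creates a new CQ_ITEM, then picks a thread using a round-robin policy, and puts the CQ_ITEM into that thread's CQ queue with cq_push. So how does the corresponding worker thread find out?

Through this:
write(threads[thread].notify_send_fd, "", 1)
One byte is written to that thread's pipe, and that thread's libevent then calls back thread_libevent_process (described above).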

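That thread then takes out the item and registers the read event. When data arrives on the connection, drive_machine is eventually called back as well. In other words, case conn_read: and the other branches of drive_machine are handled by the thread that was given the connection, while conn_listening (establishing connections) is handled only by the main thread. Note that, going by the code shown above, the round-robin index also covers threads[0], so in this version the main thread's event loop can be handed connections too.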
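
The round-robin selection by itself looks like this when pulled out into a tiny helper. demo_dispatcher and demo_pick_thread are hypothetical names for this sketch, and starting last_thread at -1 is an assumption made here so that the first pick is index 0.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical dispatcher state: number of threads and the index used last. */
typedef struct {
    int num_threads;
    int last_thread;
} demo_dispatcher;

/* Picks the next thread index in round-robin order and remembers it. */
int demo_pick_thread(demo_dispatcher *d) {
    assert(d != NULL && d->num_threads > 0);
    int thread = (d->last_thread + 1) % d->num_threads;
    d->last_thread = thread;
    return thread;
}
```

With four threads and last_thread starting at -1, successive calls return 0, 1, 2, 3 and then wrap around to 0 again, so every slot of the thread array, index 0 included, gets its turn.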

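There really is a lot of code in this part and it cannot all be pasted here, so please refer to the source (the latest version is 1.2.6). I have left out many optimizations; for example, CQ_ITEMs are malloc'ed many at a time to reduce fragmentation, among other details.

This was written in a hurry, so if you find any mistakes, criticism is welcome.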
