Solving a Bizarre Bug

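Our company's network video surveillance product is in testing. Over the past two days the testers reported a strange problem on several devices, with the following symptoms:
(1) The PC does not receive the images that the device-side application captures and sends over the network.
(2) The PC can ping the device and telnet into it, but when the device pings the PC only one packet gets through.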

So I logged in to a faulty device over telnet and transferred files to and from the PC with tftp, http and ftp. All of them worked.

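At first I suspected the device's NIC driver. On reflection, though, the NIC driver sits at the data link layer, and packets handed down from the upper layers look the same to it regardless of the application-layer protocol. Since tftp, ftp and http work, the driver can send and receive data correctly, so ping should work too.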

Looking closely at the ping output, I noticed something unusual:

# ping 10.0.14.198
PING 10.0.14.198 (10.0.14.198): 56 data bytes
64 bytes from 10.0.14.198: seq=0 ttl=127 time=0.817 ms
^C
--- 10.0.14.198 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 0.817/0.817/0.817 ms

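One ICMP packet gets through and then nothing more happens. After ctrl+c, ping exits and reports no packet loss.
Judging by the symptom, ping is not losing packets; it looks more like ping blocks after completing one round trip.
At this point I stopped suspecting the NIC driver and went looking for other abnormal behaviour in the system.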

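Running ps, I found that PIDs on the faulty device had reached a little over 20,000 and had stopped increasing, while on a healthy device PIDs were a little over 10,000 and still increasing.
Could the image capture program on the faulty device have spawned too many processes, so that the system hit its process limit and could no longer create new ones?
I wrote a test program to check: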

#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>

#define MAXPROCESS 65535
#define SLEEPTIME 1

int main(int argc, char **argv)
{
    pid_t pid;
    int count = 0;
    int maxprocess = MAXPROCESS;

    /* Optional argument: number of children to create. */
    if (argc == 2) {
        maxprocess = atoi(argv[1]);
    }

    for (count = 0; count < maxprocess; count++)
    {
        pid = fork();
        if (pid < 0) {
            perror("fork error");
            exit(1);
        } else if (pid == 0) {
            /* Child: sleep for SLEEPTIME seconds, then exit. */
            printf("child %d start\n", count);
            sleep(SLEEPTIME);
            printf("child %d end\n", count);
            exit(0);
        }
        printf("parent:create %d child\n", count);
    }

    /* Parent: reap every child that was actually created. */
    for (count = 0; count < maxprocess; count++) {
        wait(NULL);
    }
    return 0;
}

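It creates the requested number of child processes, each of which sleeps for 1 s; the parent waits for the children to exit, reaps them and then exits.
Compiled and run, the result was: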

# ./fork_test 1
parent:create 0 child
child 0 start
^C
#

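The program never exits. The output shows that the child never returns from sleep. Hm, what is going on?
Running the sleep command directly on the console blocks in the same way; only ctrl+c gets out.
So this is another symptom of the bug: sleep blocks.
Next I checked the system's PID limit: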

# cat /proc/sys/kernel/pid_max
32768

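Watching a healthy device, I saw that once PIDs reach 32768 the system wraps around and reuses released PIDs in the range 0-32768.
So I stopped suspecting the process limit as well and had to look for other clues.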

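Comparing the system time from date with the RTC time from hwclock, I found a large gap between them. But the kernel synchronizes the system time with the RTC when it loads the RTC driver at boot, so shouldn't the two agree?
Watching more closely I found the reason, and with it another important symptom. The system clock ticks at the right rate, but it lags the RTC because it runs forward about 180 s from a certain value, jumps back to that value, and runs forward again, over and over.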
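The backward jump can also be caught programmatically instead of by watching date. The following is only a minimal sketch, not what was run on the device: the helper names wall_clock_went_back and count_backward_jumps are made up for illustration, and the loop busy-polls gettimeofday on purpose, because sleep blocks on an affected device.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/time.h>

/* Hypothetical helper: true when `cur` is earlier than `prev`,
 * i.e. the wall clock moved backwards between two samples. */
bool wall_clock_went_back(const struct timeval *prev, const struct timeval *cur)
{
    if (cur->tv_sec != prev->tv_sec)
        return cur->tv_sec < prev->tv_sec;
    return cur->tv_usec < prev->tv_usec;
}

/* Hypothetical helper: busy-poll gettimeofday() `iterations` times and
 * count how often the reported time goes backwards. No sleep() is used,
 * since sleep() never returns on a device showing this bug. */
unsigned long count_backward_jumps(unsigned long iterations)
{
    struct timeval prev, cur;
    unsigned long jumps = 0;

    gettimeofday(&prev, NULL);
    while (iterations-- > 0) {
        gettimeofday(&cur, NULL);
        if (wall_clock_went_back(&prev, &cur))
            jumps++;
        prev = cur;
    }
    return jumps;
}
```

On a device showing this symptom, the polling has to run for longer than one 180 s cycle before a jump can be observed, so the iteration count must be chosen accordingly.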

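By now I had found four symptoms of this bug:
(1) The PC does not receive the images that the device-side application captures and sends over the network.
(2) The PC can ping the device and telnet into it, but when the device pings the PC only one packet gets through.
(3) sleep blocks on the device.
(4) The system time shown by date on the device runs forward 180 s and then jumps back.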

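My gut feeling was to start from the date symptom. First I looked for the implementation of the date command. The file system on our embedded device uses busybox, which contains a simplified date; the full version can be found in GNU coreutils. I won't go through the implementation here; date ultimately calls gettimeofday to get the time.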

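gettimeofday is a kernel system call implemented by sys_gettimeofday: the call traps into kernel mode through a software interrupt, the kernel dispatches on the system call number to sys_gettimeofday, and that calls do_gettimeofday: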

void do_gettimeofday(struct timeval *tv)
{
    struct timespec now;

    getnstimeofday(&now);
    tv->tv_sec = now.tv_sec;
    tv->tv_usec = now.tv_nsec/1000;
}

void getnstimeofday(struct timespec *ts)
{
    unsigned long seq;
    s64 nsecs;

    WARN_ON(timekeeping_suspended);

    do {
        seq = read_seqbegin(&timekeeper.lock);

        *ts = timekeeper.xtime;
        nsecs = timekeeping_get_ns();

        /* If arch requires, add in gettimeoffset() */
        nsecs += arch_gettimeoffset();

    } while (read_seqretry(&timekeeper.lock, seq));

    timespec_add_ns(ts, nsecs);
}

static inline s64 timekeeping_get_ns(void)
{
    cycle_t cycle_now, cycle_delta;
    struct clocksource *clock;

    /* read clocksource: */
    clock = timekeeper.clock;
    cycle_now = clock->read(clock);

    /* calculate the delta since the last update_wall_time: */
    cycle_delta = (cycle_now - clock->cycle_last) & clock->mask;

    /* return delta convert to nanoseconds using ntp adjusted mult. */
    return clocksource_cyc2ns(cycle_delta, timekeeper.mult,
                  timekeeper.shift);
}

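do_gettimeofday calls getnstimeofday. The key piece is timekeeper.xtime, the kernel's wall-clock time. xtime is updated from the timer interrupt registered through the kernel's clockevent, so as long as the timer interrupt works, xtime keeps being updated.
However, the kernel normally gets a timer interrupt only every 1/100 s (the default kernel configuration is 100 HZ). On a tickless system the interval is not fixed, but either way the resolution is not high enough.
To improve the resolution, timekeeping_get_ns uses the read function of the registered clocksource to get the time elapsed since xtime was last updated, and adds it to xtime as a supplement.
The kernel's xtime update and read mechanisms deserve closer study when I have time; that is enough for now.
The flow around xtime in the kernel is: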

gettimeofday <===read=== xtime <===update=== clockevent clocksource
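The read side of this flow can be modelled in user space. This is only a sketch of the arithmetic in getnstimeofday and timekeeping_get_ns for a 32-bit clocksource; struct wall_time and read_wall_time are illustrative names, and the mult/shift values in the usage note are made-up numbers, not the ones the kernel computes for our timer.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000LL

/* Illustrative stand-in for the kernel's struct timespec. */
struct wall_time {
    int64_t sec;
    int64_t nsec;
};

/* Same arithmetic as clocksource_cyc2ns(): ns = (cycles * mult) >> shift. */
int64_t cyc2ns(uint64_t cycles, uint32_t mult, uint32_t shift)
{
    return (int64_t)((cycles * (uint64_t)mult) >> shift);
}

/* Model of a read: the stored xtime plus the time elapsed on the
 * clocksource since the last update. Unsigned 32-bit subtraction wraps
 * exactly like "(cycle_now - cycle_last) & CLOCKSOURCE_MASK(32)". */
struct wall_time read_wall_time(struct wall_time xtime,
                                uint32_t cycle_now, uint32_t cycle_last,
                                uint32_t mult, uint32_t shift)
{
    uint32_t delta = cycle_now - cycle_last;
    int64_t nsec = xtime.nsec + cyc2ns(delta, mult, shift);
    struct wall_time out;

    out.sec = xtime.sec + nsec / NSEC_PER_SEC;
    out.nsec = nsec % NSEC_PER_SEC;
    return out;
}
```

For example, with the assumed values mult = 125 << 10 and shift = 10 (125 ns per cycle), a delta of 32 cycles adds 4000 ns to xtime, even when the counter wrapped between cycle_last and cycle_now.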

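Now that date jumps backwards, which step has gone wrong? From the flow above I could think of three possibilities:
(1) reading xtime in gettimeofday goes wrong
(2) the stored xtime is wrong
(3) updating xtime goes wrong
To narrow it down I thought of bisection: if I could read xtime directly and compare it with the value returned by gettimeofday, I could tell which step was at fault.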

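First a word about xtime. kernel/time/timekeeping.c in the kernel source defines struct timekeeper, which describes the kernel's time-related state. One of its members is xtime, of type struct timespec, defined as follows: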

struct timespec {
    __kernel_time_t tv_sec;         /* seconds */
    long        tv_nsec;        /* nanoseconds */
};

tv_sec and tv_nsec give the time elapsed since 1970-1-1.

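However, xtime is not exposed through any system call, so it cannot be read directly from user space. The bug is also hard to reproduce and only appears after a device has been running for a long time, so modifying the kernel and rebooting was not an option.
What to do? I came up with an approach: driver module + application.
timekeeping.c also turned out to have an in-kernel interface for reading xtime: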

unsigned long get_seconds(void)
{
    return timekeeper.xtime.tv_sec;
}
EXPORT_SYMBOL(get_seconds);

So I wrote the following module:

#include <linux/module.h>
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/fs.h>
#include <linux/cdev.h>
#include <linux/time.h>

#include <asm/uaccess.h>

#define GET_XTIME 0


static int dev_open(struct inode *inode, struct file *filp)
{
    return 0;
}

static ssize_t dev_read(struct file *file, char __user *buf,
            size_t count, loff_t *ppos)
{
    return 0;
}

static ssize_t dev_write(struct file *file, const char __user *buf,
             size_t count, loff_t *ppos)
{
    return 0;
}

/* GET_XTIME copies timekeeper.xtime.tv_sec to the user buffer. */
static long dev_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
{
    unsigned long __user *argp = (unsigned long __user *)arg;
    unsigned long now = 0;

    switch (cmd) {
    case GET_XTIME:
        now = get_seconds();
        if (copy_to_user(argp, &now, sizeof(now)))
            return -EFAULT;
        break;
    default:
        return -ENOTTY;
    }

    return 0;
}

static const struct file_operations dev_fops = {
    .owner      = THIS_MODULE,
    .read       = dev_read,
    .write      = dev_write,
    .open       = dev_open,
    .unlocked_ioctl = dev_ioctl,
};

static struct cdev char_dev;
static int major;

static int __init char_dev_init(void)
{
    int rc;
    dev_t devid;

    rc = alloc_chrdev_region(&devid, 0, 1, "char_dev");
    if (rc != 0)
    {
        printk(KERN_ERR "alloc chardev region failed\n");
        return rc;
    }
    major = MAJOR(devid);

    cdev_init(&char_dev, &dev_fops);
    rc = cdev_add(&char_dev, devid, 1);
    if (rc != 0)
    {
        printk(KERN_ERR "cdev_add failed\n");
        unregister_chrdev_region(devid, 1);
        return rc;
    }

    /* The major number is needed to create the device node with mknod. */
    printk(KERN_INFO "char_dev: major = %d\n", major);

    return 0;
}

static void __exit char_dev_exit(void)
{
    cdev_del(&char_dev);
    unregister_chrdev_region(MKDEV(major,0), 1);
}

module_init(char_dev_init);
module_exit(char_dev_exit);

MODULE_LICENSE("GPL");

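Build the module and insmod it into the kernel. The module does not create a device node itself, so /dev/char_dev has to be created by hand with mknod, using the major number listed for char_dev in /proc/devices. Then write an application as follows: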

#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <fcntl.h>

#define GET_XTIME 0

int main(void)
{
    unsigned long now = 0;
    struct timeval tv;
    int fd;

    /* Time as seen through gettimeofday (xtime + clocksource supplement). */
    gettimeofday(&tv, NULL);
    printf("tv.tv_sec = %ld\n", (long)tv.tv_sec);
    printf("tv.tv_usec = %ld\n", (long)tv.tv_usec);

    fd = open("/dev/char_dev", O_RDWR);
    if (fd < 0)
    {
        printf("open failed\n");
        return 1;
    }

    /* Raw xtime.tv_sec as returned by get_seconds() in the module. */
    if (ioctl(fd, GET_XTIME, &now) < 0)
    {
        printf("ioctl failed\n");
        close(fd);
        return 1;
    }

    printf("xtime.tv_sec = %lu\n", now);

    close(fd);
    return 0;
}

It prints both the time from gettimeofday and the kernel's xtime.
I compiled it and ran it three times on the device:

# ./dev_tool
tv.tv_sec = 1427286832
tv.tv_usec = 617831
xtime.tv_sec = 1427286754
#
#
# ./dev_tool
tv.tv_sec = 1427286835
tv.tv_usec = 17649
xtime.tv_sec = 1427286754
#
#
# ./dev_tool
tv.tv_sec = 1427286840
tv.tv_usec = 281584
xtime.tv_sec = 1427286754

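Clearly xtime has stopped. So why does the time from gettimeofday still advance?
As analysed above, to improve resolution, gettimeofday time = xtime + a supplementary time converted from the cycles returned by clocksource->read.

So let's look at the clocksource registered on our device: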

static cycle_t
timer_get_cycles( struct clocksource *cs )
{
    return __raw_readl( IO_ADDRESS( REG_TIMER_TMR2DL ));
}

static struct clocksource timer_clocksource =
{
    .name   = MTAG_TIMER,
    .rating = 300,
    .read   = timer_get_cycles,
    .mask   = CLOCKSOURCE_MASK( 32 ),
    .flags  = CLOCK_SOURCE_IS_CONTINUOUS,
};
static u32 notrace
update_sched_clock( void )
{
    return __raw_readl(IO_ADDRESS( REG_TIMER_TMR2DL ));
}

static int __init
timer_clocksource_init( void )
{
    u32 val = 0, mode = 0;

    timer_stop( 2 );
    __raw_writel( 0xffffffff, IO_ADDRESS( REG_TIMER_TMR2TGT ));      // free-running timer as clocksource
    val = __raw_readl( IO_ADDRESS( REG_TIMER_TMRMODE ));
    mode = ( val & ~( 0x0f << TIMER2_MODE_OFFSET )) | TIMER2_CONTINUOUS_MODE;
    __raw_writel( mode, IO_ADDRESS( REG_TIMER_TMRMODE ));
    timer_start( 2 );

    setup_sched_clock( update_sched_clock, 32, 24000000 );
    if(clocksource_register_hz( &timer_clocksource, 24000000 ))
    {
        panic("%s: can't register clocksource\n", timer_clocksource.name);
    }

    return 0;
}

As this shows, clocksource->read returns at most 0xffffffff cycles, and the timer runs at 24 MHz, so the counter covers 2^32 / 24,000,000 ≈ 178.9 s before it wraps.

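This number got me excited, because of the symptom mentioned earlier: the date time runs forward about 3 minutes and then jumps back. Now it can be explained:
xtime in the kernel has stopped, but the clocksource supplement can add at most 178.9 s,
so the time returned by gettimeofday advances at most 178.9 s beyond xtime. When the cycle delta overflows it starts again from 0, and the time returns to xtime and starts over.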
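The 178.9 s figure and the jump back can be checked with a few lines of integer arithmetic. This is a sketch under the assumptions stated in the text (a 32-bit counter at 24 MHz and an xtime that is never updated); the function names are made up for illustration.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Time a 32-bit free-running counter at `hz` covers before it wraps. */
uint64_t wrap_period_ns(uint64_t hz)
{
    return ((1ULL << 32) * NSEC_PER_SEC) / hz;
}

/* Offset that gets added to a frozen xtime after `elapsed_s` seconds of
 * real time: only the low 32 bits of the cycle count are visible, so the
 * offset folds back to 0 once per wrap period. */
uint64_t offset_with_frozen_xtime_ns(uint64_t elapsed_s, uint64_t hz)
{
    uint64_t cycles = (elapsed_s * hz) & 0xffffffffULL;

    return cycles * NSEC_PER_SEC / hz;
}
```

With hz = 24000000, wrap_period_ns gives roughly 178.96 s. After 100 s of real time the offset is still 100 s, but after 200 s it has folded back to about 21 s, which is the backward jump that date shows.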

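So the question becomes: why did xtime stop being updated?
xtime is updated from the kernel timer interrupt, specifically by update_wall_time in timekeeping.c; each timer interrupt adds the newly elapsed time to xtime.
Could the timer interrupt be gone?
I checked /proc/interrupts several times (it contains company device details, so I won't paste it here), and sure enough the timer interrupt count was not changing.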
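Comparing two snapshots of the timer line in /proc/interrupts can be automated. The sketch below is an assumption-laden illustration, not the check that was actually used: it assumes the usual layout of an IRQ label, a colon, one count per CPU, and then the controller and device names, and the helper names are made up.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Sum the per-CPU counters on one /proc/interrupts line.
 * Returns -1 if the line has no "label:" prefix. */
long long irq_line_total(const char *line)
{
    const char *p = strchr(line, ':');
    long long total = 0;

    if (p == NULL)
        return -1;
    p++;
    for (;;) {
        char *end;
        long long value;

        while (*p == ' ' || *p == '\t')
            p++;
        if (!isdigit((unsigned char)*p))
            break;                      /* reached the controller name */
        value = strtoll(p, &end, 10);
        /* A counter must be a whole token; otherwise it is part of a name. */
        if (*end != ' ' && *end != '\t' && *end != '\n' && *end != '\0')
            break;
        total += value;
        p = end;
    }
    return total;
}

/* True when the counter did not move between two snapshots of the same line. */
bool irq_is_stalled(const char *before, const char *after)
{
    return irq_line_total(before) == irq_line_total(after);
}
```

The two snapshots have to be taken far enough apart that a healthy timer would certainly have fired in between.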

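Using the register access command built into our kernel, I inspected the timer block's registers and found that the status register of the timer that generates the tick interrupt read stop.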

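To verify that the missing timer interrupt causes the bug, I first started the timer again by hand on a faulty device, and the device recovered. Then I stopped the timer on a healthy device, and the bug reproduced. So the missing timer interrupt is the essence of this bug.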

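Having started by suspecting the NIC driver, after a chain of guesses and checks I had finally pinned down the cause of the bug: the timer interrupt was gone!
A small milestone victory, haha.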

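But the problem was not solved yet: where in the kernel code does the timer get stopped?
My first thought was that for the timer to stop, software would have to set the stop register. So I only needed to find the kernel's stop-timer interface and see who calls it to narrow things down. The kernel's timer interrupt code is as follows: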

static void
timer_set_mode( enum clock_event_mode mode, struct clock_event_device *evt )
{
    u32 val = 0, timermode = 0;
    val = __raw_readl( IO_ADDRESS( REG_TIMER_TMRMODE ));
    switch( mode )
    {
    case CLOCK_EVT_MODE_PERIODIC:
        timer_stop( 1 );
        timermode = ( val & ~( 0x0f << TIMER1_MODE_OFFSET )) | TIMER1_PERIODICAL_MODE;
        __raw_writel( TIMER1_TARGET, IO_ADDRESS( REG_TIMER_TMR1TGT ));
        __raw_writel( timermode, IO_ADDRESS( REG_TIMER_TMRMODE ));
        timer_start( 1 );
        break;
    case CLOCK_EVT_MODE_ONESHOT:
        timer_stop( 1 );
        timermode = ( val & ~( 0x0f << TIMER1_MODE_OFFSET )) | TIMER1_ONE_SHOT_MODE;
        __raw_writel( TIMER1_TARGET, IO_ADDRESS( REG_TIMER_TMR1TGT ));
        __raw_writel( timermode, IO_ADDRESS( REG_TIMER_TMRMODE ));
        timer_start( 1 );
        break;
    case CLOCK_EVT_MODE_UNUSED:
    case CLOCK_EVT_MODE_SHUTDOWN:
    default:
        VLOGD( MTAG_TIMER, "time stop and clr src pnd. mode = %d", mode );
        timer_stop(1);
        timer_clr_pnd(1);
        VLOGD( MTAG_TIMER, "REG_TIMER_TMREN is %u; REG_TIMER_TMRPND is %u", \
                            readl(IO_ADDRESS( REG_TIMER_TMREN )), readl(IO_ADDRESS( REG_TIMER_TMRPND )));
        break;
    }

}

static int
timer_set_next_event( unsigned long cycles, struct clock_event_device *evt )
{
    timer_stop( 1 );
    __raw_writel( cycles, IO_ADDRESS( REG_TIMER_TMR1TGT ));
    timer_start( 1 );

    return 0;
}

static struct clock_event_device timer_clockevent =
{
    .name           = MTAG_TIMER,
    .features       = CLOCK_EVT_FEAT_PERIODIC | CLOCK_EVT_FEAT_ONESHOT,
    .rating         = 200,
    .set_mode       = timer_set_mode,
    .set_next_event = timer_set_next_event,
};
static void __init
timer_clockevent_init( void )
{
    clockevents_calc_mult_shift( &timer_clockevent, CLOCK_TICK_RATE, 4 );

    timer_clockevent.max_delta_ns = clockevent_delta2ns( 0xffffffff, &timer_clockevent );
    timer_clockevent.min_delta_ns = clockevent_delta2ns( CLOCKEVENT_MIN_DELTA, &timer_clockevent );
    timer_clockevent.cpumask      = cpumask_of( 0 );
    clockevents_register_device( &timer_clockevent );
}
....
static void
timer_clock_event_interrupt( void )
{
       struct clock_event_device *evt = &timer_clockevent;

       timer_clr_pnd( 1 );
       evt->event_handler( evt );
}


static irqreturn_t
timer_interrupt( int irq, void *dev_id )
{
    u32 srcpnd = 0;
    struct clock_event_device *evt = &timer_clockevent;

    srcpnd = __raw_readl(IO_ADDRESS( REG_TIMER_TMRPND ));
    if( srcpnd & TIMER1_EVENT )
    {
    	timer_clock_event_interrupt();
    }

    __raw_writel( srcpnd, IO_ADDRESS( REG_TIMER_TMRPND ));

    return IRQ_HANDLED;
}

static struct irqaction timer_irq =
{
    .name    = "timer",
    .flags   = IRQF_DISABLED | IRQF_TIMER | IRQF_IRQPOLL,
    .handler = timer_interrupt,
    .dev_id  = NULL,
};

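Only the clockevent and timer irq handling code is shown here. The timer is stopped only in set_next_event and set_mode. set_next_event is called from evt->event_handler in timer_interrupt to set the time of the next interrupt; set_mode sets the timer's operating mode.
My intuition was that set_mode is only used during timer initialisation, while set_next_event is used on every timer interrupt. So I decided to add a print to set_mode to see who calls it (I guessed set_mode is called rarely; set_next_event cannot take a print because there are too many timer interrupts).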

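After adding the print to set_mode and rebuilding the kernel, I booted a device and saw that set_mode is only called during kernel boot; once at the console it is never called again.
This effectively rules out set_mode, because the timer interrupt was always observed to stop after user space was up, and the printk timestamps during kernel boot were normal.
With the timer interrupt stopping in user space after boot, from the software side the problem could only lie in the timer interrupt handling.
Yet, reading the code, timer_interrupt does not stop the timer either.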

Still, I thought of one scenario that would produce a stopped timer:

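timer_interrupt calls timer_clock_event_interrupt, which calls clockevent->event_handler, which in turn calls clockevent->set_next_event.
set_next_event programs the next expiry and starts the timer; control then returns to timer_interrupt, which clears the timer interrupt and leaves the handler.
If the programmed expiry is short enough (the kernel is tickless, so it differs each time), the next interrupt is raised before the clear, and the clear then wipes it out.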

After the handler returns, no further timer interrupt is ever generated.
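The lost interrupt can be reproduced in a tiny user-space model. This is a deliberately simplified sketch, not the driver: struct sim_timer and the helper names are hypothetical, and it assumes the one-shot behaviour described in this post, where the timer stops itself and raises the pending bit when it reaches its target.

```c
#include <stdbool.h>

/* Hypothetical model of the one-shot hardware timer. */
struct sim_timer {
    bool running;   /* counting toward the target */
    bool pending;   /* interrupt pending bit */
};

/* Reaching the target stops the timer and raises the pending bit. */
static void sim_expire(struct sim_timer *t)
{
    if (t->running) {
        t->running = false;
        t->pending = true;
    }
}

/* Original handler: clear, re-arm via set_next_event, then clear again.
 * Returns true if another interrupt can still arrive afterwards. */
bool buggy_handler_keeps_ticking(bool expires_before_final_clear)
{
    struct sim_timer t = { .running = false, .pending = true };

    t.pending = false;              /* timer_clr_pnd(1) */
    t.running = true;               /* event_handler -> set_next_event */
    if (expires_before_final_clear)
        sim_expire(&t);             /* very short timeout fires right here */
    t.pending = false;              /* final write to TMRPND wipes it out */
    return t.running || t.pending;
}

/* Handler without the final clear: a new event always survives. */
bool fixed_handler_keeps_ticking(bool expires_before_return)
{
    struct sim_timer t = { .running = false, .pending = true };

    t.pending = false;              /* timer_clr_pnd(1) */
    t.running = true;               /* event_handler -> set_next_event */
    if (expires_before_return)
        sim_expire(&t);
    return t.running || t.pending;
}
```

In the model, the buggy handler ends with the timer stopped and nothing pending whenever the new event fires before the final clear, so no interrupt will ever re-arm it. Without the final clear, the handler always leaves the timer either running or pending.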

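But there were two things I felt needed verifying:

(1) Whether the status register reads stop once the timer count reaches its target.
(2) If this scenario is the cause, then lengthening the time between starting the timer and clearing the interrupt should make the bug reproduce sooner.

Testing the timer showed that the status register does read stop once the count reaches the target.
I then inserted some useless code as a delay between the start timer and clear intr steps in timer_interrupt (delay cannot be used, since the timer is the very thing that is broken), and the bug reproduced shortly after reaching the console.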

So the root cause is that the timer interrupt must not be cleared again after set_next_event. I changed timer_interrupt above to:

static irqreturn_t
timer_interrupt( int irq, void *dev_id )
{
    u32 srcpnd = 0;
    struct clock_event_device *evt = &timer_clockevent;

    srcpnd = __raw_readl(IO_ADDRESS( REG_TIMER_TMRPND ));
    if( srcpnd & TIMER1_EVENT )
    {
        timer_clr_pnd( 1 );
        evt->event_handler( evt );
    }

    return IRQ_HANDLED;
}

That finally fixed the bug, and the blocking of sleep and ping makes sense too:

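Both sleep and ping rely on timers. Kernel timers are implemented by the timer interrupt together with a softirq: the timer interrupt raises the timer softirq, which checks whether any timer has expired. Without the timer interrupt, the kernel's timer mechanism cannot work.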

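According to the kernel code, the scheduler's time slices are measured with ktime_get, which also uses the clocksource supplement. That time jumps back every 180 s but still advances, so kernel scheduling kept working.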

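This write-up does not go into the details of the underlying knowledge; the point is the line of thought: how I went from suspecting the NIC driver, through step-by-step analysis and elimination, to finally finding the cause in the code.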

I think that to become a systems programmer and solve bugs like this one, an open mind matters more than knowledge!
