Root-causing why pr_emerg is slow and hurts performance

Overview

This article covers the following performance debugging techniques:

  1. Using ftrace.
  2. Using lock contention tracing and lockdep.

I. Origin of the problem and debugging approach

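“While debugging VDSP DVFS, I found that when notify invoked the DVFS callback, code inside the callback ran very slowly: even with nothing at all between two log statements, the gap between them could be around 8 ms, and pinning the CPU/DDR frequency did not help.”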

Debugging approach:

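  1. Determine under what conditions the slowdown happens.
  2. Find the cause of the latency. Is a lock being held for too long?
  3. Enable kernel debug features to pinpoint the cause.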

II. Reproducing the problem

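The log provided by the camera owner showed that printing with pr_emerg was what made this take so long, and removing that log confirmed it. This kind of log is normally reserved for kernel panics or serial-console output. From earlier projects I already knew this pattern degraded performance, but I had never dug into why.
Add the following logs anywhere: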

trace_printk("samarxie  11111\n");  
pr_emerg("samarxie func=%s enter\n",__func__);  
trace_printk("samarxie  22222\n");  
pr_err("samarxie pr_err====\n");  
trace_printk("samarxie  33333\n");  
pr_emerg("samarxie func=%s exit\n",__func__);  
trace_printk("samarxie  44444\n");  

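Then build the KO or boot image, flash it to the phone, and check the log output.
Normally these messages should all appear within microseconds of each other, but the actual log shows: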

<0>[  373.701444] c0 samarxie func=dbs_check_cpu_xxx enter  
<3>[  373.706543] c0 samarxie pr_err====  
<0>[  373.706565] c0 samarxie func=dbs_check_cpu_xxx exit

The gap between

pr_emerg("samarxie func=%s enter\n",__func__);  
pr_err("samarxie pr_err====\n"); 

is very long (about 5 ms: 373.701444 to 373.706543).
The ftrace output shows the same thing:

sh-8278  [000] ..s.   373.701427: dbs_check_cpu_xxx: samarxie  11111    
sh-8278  [000] ..s.   373.706538: dbs_check_cpu_xxx: samarxie  22222 

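The trace_printk markers are also about 5 ms apart.
So pr_emerg is clearly what causes the long delay and the performance hit. But why? pr_emerg uses log level 0, and the console log level on our platform is 1. The kernel only writes a message to the console when its level is numerically lower than the console log level, so pr_emerg output goes out over the serial port while pr_err (level 3) does not.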

#define pr_emerg(fmt, ...) \
	printk(KERN_EMERG pr_fmt(fmt), ##__VA_ARGS__)

#define KERN_EMERG	KERN_SOH "0"	/* system is unusable */

III. Debugging with lockdep and lock debug configs enabled

So where exactly was the time going? I enabled the kernel's lock debugging features with the following configs:

#  
# Lock Debugging (spinlocks, mutexes, etc...)  
#  
-# CONFIG_DEBUG_RT_MUTEXES is not set  
-# CONFIG_DEBUG_SPINLOCK is not set  
-# CONFIG_DEBUG_MUTEXES is not set  
-# CONFIG_DEBUG_WW_MUTEX_SLOWPATH is not set  
-# CONFIG_DEBUG_LOCK_ALLOC is not set  
-# CONFIG_PROVE_LOCKING is not set  
-# CONFIG_LOCK_STAT is not set  
+CONFIG_DEBUG_RT_MUTEXES=y  
+CONFIG_DEBUG_SPINLOCK=y  
+CONFIG_DEBUG_MUTEXES=y  
+CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y  
+CONFIG_DEBUG_LOCK_ALLOC=y  
+CONFIG_PROVE_LOCKING=y  
+CONFIG_LOCKDEP=y  
+CONFIG_LOCK_STAT=y  
+CONFIG_DEBUG_LOCKDEP=y  

With these enabled, lock activity can be traced.
The following events are available under /sys/kernel/debug/tracing/events/lock/:

/sys/kernel/debug/tracing/events/lock # ls -l  
total 0  
-rw-r--r-- 1 root root 0 1970-01-01 08:00 enable  
-rw-r--r-- 1 root root 0 1970-01-01 08:00 filter  
drwxr-xr-x 2 root root 0 1970-01-01 08:00 lock_acquire  
drwxr-xr-x 2 root root 0 1970-01-01 08:00 lock_acquired  
drwxr-xr-x 2 root root 0 1970-01-01 08:00 lock_contended  
drwxr-xr-x 2 root root 0 1970-01-01 08:00 lock_release  

Then run the following ftrace script while reproducing the scenario:

#!/bin/bash  
adb root  
adb remount  
#low memory,buffer size should be little. otherwise lose trace event  
adb shell "echo 60000 > /d/tracing/buffer_size_kb"  
  
#adb shell "echo sched_switch sched_wakeup sched_waking > /d/tracing/set_event"  
adb shell "echo lock> /d/tracing/set_event"  
  
adb shell "echo > /d/tracing/trace"  
adb shell "echo 0 > /d/tracing/tracing_on"  
sleep 1  
  
echo "start fetching trace information for $1s"  
adb shell "echo 1 > /d/tracing/tracing_on"  
  
#set fetch trace timeout  
echo "do anything"  
sleep $1  
  
adb shell "echo 0 > /d/tracing/tracing_on"  
echo "fetching trace information done"  
echo "save trace to /data/trace.txt file"  
adb shell "cat /d/tracing/trace > /data/trace.txt"  
echo "all fetch over!"  
echo "start pull file in current dir"  
time=`date '+%T'`  
adb pull /data/trace.txt $time-trace.txt  
echo "pull file name is $time-trace.txt"  

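The trace shows that the time is spent while the console lock is held (the lockdep map that appears in the trace is console_owner):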

sh-8278  [000] ..s.   373.701427: dbs_check_cpu_xxx: samarxie  11111                                                                                                 
                <0>[  373.701444] c0 samarxie func=dbs_check_cpu_xxx enter  
sh-8278  [000] d.s1   373.701477: lock_acquire: 00000000707bef2b console_owner  
sh-8278  [000] d.s1   373.706452: lock_release: 00000000707bef2b console_owner  
  
sh-8278  [000] ..s.   373.706538: dbs_check_cpu_xxx: samarxie  22222                                                                                                 
                <3>[  373.706543] c0 samarxie pr_err====  
sh-8278  [000] ..s.   373.706560: dbs_check_cpu_xxx: samarxie  33333  
  
        <0>[  373.706565] c0 samarxie func=dbs_check_cpu_xxx exit  
sh-8278  [000] d.s1   373.706584: lock_acquire: 00000000707bef2b console_owner  
sh-8278  [000] d.s1   373.711450: lock_release: 00000000707bef2b console_owner  
  
sh-8278  [000] ..s.   373.711619: dbs_check_cpu_xxx: samarxie  44444      

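From this we can see:

  1. pr_emerg first stores the message in the log buffer and only then writes it to the serial port, so the message shows up in dmesg before the console lock is taken.
  2. Every pr_emerg call takes the console lock once, and each hold takes about the same time (roughly 5 ms: 373.701477 to 373.706452, and 373.706584 to 373.711450).
  3. The following trace_printk and pr_err output only appears after the lock is released.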

IV. Latency analysis

Next, find out why the lock is held for so long:

static struct lockdep_map console_owner_dep_map = {  
        .name = "console_owner"  
};  
/** 
 * console_lock_spinning_enable - mark beginning of code where another 
 *      thread might safely busy wait 
 * 
 * This basically converts console_lock into a spinlock. This marks 
 * the section where the console_lock owner can not sleep, because 
 * there may be a waiter spinning (like a spinlock). Also it must be 
 * ready to hand over the lock at the end of the section. 
 */  
static void console_lock_spinning_enable(void)  
{  
        raw_spin_lock(&console_owner_lock);  
        console_owner = current;  
        raw_spin_unlock(&console_owner_lock);  
  
        /* The waiter may spin on us after setting console_owner */  
        spin_acquire(&console_owner_dep_map, 0, 0, _THIS_IP_);  
}  
  
/** 
 * console_lock_spinning_disable_and_check - mark end of code where another 
 *      thread was able to busy wait and check if there is a waiter 
 * 
 * This is called at the end of the section where spinning is allowed. 
 * It has two functions. First, it is a signal that it is no longer 
 * safe to start busy waiting for the lock. Second, it checks if 
 * there is a busy waiter and passes the lock rights to her. 
 * 
 * Important: Callers lose the lock if there was a busy waiter. 
 *      They must not touch items synchronized by console_lock 
 *      in this case. 
 * 
 * Return: 1 if the lock rights were passed, 0 otherwise. 
 */  
static int console_lock_spinning_disable_and_check(void)  
{  
        int waiter;  
  
        raw_spin_lock(&console_owner_lock);  
        waiter = READ_ONCE(console_waiter);  
        console_owner = NULL;  
        raw_spin_unlock(&console_owner_lock);  
  
        if (!waiter) {  
                spin_release(&console_owner_dep_map, 1, _THIS_IP_);  
                return 0;  
        }  
  
        /* The waiter is now free to continue */  
        WRITE_ONCE(console_waiter, false);  
  
        spin_release(&console_owner_dep_map, 1, _THIS_IP_);  
  
        /* 
         * Hand off console_lock to waiter. The waiter will perform 
         * the up(). After this, the waiter is the console_lock owner. 
         */  
        mutex_release(&console_lock_dep_map, 1, _THIS_IP_);  
        return 1;  
}  
  
  
// Called from console_unlock() as follows:  
                /* 
                 * While actively printing out messages, if another printk() 
                 * were to occur on another CPU, it may wait for this one to 
                 * finish. This task can not be preempted if there is a 
                 * waiter waiting to take over. 
                 */  
                console_lock_spinning_enable();  
  
                stop_critical_timings();        /* don't trace print latency */  
                call_console_drivers(ext_text, ext_len, text, len);  
                start_critical_timings();  
  
                if (console_lock_spinning_disable_and_check()) {  
                        printk_safe_exit_irqrestore(flags);  
                        goto out;  
                }  

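Narrowing it down further, the expensive part is call_console_drivers(). But where exactly inside it? To see which function is slow, capture a function_graph trace with the following script: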

#!/bin/bash  
adb root  
adb remount  
#low memory,buffer size should be little. otherwise lose trace event  
adb shell "echo 60000 > /d/tracing/buffer_size_kb"  
  
#adb shell "echo sched_switch sched_wakeup sched_waking > /d/tracing/set_event"  
#adb shell "echo lock> /d/tracing/set_event"  
adb shell "echo nop > /d/tracing/set_event"  
adb shell "echo > /d/tracing/trace"  
adb shell "echo function_graph > /d/tracing/current_tracer"  
adb shell "echo console_unlock > /d/tracing/set_graph_function"  
adb shell "echo 0 > /d/tracing/tracing_on"  
sleep 1  
  
echo "start fetching trace information for $1s"  
adb shell "echo 1 > /d/tracing/tracing_on"  
  
#set fetch trace timeout  
echo "do anything"  
sleep $1  
  
adb shell "echo 0 > /d/tracing/tracing_on"  
echo "fetching trace information done"  
echo "save trace to /data/trace.txt file"  
adb shell "cat /d/tracing/trace > /data/trace.txt"  
echo "all fetch over!"  
echo "start pull file in current dir"  
time=`date '+%T'`  
adb pull /data/trace.txt func-graphic-$time-trace.txt  
echo "pull file name is func-graphic-$time-trace.txt"  

Per-function timings, taken from the trace:

 0)               |  console_unlock() {
 0)               |    __printk_safe_enter() {
 0)   1.192 us    |      preempt_count_add();
 0)   0.384 us    |      preempt_count_sub();
 0) + 12.616 us   |    }
 0)               |    _raw_spin_lock() {
 0)   0.384 us    |      preempt_count_add();
 0)   0.423 us    |      do_raw_spin_trylock();
 0)   9.000 us    |    }
 0)               |    msg_print_text() {
 0) + 11.000 us   |      print_prefix();
 0)   4.962 us    |      print_prefix();
 0) + 24.692 us   |    }
 0)               |    _raw_spin_unlock() {
 0)   0.462 us    |      do_raw_spin_unlock();
 0)   0.385 us    |      preempt_count_sub();
 0)   7.961 us    |    }
 0)               |    console_lock_spinning_enable() { // takes console_owner
 0)               |      _raw_spin_lock() {
 0)   0.385 us    |        preempt_count_add();
 0)   0.423 us    |        do_raw_spin_trylock();
 0)   8.154 us    |      }
 0)               |      _raw_spin_unlock() {
 0)   0.462 us    |        do_raw_spin_unlock();
 0)   0.385 us    |        preempt_count_sub();
 0)   7.885 us    |      }
 0) + 25.077 us   |    }
 0)               |  /* samarxie  call_console_drivers start */
 0)               |    xxx_console_write() {
 0)               |      _raw_spin_lock_irqsave() {
 0)   0.384 us    |        preempt_count_add();
 0)   2.038 us    |        do_raw_spin_trylock();
 0)   9.654 us    |      }
 0)               |      uart_console_write() {
 0)               |        xxx_console_putchar() {
 0)   1.500 us    |          __delay();
 0)   6.808 us    |        }
 0)               |        xxx_console_putchar() {
 0)   1.462 us    |          __delay();
 0)   5.461 us    |        }
 0)               |        xxx_console_putchar() { // slow call
 0)   1.500 us    |          __delay();
 0)   1.423 us    |          __delay();
 0)   1.500 us    |          __delay();
 0)   1.423 us    |          __delay();
 0)   1.462 us    |          __delay();
 0)   1.423 us    |          __delay();
 0)   1.423 us    |          __delay();
 0)   1.423 us    |          __delay();
 0)   1.500 us    |          __delay();
 0)   1.423 us    |          __delay();
 0)   1.461 us    |          __delay();
 0)   1.423 us    |          __delay();
 0)   1.500 us    |          __delay();
 0)   1.423 us    |          __delay();
 0)   1.500 us    |          __delay();
 0)   1.500 us    |          __delay();
 0) + 80.808 us   |        }
 0)               |        xxx_console_putchar() {
 0)   1.500 us    |          __delay();
 0)   1.500 us    |          __delay();
 0)   1.423 us    |          __delay();
 0)   1.423 us    |          __delay();
 0)   1.423 us    |          __delay();
 0)   1.500 us    |          __delay();
 0)   1.500 us    |          __delay();
 0)   1.462 us    |          __delay();
 0)   1.423 us    |          __delay();
 0)   1.500 us    |          __delay();
 0)   1.423 us    |          __delay();
 0)   1.423 us    |          __delay();
 0)   1.462 us    |          __delay();
 0)   1.423 us    |          __delay();
 0)   1.461 us    |          __delay();
 0) + 86.731 us   |        }
……………………………………………….
 0)   1.500 us    |      __delay();
 0)   1.423 us    |      __delay();
 0)               |      _raw_spin_unlock_irqrestore() {
 0)   0.462 us    |        do_raw_spin_unlock();
 0)   0.385 us    |        preempt_count_sub();
 0)   7.961 us    |      }
 0) # 4977.385 us |    }
 0)               |    pstore_console_write() {
 0)               |      __getnstimeofday64() {
 0)   0.500 us    |        arch_counter_read();
 0)   6.231 us    |      }
 0)               |      buffer_size_add() {
 0)               |        _raw_spin_lock_irqsave() {
 0)   0.423 us    |          preempt_count_add();
 0)   0.462 us    |          do_raw_spin_trylock();
 0)   8.961 us    |        }
 0)               |        _raw_spin_unlock_irqrestore() {
 0)   0.462 us    |          do_raw_spin_unlock();
 0)   0.385 us    |          preempt_count_sub();
 0)   8.346 us    |        }
 0) + 26.885 us   |      }
 0)               |      buffer_start_add() {
 0)               |        _raw_spin_lock_irqsave() {
 0)   0.423 us    |          preempt_count_add();
 0)   0.423 us    |          do_raw_spin_trylock();
 0)   9.154 us    |        }
 0)               |        _raw_spin_unlock_irqrestore() {
 0)   0.461 us    |          do_raw_spin_unlock();
 0)   0.385 us    |          preempt_count_sub();
 0)   7.923 us    |        }
 0) + 25.462 us   |      }
 0)   1.385 us    |      __memcpy_toio();
 0) + 88.423 us   |    }
 0)               |  /* samarxie  call_console_drivers end */
 0)               |    console_lock_spinning_disable_and_check() { // releases console_owner

So xxx_console_putchar() is where the time goes: the calls that have to wait take about 80 to 87 us each.
Here is its code:

static void wait_for_xmitr(struct uart_port *port)  
{  
        unsigned int status, tmout = 10000;  
  
        /* wait up to 10ms for the character(s) to be sent */  
        do {  
                status = serial_in(port, XXX_STS1);  
                if (--tmout == 0)  
                        break;  
                udelay(1);   // shows up in the trace as __delay()
        } while (status & xxx_TX_FIFO_CNT_MASK);  
}  
  
static void xxx_console_putchar(struct uart_port *port, int ch)  
{  
        wait_for_xmitr(port);  
        serial_out(port, xxx_TXD, ch);  
}  

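As an experiment I removed the udelay(), and the latency did not drop, so the time goes into the serial register accesses and the wait for the TX FIFO to drain.
The cost is inherent in the hardware: the loop cannot finish until the UART has shifted the previous character out at the configured baud rate.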

So the possible mitigations are:

  1. Increase the baud rate.
  2. Switch from polling to DMA plus interrupt-driven transmission.

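Finally, using log levels correctly matters a great deal for performance: here a single pr_emerg in a hot path blocked the caller for about 5 ms.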
