latencytop: a deep look at latency on your Linux system

Original article. If you repost it, please credit the source: 系統技術非業餘研究



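When tuning a system or tracking down a problem, we often find that a multithreaded program runs inefficiently without knowing where the trouble lies. All we know is that there are a lot of context switches; why they happen and who causes them, we cannot tell. Context switches can be watched with a tool such as dstat (the csw column), for example: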

$dstat
----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  9   2  87   2   0   1|7398k   31M|   0     0 | 9.8k   11k|  16k   64k
 20   4  69   3   0   4|  26M   56M|  34M  172M|   0     0 |  61k  200k
 21   5  64   6   0   3|  26M  225M|  35M  175M|   0     0 |  75k  216k
 21   5  66   4   0   4|  25M  119M|  34M  173M|   0     0 |  66k  207k
 19   4  68   5   0   3|  23M   56M|  33M  166M|   0     0 |  60k  197k
 
# Or count them with a systemtap script
$sudo stap -e 'global cnt; probe scheduler.cpu_on {cnt<<<1;} probe timer.s(1){printf("%d\n", @count(cnt)); delete cnt;}'
217779
234141
234759
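
For a quick cross-check without dstat or systemtap, the cumulative `ctxt` counter in /proc/stat can be sampled twice. The following is only a minimal sketch: the helper names are made up for illustration, the sample text is invented so the demo runs anywhere, and the result should be close to, though not necessarily identical to, what the systemtap probe counts.

```python
import time


def read_ctxt(stat_text: str) -> int:
    """Return the cumulative context-switch count from /proc/stat text."""
    for line in stat_text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] == "ctxt":
            return int(fields[1])
    raise ValueError("no 'ctxt' line found")


def switches_per_second(before: int, after: int, seconds: float) -> float:
    """Turn two cumulative samples into a rate."""
    return (after - before) / seconds


def sample_rate(path: str = "/proc/stat", interval: float = 1.0) -> float:
    """Sample the live counter twice (Linux only; not called in the demo)."""
    with open(path) as f:
        first = read_ctxt(f.read())
    time.sleep(interval)
    with open(path) as f:
        second = read_ctxt(f.read())
    return switches_per_second(first, second, interval)


# Demo on invented text, one second apart.
SAMPLE_A = "cpu  1 2 3 4\nctxt 1000000\nbtime 1301400000\n"
SAMPLE_B = "cpu  1 2 3 4\nctxt 1217779\nbtime 1301400000\n"
print(switches_per_second(read_ctxt(SAMPLE_A), read_ctxt(SAMPLE_B), 1.0))
```

On a real machine, `sample_rate()` reads the live counter instead of the invented samples.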

Roughly 200k context switches per second. Can anyone tell me what is going on? Well, latencytop to the rescue!

Its official site: http://www.latencytop.org/

Skipping audio, slower servers, everyone knows the symptoms of latency. But to know what’s going on in the system, what’s causing the latency, how to fix it… that’s a hard question without good answers right now.

LatencyTOP is a Linux* tool for software developers (both kernel and userspace), aimed at identifying where in the system latency is happening, and what kind of operation/action is causing the latency to happen so that the code can be changed to avoid the worst latency hiccups.

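It is another performance viewer contributed by Intel, alongside powertop; both are very good tools.

When the kernel performs a context switch, latencytop records the kernel stack of the task being switched out, then matches the functions on that stack to work out what caused the switch. It also keeps a table of functions covering dozens of scenarios that commonly trigger a switch, which makes it much easier to pin down a problem when diagnosing a system.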

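latencytop has two parts: a kernel part and a userspace part. The kernel part collects the call stacks and exposes them through /proc; the userspace part displays them.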
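
To make the split concrete, here is a rough sketch of the first thing a userspace reader has to do: parse /proc/latency_stats. It assumes each record line is `count total_us max_us` followed by the function names of the backtrace, which is my reading of the kernel side rather than something this post spells out. The sample text and the `LatencyRecord` name are invented for illustration.

```python
from typing import List, NamedTuple


class LatencyRecord(NamedTuple):
    count: int            # how many times this stack was recorded
    total_us: int         # accumulated latency, assumed microseconds
    max_us: int           # worst single latency, assumed microseconds
    backtrace: List[str]  # kernel functions, innermost first


def parse_latency_stats(text: str) -> List[LatencyRecord]:
    """Parse the text of /proc/latency_stats into records."""
    records = []
    for line in text.splitlines():
        fields = line.split()
        # Skip the "Latency Top version : v0.1" header and blank lines.
        if len(fields) < 3 or not all(f.isdigit() for f in fields[:3]):
            continue
        count, total_us, max_us = (int(f) for f in fields[:3])
        records.append(LatencyRecord(count, total_us, max_us, fields[3:]))
    return records


# Invented sample; real data would come from open("/proc/latency_stats").
SAMPLE = """Latency Top version : v0.1
12 4800 900 do_futex sys_futex system_call_fastpath
3 150 80 pipe_wait pipe_read do_sync_read vfs_read sys_read
"""

for rec in sorted(parse_latency_stats(SAMPLE), key=lambda r: r.total_us, reverse=True):
    print(rec.count, rec.total_us // rec.count, rec.max_us, rec.backtrace[0])
```

The demo prints count, average, maximum and the innermost function for each stack, sorted by total latency.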

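[Screenshot: the latencytop working interface]

latencytop has been part of the kernel since 2.6.25; it only needs to be enabled at build time (CONFIG_LATENCYTOP). How do you check whether it is?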

$ cat /proc/latency_stats
Latency Top version : v0.1

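If you see this, you are set. Sadly, RHEL6 ships the latencytop userspace tool but does not enable the kernel build option, which is rather embarrassing.
On Ubuntu it can be installed like this: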

$ uname -r
2.6.38-yufeng
$ apt-get install latencytop
$ sudo latencytop # and it is ready to use

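One silly thing about latencytop is that it starts the graphical interface by default. We are not used to that and want the text interface, so let's do it ourselves!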

$ apt-get source latencytop
$ diff -up Makefile.orig Makefile
--- Makefile.orig    2011-03-29 20:10:29.025845447 +0800
+++ Makefile    2011-03-28 14:48:11.232318002 +0800
@@ -1,5 +1,5 @@
 # FIXME: Use autoconf ?
-HAS_GTK_GUI = 0
+#HAS_GTK_GUI = 0
  
 DESTDIR =
 SBINDIR = /usr/sbin

Run make again and the text interface appears. See man latencytop for usage details.

fcicq says:

Just add the --nogui option. There is no need to recompile.

Thanks!

So how many latency causes does latencytop know about? latencytop.trans will tell you. We can also edit this file ourselves and add new latency causes.

$ cat /usr/share/latencytop/latencytop.trans
#
1    vfs_read        Reading from file
1    vfs_write        Writing to file
1    __mark_inode_dirty    Marking inode dirty
1    vfs_readdir        Reading directory content
1    vfs_unlink        Unlinking file
1    blocking_notifier_call_chain    Blocking notifier
1    lock_super        Superblock lock contention
1    vfs_create        Creating a file
1    KAS_ScheduleTimeout    Binary AMD driver delay
1    firegl_lock_device    Binary AMD driver delay
#
2    __bread            Synchronous buffer read
2    do_generic_mapping_read    Reading file data
2    sock_sendmsg        Sending data over socket
2    do_sys_open        Opening file
2    do_sys_poll        Waiting for event (poll)
2    core_sys_select        Waiting for event (select)
2    proc_reg_read        Reading from /proc file
2    __pollwait        Waiting for event (poll)
2    sys_fcntl        FCNTL system call
2    scsi_error_handler    SCSI error handler
2    proc_root_readdir    Reading /proc directory
2    ksoftirqd        Waking ksoftirqd
2    worker_thread        .
2    do_unlinkat        Unlinking file
2    __wait_on_buffer    Waiting for buffer IO to complete
2    pdflush            pdflush() kernel thread
2    kjournald        kjournald() kernel thread
2    blkdev_ioctl        block device IOCTL
2    kauditd_thread        kernel audit daemon
2    tty_ioctl        TTY IOCTL
2    __filemap_fdatawrite_range fdatasync system call
2    do_sync_write        synchronous write
2    kthreadd        kthreadd kernel thread
2    usb_port_resume        Waking up USB device
2    usb_autoresume_device    Waking up USB device
2    kswapd            kswapd() kernel thread
2    md_thread        Raid resync kernel thread
2    i915_wait_request    Waiting for GPU command to complete
2    request_module        Loading a kernel module
 
#
3    tty_wait_until_sent    Waiting for TTY to finish sending
3    pipe_read        Reading from a pipe
3    pipe_write        Writing to a pipe
3    pipe_wait        Waiting for pipe data
3    read_block_bitmap    Reading EXT3 block bitmaps
3    scsi_execute_req    Executing raw SCSI command
3    sys_wait4        Waiting for a process to die
3    sr_media_change        Checking for media change
3    sr_do_ioctl        SCSI cdrom ioctl
3    sd_ioctl        SCSI disk ioctl
3    sr_cd_check        Checking CDROM media present
3    ext3_read_inode        Reading EXT3 inode
3    htree_dirblock_to_tree    Reading EXT3 directory htree
3    ext3_readdir        Reading EXT3 directory
3    ext3_bread        Synchronous EXT3 read
3    ext3_free_branches    Unlinking file on EXT3
3    ext3_get_branch        Reading EXT3 indirect blocks
3    ext3_find_entry        EXT3: Looking for file
3    __ext3_get_inode_loc    Reading EXT3 inode
3    ext3_delete_inode    EXT3 deleting inode
3    sync_page        Writing a page to disk
3    tty_poll        Waiting for TTY data
3    tty_read        Waiting for TTY input
3    tty_write        Writing data to TTY
3    update_atime        Updating inode atime
3    page_cache_sync_readahead    Pagecache sync readahead
3    do_fork            Fork() system call
3    sys_mkdirat        Creating directory
3    lookup_create        Creating file
3    inet_sendmsg        Sending TCP/IP data
3    tcp_recvmsg        Receiving TCP/IP data
3    link_path_walk        Following symlink
3    path_walk        Walking directory tree
3    sys_getdents        Reading directory content
3    unix_stream_recvmsg    Waiting for data on unix socket
3    ext3_mkdir        EXT3: Creating a directory
3    journal_get_write_access    EXT3: Waiting for journal access
3    synchronize_rcu        Waiting for RCU
3    input_close_device    Closing input device
3    mousedev_close_device    Closing mouse device
3    mousedev_release    Closing mouse device
3    mousedev_open        Opening mouse device
3    kmsg_read        Reading from dmesg
3    sys_futex        Userspace lock contention
3    do_futex        Userspace lock contention
3    vt_waitactive        vt_waitactive IOCTL
3    acquire_console_sem    Waiting for console access
3    filp_close        Closing a file
3    sync_inode        (f)syncing an inode to disk
3    ata_exec_internal_sg    Executing internal ATA command
3    writeback_inodes    Writing back inodes
3    ext3_orphan_add     EXT3 adding orphan
3    ext3_mark_inode_dirty     EXT3 marking inode dirty
3    ext3_unlink         EXT3 unlinking file
3    ext3_create        EXT3 Creating a file
3    log_do_checkpoint    EXT3 journal checkpoint
3    generic_delete_inode    Deleting an inode
3    proc_delete_inode    Removing /proc file
3    do_truncate        Truncating file
3    sys_execve        Executing a program
3    journal_commit_transaction    EXT3: committing transaction
3    __stop_machine_run    Freezing the kernel (for module load)
3    sys_munmap        unmapping memory
3    sys_mmap        mmaping memory
3    sync_buffer        Writing buffer to disk (synchronous)
3    inotify_inode_queue_event    Inotify event
3    proc_lookup        Looking up /proc file
3    generic_make_request    Creating block layer request
3    get_request_wait    Creating block layer request
3    alloc_page_vma        Allocating a VMA
#3    __d_lookup        Looking up a dentry
3    blkdev_direct_IO    Direct block device IO
3    sys_mprotect        mprotect() system call
3    shrink_icache_memory    reducing inode cache memory footprint
3    vfs_stat_fd        stat() operation
3    cdrom_open        opening cdrom device
3    sys_epoll_wait        Waiting for event (epoll)
3    sync_sb_inodes        Syncing inodes
3    tcp_connect        TCP/IP connect
3    ata_scsi_ioctl        ATA/SCSI disk ioctl
3    do_rmdir        Removing directory
3    vfs_rmdir        Removing directory
3    sys_flock        flock() on a file
3    usbdev_open        opening USB device
3    lock_kernel        Big Kernel Lock contention
3    blk_execute_rq        Submitting block IO
3    scsi_cmd_ioctl        SCSI ioctl command
3    acpi_ec_transaction    ACPI hardware access
3    journal_get_undo_access    Waiting for EXT3 journal undo operation
3    i915_irq_wait        Waiting for GPU interrupt
3    i915_gem_throttle_ioctl    Throttling GPU while waiting for commands
 
#
#
5    do_page_fault        Page fault
5    handle_mm_fault        Page fault
5    filemap_fault        Page fault
5    sync_filesystems    Syncing filesystem
5    sys_nanosleep        Application requested delay
5    sys_pause        Application requested delay
5    evdev_read        Reading keyboard/mouse input
5    do_fsync        fsync() on a file (type 'F' for details)
5    __log_wait_for_space    Waiting for EXT3 journal space

The list of latency causes is very detailed.
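
The way such a table turns a kernel stack into a readable reason can be sketched as follows. This is an illustration, not latencytop's actual code: it assumes the first column is a priority and that the highest-priority function found on the stack wins, and the function names `parse_trans` and `explain` are hypothetical.

```python
from typing import Dict, List, Tuple


def parse_trans(text: str) -> Dict[str, Tuple[int, str]]:
    """Map function name -> (priority, reason) from latencytop.trans text."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and commented-out entries
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue
        priority, function, reason = parts
        table[function] = (int(priority), reason)
    return table


def explain(backtrace: List[str], table: Dict[str, Tuple[int, str]]) -> str:
    """Pick the reason of the highest-priority known function on the stack."""
    best_priority, best_reason = 0, ""
    for function in backtrace:
        if function in table:
            priority, reason = table[function]
            if priority > best_priority:
                best_priority, best_reason = priority, reason
    return best_reason


# A few entries copied from the listing above.
TRANS = """#
1    vfs_read        Reading from file
3    pipe_read        Reading from a pipe
3    do_futex        Userspace lock contention
#3    __d_lookup        Looking up a dentry
"""

table = parse_trans(TRANS)
print(explain(["pipe_read", "do_sync_read", "vfs_read", "sys_read"], table))
```

With these entries the pipe stack is reported as "Reading from a pipe" rather than the more generic "Reading from file", because `pipe_read` carries the higher number. A new cause would be added by appending one more `priority function description` line to the file.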

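That was everything I meant to cover, but hold on: this needs a kernel of 2.6.26 or later, and most of our production systems run RHEL 5U4 with 2.6.18. What can we do there?

Once again, systemtap comes to the rescue!

Since version 1.4, systemtap has shipped latencytap.stp, also contributed by Intel. So let's try the poor man's latencytop.
Where is it?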

$ uname -r
2.6.18-164.el5
 
$ stap -V
Systemtap translator/driver (version 1.5 /0.137 non-git sources)
Copyright (C) 2005-2011 Red Hat, Inc. and others
This is free software; see the source for copying conditions.
enabled features: AVAHI LIBRPM LIBSQLITE3 NSS BOOST_SHARED_PTR TR1_UNORDERED_MAP NLS
 
$ ls -al /usr/share/doc/systemtap/examples/profiling/latencytap.stp
-rwxr-xr-x 1 chuba users 16240 Feb 17 22:02 /usr/share/doc/systemtap/examples/profiling/latencytap.stp
 
$ sudo stap -t --all-modules /usr/share/doc/systemtap/examples/profiling/latencytap.stp
ERROR: Skipped too many probes, check MAXSKIPPED or try again with stap -t for more details.
WARNING: Number of errors: 0, skipped probes: 101
WARNING: Skipped due to global 'dequeue' lock timeout: 2
WARNING: Skipped due to global 'this_sleep' lock timeout: 99
----- probe hit report:
kernel.trace("deactivate_task")!, (/usr/share/doc/systemtap/examples/profiling/latencytap.stp:47:1), hits: 254, cycles: 680min/43327avg/2248467max, from: kernel.trace("deactivate_task")
kernel.trace("activate_task")!, (/usr/share/doc/systemtap/examples/profiling/latencytap.stp:58:1), hits: 255, cycles: 890min/502549avg/2271568max, from: kernel.trace("activate_task")
kernel.function("finish_task_switch@kernel/sched.c:1969")?, (/usr/share/doc/systemtap/examples/profiling/latencytap.stp:78:7), hits: 509, cycles: 213min/1002207avg/5382852max, from: kernel.function("finish_task_switch") from: scheduler.cpu_on
begin, (/usr/share/doc/systemtap/examples/profiling/latencytap.stp:123:1), hits: 1, cycles: 1802min/1802avg/1802max, from: begin
begin, (/usr/share/doc/systemtap/examples/profiling/latencytap.stp:131:1), hits: 1, cycles: 227979min/227979avg/227979max, from: begin
Pass 5: run failed.  Try again with another '--vp 00001' option.

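It failed! The cause is a lock timeout: stap's global variables are protected by locks, and acquiring them timed out. Knowing the cause makes things easy, so let's patch it!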

$ diff -up translate.cxx.orig  translate.cxx        
--- translate.cxx.orig     2011-03-22 21:26:52.000000000 +0800
+++ /translate.cxx     2011-03-29 20:31:28.000000000 +0800
@@ -5802,10 +5802,10 @@ translate_pass (systemtap_session& s)
       s.op->newline() << "#define MAXACTION_INTERRUPTIBLE (MAXACTION * 10)";
       s.op->newline() << "#endif";
       s.op->newline() << "#ifndef TRYLOCKDELAY";
-      s.op->newline() << "#define TRYLOCKDELAY 10 /* microseconds */";
+      s.op->newline() << "#define TRYLOCKDELAY 50 /* microseconds */";
       s.op->newline() << "#endif";
       s.op->newline() << "#ifndef MAXTRYLOCK";
-      s.op->newline() << "#define MAXTRYLOCK 100 /* 1 millisecond total */";
+      s.op->newline() << "#define MAXTRYLOCK 500 /* 1 millisecond total */";
       s.op->newline() << "#endif";
       s.op->newline() << "#ifndef MAXMAPENTRIES";
       s.op->newline() << "#define MAXMAPENTRIES 2048";
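
To see what the patch buys, the retry budget can be worked out from the two macros. The sketch below assumes the worst-case wait per lock is simply TRYLOCKDELAY multiplied by MAXTRYLOCK, which is what the original "1 millisecond total" comment implies; the function name is made up for illustration.

```python
def lock_wait_budget_us(trylock_delay_us: int, max_trylock: int) -> int:
    """Worst-case time spent retrying one lock before the probe is skipped."""
    return trylock_delay_us * max_trylock


before = lock_wait_budget_us(10, 100)   # stock values from the diff
after = lock_wait_budget_us(50, 500)    # patched values from the diff
print(f"stock:   {before} us = {before / 1000:g} ms")
print(f"patched: {after} us = {after / 1000:g} ms")
```

Under that assumption the budget grows from 1 ms to 25 ms, so the "1 millisecond total" comment left on the patched MAXTRYLOCK line no longer matches the values. Both macros are wrapped in #ifndef in the quoted source, so it may also be possible to set them with -D on the stap command line, the same way MAXSKIPPED is passed later in this post, instead of rebuilding; that is an inference from the snippet and was not tried here.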
 
# Build and install, then try again
$ sudo stap  --all-modules /usr/share/doc/systemtap/examples/profiling/latencytap.stp  
ERROR: probe overhead exceeded threshold
WARNING: Number of errors: 1, skipped probes: 0
Pass 5: run failed.  Try again with another '--vp 00001' option.
 
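# Another failure. This time the cause is "probe overhead exceeded threshold". Reading the code shows that the script's overhead is too high and exceeds the allowed load, and that STP_NO_OVERLOAD lifts this limit.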
 
# Try once more
$ sudo  stap -DSTP_NO_OVERLOAD --all-modules -DMAXSKIPPED=1024 /usr/share/doc/systemtap/examples/profiling/latencytap.stp
 
Reason                                  Count  Average(us)  Maximum(us) Percent%
Userspace lock contention                 345     16409195     83258717      45%
                                         1453       867513     60231852      10%
                                           95      7391754     33821926       5%
migration() kernel thread                1733       402701      3571412       5%
                                         7239        87993       401552       5%
Reading from a pipe                       212      2922207     52151180       4%
                                          142      2267850     17990214       2%
                                          108      2457247      7494331       2%
Waking ksoftirqd                           16     16082822     59266312       2%
Waiting for event (select)                 99      2113310     28510974       1%
kjournald() kernel thread                 148      1313447     13983084       1%
Application requested delay                94      1059898     10011409       0%
                                           41      2391993      7618788       0%
Waiting for event (select)                 38      2259444     29057362       0%
                                          719        92947       584944       0%
Waiting for event (poll)                    1     57582711     57582711       0%
Application requested delay                 3     19030709     36000553       0%
Waiting for event (select)                 39      1341880      5847683       0%
                                           34       936628      6649350       0%
                                            5      6163603     10008484       0%
...

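This time we get results, a small victory. On a busy system, though, the script consumes a lot of resources, which is unpleasant. Fortunately it can report latency for a single process: just add the -x option after latencytap.stp.

The script was presumably designed to take a process ID, but it ends up comparing against the thread ID. That is a bug!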
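
The distinction matters because the `pid` field of a kernel task is a per-thread ID, while what userspace calls the process ID is shared by every thread of the process. A small sketch of the difference, assuming Python 3.8 or later for `threading.get_native_id()`:

```python
import os
import threading


def ids_seen_by_worker():
    """Run one worker thread and return the (pid, tid) it observes."""
    seen = {}

    def worker():
        seen["pid"] = os.getpid()                # process ID, shared by all threads
        seen["tid"] = threading.get_native_id()  # thread ID, unique per thread

    t = threading.Thread(target=worker)
    t.start()
    t.join()
    return seen["pid"], seen["tid"]


main_pid, main_tid = os.getpid(), threading.get_native_id()
worker_pid, worker_tid = ids_seen_by_worker()
print("main:  ", main_pid, main_tid)
print("worker:", worker_pid, worker_tid)
```

Both lines show the same process ID but different thread IDs. A filter that compares the value given to -x against the per-thread ID therefore only matches the main thread of a multithreaded program such as MySQL and misses every worker thread.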

Let's fix it:

$ diff -u latencytap.stp.orig  latencytap.stp 
---  latencytap.stp.orig    2011-02-17 22:02:40.000000000 +0800
+++ latencytap.stp     2011-03-29 20:43:51.000000000 +0800
@@ -15,7 +15,7 @@
 global this_sleep;
 global debug = 0;
  
-function log_event:long (p:long) { return (!traced_pid || traced_pid == p) }
+function log_event:long (p:long) { return (!traced_pid || traced_pid == task_pid(p)) }
  
 #func names from hex addresses
 function func_backtrace:string (ips:string)
@@ -50,14 +50,14 @@
   # check to see if task is in appropriate state:
   # TASK_INTERRUPTIBLE      1
   # TASK_UNINTERRUPTIBLE    2
-  if (log_event($p->pid) && (s & 3)) {
+  if (log_event($p) && (s & 3)) {
     dequeue[$p] = gettimeofday_us();
   }
 }
  
 probe kernel.trace("activate_task") !,
       kernel.function("activate_task") {
-  if (!log_event($p->pid)) next
+  if (!log_event($p)) next
  
   a = gettimeofday_us()
   d = dequeue[$p]
 
# Try once more
$ sudo stap  --all-modules /usr/share/doc/systemtap/examples/profiling/latencytap.stp -x $$
...
 
# If the Reason column comes out blank, change debug=0 to debug=1 in latencytap.stp

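Now we are finally happy: old kernels use the systemtap version, new kernels use the in-kernel version, and the world is in harmony!

Diagnosing our production MySQL showed that most of the time goes to mutex lock contention. As I said before, I will deal with you; just wait and see!

Have fun!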


