Original article. When reposting, please credit: reposted from 系統技術非業餘研究
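latencytop: a deep look at latency on your Linux system
When tuning a system or tracking down a problem, we often find that a multithreaded program runs inefficiently without knowing where the problem lies. All we know is that there are a lot of context switches; why they happen and who causes them, we cannot tell. Context switches can be watched with a tool such as dstat, for example: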
----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  9   2  87   2   0   1|7398k   31M|   0     0 | 9.8k   11k|  16k   64k
 20   4  69   3   0   4|  26M   56M|  34M  172M|   0     0 |  61k  200k
 21   5  64   6   0   3|  26M  225M|  35M  175M|   0     0 |  75k  216k
 21   5  66   4   0   4|  25M  119M|  34M  173M|   0     0 |  66k  207k
 19   4  68   5   0   3|  23M   56M|  33M  166M|   0     0 |  60k  197k
$ sudo stap -e 'global cnt; probe scheduler.cpu_on {cnt<<<1;} probe timer.s(1){printf("%d\n", @count(cnt)); delete cnt;}'
That is around 200k context switches per second. Can anyone tell me what is going on? Well, latencytop to the rescue!
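For machines without dstat or systemtap, the same number can be approximated from the cumulative `ctxt` counter in /proc/stat. The following is a minimal Python sketch, not part of the original toolchain; the function names are made up for illustration.

```python
import time


def parse_ctxt(proc_stat_text):
    """Return the cumulative context-switch count from /proc/stat text."""
    for line in proc_stat_text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] == "ctxt":
            return int(fields[1])
    raise ValueError("no 'ctxt' line found")


def switches_per_second(before, after, interval_s):
    """Average context switches per second between two counter samples."""
    return (after - before) / interval_s


def sample(interval_s=1.0, path="/proc/stat"):
    """Read the counter twice, interval_s seconds apart, and return the rate."""
    with open(path) as f:
        first = parse_ctxt(f.read())
    time.sleep(interval_s)
    with open(path) as f:
        second = parse_ctxt(f.read())
    return switches_per_second(first, second, interval_s)
```

Calling `sample()` on a Linux box returns a system-wide rate comparable to the csw column of dstat. Like dstat, it says nothing about who is switching or why.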
Its official site: http://www.latencytop.org/
Skipping audio, slower servers, everyone knows the symptoms of latency. But to know what’s going on in the system, what’s causing the latency, how to fix it… that’s a hard question without good answers right now.
LatencyTOP is a Linux* tool for software developers (both kernel and userspace), aimed at identifying where in the system latency is happening, and what kind of operation/action is causing the latency to happen so that the code can be changed to avoid the worst latency hiccups.
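It is one of two performance viewers contributed by Intel; the other is powertop. Both are very good tools.
At every context switch in the kernel, latencytop records the kernel stack of the process being switched out, then matches the functions on that stack to decide what caused the switch. It also keeps a list of functions for several dozen scenarios that commonly cause switches, which makes it easier to pin down the problem when diagnosing a system.
latencytop has two parts, a kernel part and an application part. The kernel part collects the call stacks and exposes them through /proc; the application part displays them.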
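The matching step can be pictured with a few lines of Python. This is only a sketch of the idea, not latencytop's actual code: the table holds four entries copied from latencytop.trans, and it assumes that when several functions on the stack match, the entry with the larger priority number wins.

```python
# A few (priority, description) entries copied from latencytop.trans.
REASONS = {
    "vfs_read": (1, "Reading from file"),
    "pipe_read": (3, "Reading from a pipe"),
    "sys_futex": (3, "Userspace lock contention"),
    "do_page_fault": (5, "Page fault"),
}


def classify(backtrace):
    """Map a list of kernel function names to a human-readable reason.

    Assumption: the matching entry with the largest priority number wins.
    Functions that are not in the table are ignored.
    """
    best = None
    for func in backtrace:
        entry = REASONS.get(func)
        if entry is not None and (best is None or entry[0] > best[0]):
            best = entry
    return best[1] if best else "(unknown)"
```

With this table, `classify(["pipe_read", "vfs_read", "sys_read"])` returns "Reading from a pipe": the more specific pipe_read entry outranks the generic vfs_read one.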
Here is a screenshot of it at work: [screenshot]
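latencytop was merged into the kernel in 2.6.26 and has been part of it since; you only need to enable the option (CONFIG_LATENCYTOP) when building the kernel. How can you check?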
$ cat /proc/latency_stats
Latency Top version : v0.1
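If you see this, you are set. Regrettably, RHEL6 ships the latencytop application but does not enable the kernel build option. What are we supposed to do with that?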
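Once the kernel part is there, the file can also be read directly instead of through the latencytop application. The Python sketch below assumes the layout I remember from the kernel source, which the article itself does not show: after the version header, each line holds a hit count, the total latency in microseconds, the maximum latency in microseconds, and then the function names of the backtrace. Treat that layout and the field names as assumptions.

```python
def parse_latency_stats(text):
    """Parse /proc/latency_stats text into a list of records.

    Assumed line layout: <count> <total_us> <max_us> <func> <func> ...
    The "Latency Top version" header and blank lines are skipped.
    """
    records = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3 or not all(f.isdigit() for f in fields[:3]):
            continue
        records.append({
            "count": int(fields[0]),
            "total_us": int(fields[1]),
            "max_us": int(fields[2]),
            "backtrace": fields[3:],
        })
    return records


def worst_first(records):
    """Order records by maximum latency, largest first."""
    return sorted(records, key=lambda r: r["max_us"], reverse=True)
```

Feeding it `open("/proc/latency_stats").read()` gives the raw material that the application part turns into readable reasons.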
On Ubuntu it can be installed like this:
$ apt-get install latencytop
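The silly thing is that latencytop starts a graphical interface by default, which we are not used to. We want the text interface, so let's do it ourselves!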
$ apt-get source latencytop
$ diff -up Makefile.orig Makefile
--- Makefile.orig 2011-03-29 20:10:29.025845447 +0800
+++ Makefile 2011-03-28 14:48:11.232318002 +0800
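Run make again and the text interface appears. For usage details see man latencytop.
fcicq says:
Just add the --nogui option. No need to recompile.
Thanks!
So how many latency causes does latencytop know about? Let latencytop.trans tell you. We can also edit this file ourselves to add new latency causes.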
$ cat /usr/share/latencytop/latencytop.trans
1 vfs_read Reading from file
1 vfs_write Writing to file
1 __mark_inode_dirty Marking inode dirty
1 vfs_readdir Reading directory content
1 vfs_unlink Unlinking file
1 blocking_notifier_call_chain Blocking notifier
1 lock_super Superblock lock contention
1 vfs_create Creating a file
1 KAS_ScheduleTimeout Binary AMD driver delay
1 firegl_lock_device Binary AMD driver delay
2 __bread Synchronous buffer read
2 do_generic_mapping_read Reading file data
2 sock_sendmsg Sending data over socket
2 do_sys_open Opening file
2 do_sys_poll Waiting for event (poll)
2 core_sys_select Waiting for event (select)
2 proc_reg_read Reading from /proc file
2 __pollwait Waiting for event (poll)
2 sys_fcntl FCNTL system call
2 scsi_error_handler SCSI error handler
2 proc_root_readdir Reading /proc directory
2 ksoftirqd Waking ksoftirqd
2 do_unlinkat Unlinking file
2 __wait_on_buffer Waiting for buffer IO to complete
2 pdflush pdflush() kernel thread
2 kjournald kjournald() kernel thread
2 blkdev_ioctl block device IOCTL
2 kauditd_thread kernel audit daemon
2 __filemap_fdatawrite_range fdatasync system call
2 do_sync_write synchronous write
2 kthreadd kthreadd kernel thread
2 usb_port_resume Waking up USB device
2 usb_autoresume_device Waking up USB device
2 kswapd kswapd() kernel thread
2 md_thread Raid resync kernel thread
2 i915_wait_request Waiting for GPU command to complete
2 request_module Loading a kernel module
3 tty_wait_until_sent Waiting for TTY to finish sending
3 pipe_read Reading from a pipe
3 pipe_write Writing to a pipe
3 pipe_wait Waiting for pipe data
3 read_block_bitmap Reading EXT3 block bitmaps
3 scsi_execute_req Executing raw SCSI command
3 sys_wait4 Waiting for a process to die
3 sr_media_change Checking for media change
3 sr_do_ioctl SCSI cdrom ioctl
3 sd_ioctl SCSI disk ioctl
3 sr_cd_check Checking CDROM media present
3 ext3_read_inode Reading EXT3 inode
3 htree_dirblock_to_tree Reading EXT3 directory htree
3 ext3_readdir Reading EXT3 directory
3 ext3_bread Synchronous EXT3 read
3 ext3_free_branches Unlinking file on EXT3
3 ext3_get_branch Reading EXT3 indirect blocks
3 ext3_find_entry EXT3: Looking for file
3 __ext3_get_inode_loc Reading EXT3 inode
3 ext3_delete_inode EXT3 deleting inode
3 sync_page Writing a page to disk
3 tty_poll Waiting for TTY data
3 tty_read Waiting for TTY input
3 tty_write Writing data to TTY
3 update_atime Updating inode atime
3 page_cache_sync_readahead Pagecache sync readahead
3 do_fork Fork() system call
3 sys_mkdirat Creating directory
3 lookup_create Creating file
3 inet_sendmsg Sending TCP/IP data
3 tcp_recvmsg Receiving TCP/IP data
3 link_path_walk Following symlink
3 path_walk Walking directory tree
3 sys_getdents Reading directory content
3 unix_stream_recvmsg Waiting for data on unix socket
3 ext3_mkdir EXT3: Creating a directory
3 journal_get_write_access EXT3: Waiting for journal access
3 synchronize_rcu Waiting for RCU
3 input_close_device Closing input device
3 mousedev_close_device Closing mouse device
3 mousedev_release Closing mouse device
3 mousedev_open Opening mouse device
3 kmsg_read Reading from dmesg
3 sys_futex Userspace lock contention
3 do_futex Userspace lock contention
3 vt_waitactive vt_waitactive IOCTL
3 acquire_console_sem Waiting for console access
3 filp_close Closing a file
3 sync_inode (f)syncing an inode to disk
3 ata_exec_internal_sg Executing internal ATA command
3 writeback_inodes Writing back inodes
3 ext3_orphan_add EXT3 adding orphan
3 ext3_mark_inode_dirty EXT3 marking inode dirty
3 ext3_unlink EXT3 unlinking file
3 ext3_create EXT3 Creating a file
3 log_do_checkpoint EXT3 journal checkpoint
3 generic_delete_inode Deleting an inode
3 proc_delete_inode Removing /proc file
3 do_truncate Truncating file
3 sys_execve Executing a program
3 journal_commit_transaction EXT3: committing transaction
3 __stop_machine_run Freezing the kernel (for module load)
3 sys_munmap unmapping memory
3 sys_mmap mmaping memory
3 sync_buffer Writing buffer to disk (synchronous)
3 inotify_inode_queue_event Inotify event
3 proc_lookup Looking up /proc file
3 generic_make_request Creating block layer request
3 get_request_wait Creating block layer request
3 alloc_page_vma Allocating a VMA
3 blkdev_direct_IO Direct block device IO
3 sys_mprotect mprotect() system call
3 shrink_icache_memory reducing inode cache memory footprint
3 vfs_stat_fd stat() operation
3 cdrom_open opening cdrom device
3 sys_epoll_wait Waiting for event (epoll)
3 sync_sb_inodes Syncing inodes
3 tcp_connect TCP/IP connect
3 ata_scsi_ioctl ATA/SCSI disk ioctl
3 do_rmdir Removing directory
3 vfs_rmdir Removing directory
3 sys_flock flock() on a file
3 usbdev_open opening USB device
3 lock_kernel Big Kernel Lock contention
3 blk_execute_rq Submitting block IO
3 scsi_cmd_ioctl SCSI ioctl command
3 acpi_ec_transaction ACPI hardware access
3 journal_get_undo_access Waiting for EXT3 journal undo operation
3 i915_irq_wait Waiting for GPU interrupt
3 i915_gem_throttle_ioctl Throttling GPU while waiting for commands
5 do_page_fault Page fault
5 handle_mm_fault Page fault
5 filemap_fault Page fault
5 sync_filesystems Syncing filesystem
5 sys_nanosleep Application requested delay
5 sys_pause Application requested delay
5 evdev_read Reading keyboard/mouse input
5 do_fsync fsync() on a file (type 'F' for details)
5 __log_wait_for_space Waiting for EXT3 journal space
The latency causes are very detailed.
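Since the file is meant to be edited, a small checker helps before adding entries of your own. The sketch below is a guess at a reasonable reader for the three-column layout shown in the listing (priority, kernel function, description). It assumes that whitespace separates the columns, that lines starting with `#` are comments, and that tab-separated output is acceptable to latencytop. None of this is confirmed by the article.

```python
def parse_trans(text):
    """Parse latencytop.trans text into (priority, function, description)."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or (assumed) comment
        parts = line.split(None, 2)
        if len(parts) != 3 or not parts[0].isdigit():
            continue  # not a well-formed entry
        entries.append((int(parts[0]), parts[1], parts[2]))
    return entries


def format_entry(priority, function, description):
    """Build one line for a new latency cause (tab-separated, assumed)."""
    return "%d\t%s\t%s" % (priority, function, description)
```

For example, `format_entry(3, "my_driver_wait", "Waiting for my driver")` produces a line that `parse_trans` reads back unchanged; `my_driver_wait` is a hypothetical function name used only for illustration.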
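At this point I have covered what I set out to cover. But wait: this needs a kernel of 2.6.26 or later, and most of our production systems run RHEL 5U4 with 2.6.18. How can we use it there?
This is where systemtap comes to the rescue, as always!
Since version 1.4, systemtap ships a latencytap.stp, also contributed by Intel. So let's try this poor man's latencytop.
Where is it?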
Systemtap translator/driver (version 1.5/0.137 non-git sources)
Copyright (C) 2005-2011 Red Hat, Inc. and others
This is free software; see the source for copying conditions.
enabled features: AVAHI LIBRPM LIBSQLITE3 NSS BOOST_SHARED_PTR TR1_UNORDERED_MAP NLS
$ ls -al /usr/share/doc/systemtap/examples/profiling/latencytap.stp
-rwxr-xr-x 1 chuba users 16240 Feb 17 22:02 /usr/share/doc/systemtap/examples/profiling/latencytap.stp
$ sudo stap -t --all-modules /usr/share/doc/systemtap/examples/profiling/latencytap.stp
ERROR: Skipped too many probes, check MAXSKIPPED or try again with stap -t for more details.
WARNING: Number of errors: 0, skipped probes: 101
WARNING: Skipped due to global 'dequeue' lock timeout: 2
WARNING: Skipped due to global 'this_sleep' lock timeout: 99
kernel.trace("deactivate_task")!, (/usr/share/doc/systemtap/examples/profiling/latencytap.stp:47:1), hits: 254, cycles: 680min/43327avg/2248467max, from: kernel.trace("deactivate_task")
kernel.trace("activate_task")!, (/usr/share/doc/systemtap/examples/profiling/latencytap.stp:58:1), hits: 255, cycles: 890min/502549avg/2271568max, from: kernel.trace("activate_task")
kernel.function("finish_task_switch@kernel/sched.c:1969")?, (/usr/share/doc/systemtap/examples/profiling/latencytap.stp:78:7), hits: 509, cycles: 213min/1002207avg/5382852max, from: kernel.function("finish_task_switch") from: scheduler.cpu_on
begin, (/usr/share/doc/systemtap/examples/profiling/latencytap.stp:123:1), hits: 1, cycles: 1802min/1802avg/1802max, from: begin
begin, (/usr/share/doc/systemtap/examples/profiling/latencytap.stp:131:1), hits: 1, cycles: 227979min/227979avg/227979max, from: begin
Pass 5: run failed. Try again with another '--vp 00001' option.
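It failed! The cause is a lock timeout: stap protects its global variables with locks, and acquiring them timed out here. Once the cause is known the fix is easy, so let's patch it!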
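Before the patch itself, a quick look at what it buys. The diff raises TRYLOCKDELAY from 10 to 50 and MAXTRYLOCK from 100 to 500. Taking the comments in the generated code at face value (the delay is in microseconds and a probe gives up after MAXTRYLOCK attempts), the worst-case wait per lock works out as below. This is a back-of-the-envelope sketch in Python, and the helper name is made up.

```python
def trylock_budget_us(delay_us, max_tries):
    """Worst-case wait for one global's lock, in microseconds.

    Assumes each failed attempt costs delay_us and that the probe is
    skipped after max_tries attempts, as the source comments suggest.
    """
    return delay_us * max_tries


stock = trylock_budget_us(10, 100)     # TRYLOCKDELAY=10, MAXTRYLOCK=100
patched = trylock_budget_us(50, 500)   # TRYLOCKDELAY=50, MAXTRYLOCK=500
print(stock / 1000.0, "ms ->", patched / 1000.0, "ms")  # 1.0 ms -> 25.0 ms
```

So the patched budget is 25 ms rather than the "1 millisecond total" that the unchanged comment on the MAXTRYLOCK line still claims. Both macros sit inside #ifndef guards, so passing -DTRYLOCKDELAY=50 -DMAXTRYLOCK=500 to stap might achieve the same without rebuilding; that is an untested assumption.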
$ diff -up translate.cxx.orig translate.cxx
--- translate.cxx.orig 2011-03-22 21:26:52.000000000 +0800
+++ /translate.cxx 2011-03-29 20:31:28.000000000 +0800
@@ -5802,10 +5802,10 @@ translate_pass (systemtap_session& s)
   s.op->newline() << "#define MAXACTION_INTERRUPTIBLE (MAXACTION * 10)";
   s.op->newline() << "#endif";
   s.op->newline() << "#ifndef TRYLOCKDELAY";
-  s.op->newline() << "#define TRYLOCKDELAY 10 /* microseconds */";
+  s.op->newline() << "#define TRYLOCKDELAY 50 /* microseconds */";
   s.op->newline() << "#endif";
   s.op->newline() << "#ifndef MAXTRYLOCK";
-  s.op->newline() << "#define MAXTRYLOCK 100 /* 1 millisecond total */";
+  s.op->newline() << "#define MAXTRYLOCK 500 /* 1 millisecond total */";
   s.op->newline() << "#endif";
   s.op->newline() << "#ifndef MAXMAPENTRIES";
   s.op->newline() << "#define MAXMAPENTRIES 2048";
$ sudo stap --all-modules /usr/share/doc/systemtap/examples/profiling/latencytap.stp
ERROR: probe overhead exceeded threshold
WARNING: Number of errors: 1, skipped probes: 0
Pass 5: run failed. Try again with another '--vp 00001' option.
$ sudo stap -DSTP_NO_OVERLOAD --all-modules -DMAXSKIPPED=1024 /usr/share/doc/systemtap/examples/profiling/latencytap.stp
Reason                       Count  Average(us)  Maximum(us)  Percent%
Userspace lock contention      345     16409195     83258717       45%
migration() kernel thread     1733       402701      3571412        5%
Reading from a pipe            212      2922207     52151180        4%
Waking ksoftirqd                16     16082822     59266312        2%
Waiting for event (select)      99      2113310     28510974        1%
kjournald() kernel thread      148      1313447     13983084        1%
Application requested delay     94      1059898     10011409        0%
Waiting for event (select)      38      2259444     29057362        0%
Waiting for event (poll)         1     57582711     57582711        0%
Application requested delay      3     19030709     36000553        0%
Waiting for event (select)      39      1341880      5847683        0%
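Some reasons appear more than once in this report, for example "Waiting for event (select)". To get one line per reason, the rows can be folded together. The Python sketch below assumes one row per line with the four numeric columns at the end, as in the header; the helper name and the dictionary keys are illustrative.

```python
def aggregate(report_text):
    """Fold rows that share a reason into one entry per reason.

    Returns {reason: {"count", "total_us", "max_us"}} in first-seen order.
    total_us is reconstructed as count * average for each row.
    """
    totals = {}
    for line in report_text.splitlines():
        parts = line.rsplit(None, 4)
        if len(parts) != 5 or not all(p.isdigit() for p in parts[1:4]):
            continue  # header line or malformed row
        reason = parts[0].strip()
        count, avg_us, max_us = int(parts[1]), int(parts[2]), int(parts[3])
        entry = totals.setdefault(
            reason, {"count": 0, "total_us": 0, "max_us": 0})
        entry["count"] += count
        entry["total_us"] += count * avg_us
        entry["max_us"] = max(entry["max_us"], max_us)
    return totals
```

Applied to the report above, the three select rows collapse into a single entry with a count of 176.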
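This time we got results, ha, a small moment of joy. On a busy system, however, this script uses a lot of resources, which is annoying. Fortunately the script can look at the latency of a single process: just add the -x option after latencytap.stp.
The script was presumably designed to take a process ID, but as written it compares against the thread ID. That is a bug!
Let's fix it: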
$ diff -u latencytap.stp.orig latencytap.stp
--- latencytap.stp.orig 2011-02-17 22:02:40.000000000 +0800
+++ latencytap.stp 2011-03-29 20:43:51.000000000 +0800
-function log_event:long (p:long) { return (!traced_pid || traced_pid == p) }
+function log_event:long (p:long) { return (!traced_pid || traced_pid == task_pid(p)) }
 function func_backtrace:string (ips:string)
-  if (log_event($p->pid) && (s & 3)) {
+  if (log_event($p) && (s & 3)) {
     dequeue[$p] = gettimeofday_us();
 probe kernel.trace("activate_task") !,
       kernel.function("activate_task") {
-  if (!log_event($p->pid)) next
+  if (!log_event($p)) next
$ sudo stap --all-modules /usr/share/doc/systemtap/examples/profiling/latencytap.stp -x $$
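The distinction behind this fix is visible from user space too. In the kernel every thread is a task with its own ID (the `pid` field the original script compared against), while the process ID that `-x` takes is shared by all threads of a process. The Python sketch below makes the difference visible; it assumes Python 3.8 or later for `threading.get_native_id()`, and the helper names are illustrative.

```python
import os
import threading


def ids():
    """Return (process ID, kernel thread ID) for the calling thread."""
    return os.getpid(), threading.get_native_id()


def ids_from_worker():
    """Run ids() in a second thread and return what that thread saw."""
    result = []
    worker = threading.Thread(target=lambda: result.append(ids()))
    worker.start()
    worker.join()
    return result[0]
```

Comparing `ids()` with `ids_from_worker()` shows the same process ID but a different thread ID. That is why the unpatched script, matching on the thread ID, would only trace one thread of a multithreaded target.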
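Now we are finally happy: old kernels use the systemtap version, new kernels use the in-kernel version, and the world is in harmony!
Diagnosing our production MySQL this way showed that most of the time goes to mutex lock contention. I told you I would deal with you. Just you wait!
Have fun!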