On Wed, Aug 27, 2008 at 09:30:52AM +0100, Steven Whitehouse wrote:
> There are a few things to check. Firstly compare /proc/slabinfo on a
> slow node with that on a node running at normal speed. That will tell
> you if there is a problem with memory leaking or not being reclaimed
> properly.

Nothing interesting at a quick glance over slabinfo. The active node
has consumed much more memory than the inactive one, which would seem
normal since the busted node hasn't done any work in hours. Anything
large is expected - icache, dcache, buffer_head. The active node has
more objects than the busted one.

> If the node seems stuck, then try and echo t >/proc/sysrq-trigger and
> look at the backtraces of any process which has called into gfs2 to see
> where they are waiting. Also a dump of the glocks (you'll need to have
> debugfs mounted) on all nodes should then allow you to work out whether
> something on the stuck nodes is waiting for something on one of the
> other nodes. Sometimes its useful to look at the DLM locks as well.

Okay - processes that are hung due to I/O on the GFS2 filesystem all
have a similar call stack:

=======================
ls            D C981824A  2928 11112  11000 (NOTLB)
       e3065e48 00200086 f8cc4b9d c981824a 00008773 f5126a80 00000008 f5876550
       c20e7550 c990b3d9 00008773 000f318f 00000001 f587665c c20049e0 00000044
       f8c4b10b f5327ac0 f8c4b83e ffffffff 00000000 00000000 e3065e74 00000000
Call Trace:
 [<f8cc4b9d>] put_rsb+0x27/0x36 [dlm]
 [<f8c4b10b>] gdlm_ast+0x0/0x2 [lock_dlm]
 [<f8c4b83e>] gdlm_bast+0x0/0x76 [lock_dlm]
 [<f8d21c99>] just_schedule+0x5/0x8 [gfs2]
 [<c0604d68>] __wait_on_bit+0x33/0x58
 [<f8d21c94>] just_schedule+0x0/0x8 [gfs2]
 [<f8d21c94>] just_schedule+0x0/0x8 [gfs2]
 [<c0604def>] out_of_line_wait_on_bit+0x62/0x6a
 [<c0436076>] wake_bit_function+0x0/0x3c
 [<f8d21c90>] wait_on_holder+0x27/0x2b [gfs2]
 [<f8d22e32>] glock_wait_internal+0xdb/0x1ec [gfs2]
 [<f8d230b1>] gfs2_glock_nq+0x16e/0x18e [gfs2]
 [<f8d24177>] gfs2_glock_nq_atime+0x164/0x2de [gfs2]
 [<f8d2b7dd>] gfs2_readdir+0x47/0x8b [gfs2]
 [<c047f754>] filldir64+0x0/0xc5
 [<f8d2416f>] gfs2_glock_nq_atime+0x15c/0x2de [gfs2]
 [<c047f935>] vfs_readdir+0x63/0x8d
 [<c047f754>] filldir64+0x0/0xc5
 [<c047f9c2>] sys_getdents64+0x63/0xa5
 [<c0404eff>] syscall_call+0x7/0xb
=======================
python        D 2A0CCE0D  1676 10551  10175 (NOTLB)
       f468fd7c 00000082 00000096 2a0cce0d 00000828 00000001 00000009 f5023550
       c20e7550 2a0ce9ae 00000828 00001ba1 00000001 f502365c c20049e0 f40af2c4
       f8cc4b9d 00000000 f40af2c0 ffffffff 00000000 00000000 f468fda8 00000000
Call Trace:
 [<f8cc4b9d>] put_rsb+0x27/0x36 [dlm]
 [<f8d21c99>] just_schedule+0x5/0x8 [gfs2]
 [<c0604d68>] __wait_on_bit+0x33/0x58
 [<f8d21c94>] just_schedule+0x0/0x8 [gfs2]
 [<f8d21c94>] just_schedule+0x0/0x8 [gfs2]
 [<c0604def>] out_of_line_wait_on_bit+0x62/0x6a
 [<c0436076>] wake_bit_function+0x0/0x3c
 [<f8d21c90>] wait_on_holder+0x27/0x2b [gfs2]
 [<f8d22e32>] glock_wait_internal+0xdb/0x1ec [gfs2]
 [<f8d230b1>] gfs2_glock_nq+0x16e/0x18e [gfs2]
 [<f8d2e911>] gfs2_permission+0x69/0xb4 [gfs2]
 [<f8d2e90a>] gfs2_permission+0x62/0xb4 [gfs2]
 [<f8d2e8a8>] gfs2_permission+0x0/0xb4 [gfs2]
 [<c047b557>] permission+0x78/0xb5
 [<c047c9c0>] __link_path_walk+0x141/0xd33
 [<f8d23322>] gfs2_glock_dq+0x9e/0xb2 [gfs2]
 [<c048d67a>] __mark_inode_dirty+0x13d/0x14f
 [<c047d5fb>] link_path_walk+0x49/0xbd
 [<c044ae04>] audit_syscall_entry+0x11c/0x14e
 [<c047d9c8>] do_path_lookup+0x20e/0x25e
 [<c047ded5>] sys_mkdirat+0x36/0xb6
 [<c044ae04>] audit_syscall_entry+0x11c/0x14e
 [<c047df64>] sys_mkdir+0xf/0x13
 [<c0404eff>] syscall_call+0x7/0xb
=======================

I've got dumps of the glocks from debugfs, but I'm not really familiar
enough with GFS to understand what I'm reading. I tried to file a bug
in RH Bugzilla, but am getting 503 errors. I've posted the glock dumps
here:

http://kallisti.us/~ross/working-glocks
http://kallisti.us/~ross/broken-glocks

Can you point me in the direction of a document that explains what the
various things in the output mean?

-- 
Ross Vandegrift
ross@xxxxxxxxxxx

"The good Christian should beware of mathematicians, and all those who
make empty prophecies. The danger already exists that the mathematicians
have made a covenant with the devil to darken the spirit and to confine
man in the bonds of Hell."
	--St. Augustine, De Genesi ad Litteram, Book II, xviii, 37
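
A rough sketch for summarising traces like the two above when there are
many hung tasks to compare: it pulls the D-state tasks out of saved
sysrq-t output and lists the gfs2 functions that appear in every one of
them. The names blocked_tasks and common_frames are made up for
illustration, and the task header layout (comm, state, PC, free stack,
pid, parent pid) is assumed from the 2.6-era format shown in this
message; other kernels print it differently.

```python
import re

# Task header as it appears in this message, e.g.
#   "ls  D C981824A  2928 11112  11000 (NOTLB)"
# Assumed field order: comm, state, saved PC, free stack, pid, parent pid.
TASK_RE = re.compile(
    r"(?P<comm>\S+)\s+(?P<state>[A-Z])\s+[0-9A-Fa-f]{8,16}\s+\d+\s+"
    r"(?P<pid>\d+)\s+\d+(?:\s+\d+)*\s+\((?:NOTLB|L-TLB)\)"
)
# Stack frame, e.g. "[<f8d230b1>] gfs2_glock_nq+0x16e/0x18e [gfs2]".
# The trailing "[module]" is absent for functions built into the kernel.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?"
)


def blocked_tasks(dump):
    """Return (comm, pid, frames) for every task in state D.

    frames is a list of (function, module_or_None) in the order printed.
    """
    headers = list(TASK_RE.finditer(dump))
    tasks = []
    for i, hdr in enumerate(headers):
        # A task's frames run up to the next task header (or end of dump).
        end = headers[i + 1].start() if i + 1 < len(headers) else len(dump)
        if hdr.group("state") != "D":
            continue
        frames = [(m.group(1), m.group(2))
                  for m in FRAME_RE.finditer(dump, hdr.end(), end)]
        tasks.append((hdr.group("comm"), int(hdr.group("pid")), frames))
    return tasks


def common_frames(tasks, module="gfs2"):
    """Functions from `module` present in every task's trace.

    Order follows the first task's trace; duplicates are dropped.
    """
    per_task = [[func for func, mod in frames if mod == module]
                for _, _, frames in tasks]
    if not per_task:
        return []
    shared = set(per_task[0]).intersection(*per_task[1:])
    result = []
    for func in per_task[0]:
        if func in shared and func not in result:
            result.append(func)
    return result


if __name__ == "__main__":
    import sys
    found = blocked_tasks(sys.stdin.read())
    for comm, pid, _ in found:
        print("blocked: %s (pid %d)" % (comm, pid))
    print("common gfs2 frames:", ", ".join(common_frames(found)))
```

Fed the two traces above, this should report just_schedule,
wait_on_holder, glock_wait_internal and gfs2_glock_nq as common to both,
i.e. both tasks are sitting in gfs2_glock_nq waiting for a glock.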