thanks for you assistance!
I reproduced the problem and I have done the tests you mentioned.
Regarding gndb:
gnbd_import -l tool reports "Open, Connected" state and gndb_export -L
on the gnbd server also shows all the hosts importing this partition.
The " cat /sys/class/gnbd/gnbd0/waittime" also shows no data pending
(returns -1).
Though in the message log there were some strange lines about gnbd failures appeared after the "killall httpd" command was issued:
gnbd (pid 5836: alogc.pl) got signal 9
gnbd0: Send control failed (result -4)
gnbd (pid 5836: alogc.pl) got signal 15
gnbd0: Send control failed (result -4)
gnbd (pid 5911: httpd) got signal 15
gnbd0: Send control failed (result -4)
gnbd (pid 5897: httpd) got signal 15
gnbd0: Send control failed (result -4)
gnbd (pid 5915: httpd) got signal 15
gnbd0: Send control failed (result -4)
gnbd (pid 5911: httpd) got signal 15
gnbd0: Send control failed (result -4)
Regarding ps info on wchan - it looks like this:
ps axl info on IO-waiting processes:
F UID PID PPID PRI NI VSZ RSS WCHAN STAT TTY TIME COMMAND
1 0 51 6 15 0 0 0 wait_o D ? 0:00 [pdflush]
1 0 5771 6 5 -10 0 0 lock_p D< ? 0:00 [lock_dlm1]
1 0 5776 1 15 0 0 0 - D ? 0:00 [gfs_logd]
1 0 5777 1 15 0 0 0 - D ? 0:00 [gfs_quotad]
1 0 5778 1 15 0 0 0 - D ? 0:00 [gfs_inoded]
5 0 5892 1 16 0 23440 912 - Ds ? 0:00 /usr/system/apache/bin/httpd
5 48 5895 5892 17 0 23472 984 glock_ D ? 0:00 /usr/system/apache/bin/httpd
5 48 5896 5892 17 0 23440 980 glock_ D ? 0:00 /usr/system/apache/bin/httpd
5 48 5897 5892 17 0 23440 920 glock_ D ? 0:00 /usr/system/apache/bin/httpd
5 48 5911 5892 17 0 23440 920 glock_ D ? 0:00 /usr/system/apache/bin/httpd
5 48 5915 5892 17 0 23440 920 wait_o D ? 0:00 /usr/system/apache/bin/httpd
4 0 5930 2547 34 19 52780 992 wait_o DN ? 0:00 /bin/sh -c run-parts /etc/cron.da
ily
Not truncated version of the "wchan" field for all the IO-waiting processes is below:
bash-3.00# ps ax -o pid,state,wchan:32,ucomm |grep D
PID S WCHAN COMMAND
51 D wait_on_buffer pdflush
5771 D lock_page lock_dlm1
5776 D - gfs_logd
5777 D - gfs_quotad
5778 D - gfs_inoded
5892 D - httpd
5895 D glock_wait_internal httpd
5896 D glock_wait_internal httpd
5897 D glock_wait_internal httpd
5911 D glock_wait_internal httpd
5915 D wait_on_buffer httpd
5930 D wait_on_buffer sh
Finally I have taken the "sysrq" info on these processes.
pdflush D ffffffff8014aabc 0 51 6 53 50 (L-TLB)
00000100dfc3dc78 0000000000000046 000001011bd3e980 000001010fc11f00
0000000000000216 ffffffffa0042916 000001011aca60c0 0000000000000008
000001011fdef7f0 0000000000000dfa
Call Trace:<ffffffffa0042916>{:dm_mod:dm_request+396} <ffffffff8014aabc>{keventd_create_kthread+0}
<ffffffff803053ef>{io_schedule+38} <ffffffff80178c4c>{__wait_on_buffer+125}
<ffffffff80178ad2>{bh_wake_function+0} <ffffffff80178ad2>{bh_wake_function+0}
<ffffffffa0235c5d>{:gfs:gfs_logbh_wait+49} <ffffffffa024a6a6>{:gfs:disk_commit+794}
<ffffffffa024a877>{:gfs:log_refund+111} <ffffffffa024ad8e>{:gfs:log_flush_internal+510}
<ffffffff8017d682>{sync_supers+167} <ffffffff8015e310>{wb_kupdate+36}
<ffffffff8015edb4>{pdflush+323} <ffffffff8015e2ec>{wb_kupdate+0}
<ffffffff8015ec71>{pdflush+0} <ffffffff8014aa93>{kthread+200}
<ffffffff80110e17>{child_rip+8} <ffffffff8014aabc>{keventd_create_kthread+0}
<ffffffff8014a9cb>{kthread+0} <ffffffff80110e0f>{child_rip+0}
lock_dlm1 D 000001000c0096e0 0 5771 6 5772 5766 (L-TLB)
0000010113ce3c58 0000000000000046 0000001000000000 0000010000000069
000001011420b030 0000000000000069 000001000c00a940 000000010000eb10
000001011a887030 0000000000001cae
Call Trace:<ffffffff802496d4>{__generic_unplug_device+19} <ffffffff803053ef>{io_schedule+38}
<ffffffff80159215>{__lock_page+191} <ffffffff80158cfa>{page_wake_function+0}
<ffffffff80158cfa>{page_wake_function+0} <ffffffff80163125>{truncate_inode_pages+519}
<ffffffffa0258f35>{:gfs:gfs_inval_page+63} <ffffffffa02401b5>{:gfs:drop_bh+233}
<ffffffffa0242138>{:gfs:gfs_glock_cb+194} <ffffffffa02869dd>{:lock_dlm:dlm_async+1989}
<ffffffff801333c8>{default_wake_function+0} <ffffffff8014aabc>{keventd_create_kthread+0}
<ffffffffa0286218>{:lock_dlm:dlm_async+0} <ffffffff8014aabc>{keventd_create_kthread+0}
<ffffffff8014aa93>{kthread+200} <ffffffff80110e17>{child_rip+8}
<ffffffff8014aabc>{keventd_create_kthread+0} <ffffffff8014a9cb>{kthread+0}
<ffffffff80110e0f>{child_rip+0}
gfs_logd D 0000000000000000 0 5776 1 5777 5775 (L-TLB)
000001011387fe38 0000000000000046 0000000000000000 ffffffff80304a85
000001011387fe58 ffffffff80304add ffffffff803cca80 0000000000000246
00000101143fe030 00000000000000b5
Call Trace:<ffffffff80304a85>{thread_return+0} <ffffffff80304add>{thread_return+88}
<ffffffffa023e8d3>{:gfs:lock_on_glock+112} <ffffffff8030565b>{__down_write+134}
<ffffffffa0249cdb>{:gfs:gfs_ail_empty+56} <ffffffffa0233930>{:gfs:gfs_logd+77}
<ffffffff80110e17>{child_rip+8} <ffffffff801cccff>{dummy_d_instantiate+0}
<ffffffffa02338e3>{:gfs:gfs_logd+0} <ffffffff80110e0f>{child_rip+0}
gfs_quotad D 0000000000000000 0 5777 1 5778 5776 (L-TLB)
0000010113881e98 0000000000000046 0000000000000000 ffffffff80304a85
0000010113881eb8 ffffffff80304add 000001011ff87030 0000000100000074
000001011430f7f0 0000000000000128
Call Trace:<ffffffff80304a85>{thread_return+0} <ffffffff80304add>{thread_return+88}
<ffffffff8030565b>{__down_write+134} <ffffffffa025b55a>{:gfs:gfs_quota_sync+226}
<ffffffffa0233ab1>{:gfs:gfs_quotad+127} <ffffffff80110e17>{child_rip+8}
<ffffffff801cccff>{dummy_d_instantiate+0} <ffffffff801cccff>{dummy_d_instantiate+0}
<ffffffff801cccff>{dummy_d_instantiate+0} <ffffffffa0233a32>{:gfs:gfs_quotad+0}
<ffffffff80110e0f>{child_rip+0}
gfs_inoded D 0000000000000000 0 5778 1 5807 5777 (L-TLB)
0000010113883e98 0000000000000046 000001011e2937f0 000001000c0096e0
0000000000000000 ffffffff80304a85 0000010113883ec8 0000000180304add
000001011e2937f0 00000000000000c2
Call Trace:<ffffffff80304a85>{thread_return+0} <ffffffff8030565b>{__down_write+134}
<ffffffffa026160d>{:gfs:unlinked_find+115} <ffffffffa0261c6c>{:gfs:gfs_unlinked_dealloc+25}
<ffffffffa0233bd5>{:gfs:gfs_inoded+66} <ffffffff80110e17>{child_rip+8}
<ffffffffa0233b93>{:gfs:gfs_inoded+0} <ffffffff80110e0f>{child_rip+0}
httpd D ffffffff80304190 0 5892 1 5893 5826 (NOTLB)
0000010111b75bf8 0000000000000002 0000000000000001 0000000000000001
0000000000000000 0000000000000000 0000010114667980 0000000111b75bc0
00000101143fe7f0 00000000000009ad
Call Trace:<ffffffff80303d6f>{__down+147} <ffffffff801333c8>{default_wake_function+0}
<ffffffff8015b3a2>{generic_file_write_nolock+158} <ffffffff80305780>{__down_failed+53}
<ffffffffa0236986>{:gfs:.text.lock.dio+95} <ffffffffa0260e4c>{:gfs:gfs_trans_add_bh+205}
<ffffffffa0253efc>{:gfs:do_write_buf+1138} <ffffffffa0252db3>{:gfs:walk_vm+278}
<ffffffffa0253a8a>{:gfs:do_write_buf+0} <ffffffffa0253a8a>{:gfs:do_write_buf+0}
<ffffffffa025415b>{:gfs:__gfs_write+201} <ffffffff80177c60>{vfs_write+207}
<ffffffff80177d48>{sys_write+69} <ffffffff801101c6>{system_call+126}
httpd D 0000010110ad7d48 0 5895 5892 5896 5893 (NOTLB)
0000010110ad7bd8 0000000000000006 000001011b16e030 0000000000000075
0000010117002030 0000000000000075 000001000c002940 0000000000000001
00000101170027f0 000000000001300e
Call Trace:<ffffffff80131d1d>{try_to_wake_up+863} <ffffffff80304cbd>{wait_for_completion+167}
<ffffffff801333c8>{default_wake_function+0} <ffffffff801333c8>{default_wake_function+0}
<ffffffffa023f4b1>{:gfs:glock_wait_internal+350} <ffffffffa023fce6>{:gfs:gfs_glock_nq+961}
<ffffffffa023ff11>{:gfs:gfs_glock_nq_init+20} <ffffffffa0258b7b>{:gfs:gfs_private_nopage+84}
<ffffffff80168211>{do_no_page+1003} <ffffffff80167b13>{do_wp_page+948}
<ffffffff8016858f>{handle_mm_fault+343} <ffffffff80142a06>{get_signal_to_deliver+1118}
<ffffffff801236d2>{do_page_fault+518} <ffffffff80304a85>{thread_return+0}
<ffffffff80304add>{thread_return+88} <ffffffff80110c61>{error_exit+0}
httpd D 0000010110b5bd48 0 5896 5892 5897 5895 (NOTLB)
0000010110b5bbd8 0000000000000002 00000101170027f0 0000000000000075
00000101114787f0 0000000000000075 000001000c002940 0000000000000001
0000010117002030 000000000000fb3e
Call Trace:<ffffffff80131d1d>{try_to_wake_up+863} <ffffffff80304cbd>{wait_for_completion+167}
<ffffffff801333c8>{default_wake_function+0} <ffffffff801333c8>{default_wake_function+0}
<ffffffffa023f4b1>{:gfs:glock_wait_internal+350} <ffffffffa023fce6>{:gfs:gfs_glock_nq+961}
<ffffffffa023ff11>{:gfs:gfs_glock_nq_init+20} <ffffffffa0258b7b>{:gfs:gfs_private_nopage+84}
<ffffffff80168211>{do_no_page+1003} <ffffffff80167b13>{do_wp_page+948}
<ffffffff8016858f>{handle_mm_fault+343} <ffffffff80142a06>{get_signal_to_deliver+1118}
<ffffffff801236d2>{do_page_fault+518} <ffffffff802a3445>{sys_accept+327}
<ffffffff80182e88>{pipe_read+26} <ffffffff80110c61>{error_exit+0}
httpd D 0000000000000000 0 5897 5892 5911 5896 (NOTLB)
0000010110119bd8 0000000000000006 0000010117002030 0000000000000075
0000010117002030 0000000000000075 000001000c00a940 000000001b16e030
00000101114787f0 000000000000fbe0
Call Trace:<ffffffff802496d4>{__generic_unplug_device+19} <ffffffff80304cbd>{wait_for_completion+167}
<ffffffff801333c8>{default_wake_function+0} <ffffffff801333c8>{default_wake_function+0}
<ffffffffa023f4b1>{:gfs:glock_wait_internal+350} <ffffffffa023fce6>{:gfs:gfs_glock_nq+961}
<ffffffffa023ff11>{:gfs:gfs_glock_nq_init+20} <ffffffffa0258b7b>{:gfs:gfs_private_nopage+84}
<ffffffff80168211>{do_no_page+1003} <ffffffff80167b13>{do_wp_page+948}
<ffffffff8016858f>{handle_mm_fault+343} <ffffffff80142a06>{get_signal_to_deliver+1118}
<ffffffff801236d2>{do_page_fault+518} <ffffffff80304a85>{thread_return+0}
<ffffffff80304add>{thread_return+88} <ffffffff80110c61>{error_exit+0}
httpd D 00000101100c3d48 0 5911 5892 5915 5897 (NOTLB)
00000101100c3bd8 0000000000000002 000001011420b7f0 0000000000000075
00000101170027f0 0000000000000075 000001000c002940 0000000000000000
000001011b16e030 000000000000187e
Call Trace:<ffffffff80131d1d>{try_to_wake_up+863} <ffffffff80304cbd>{wait_for_completion+167}
<ffffffff801333c8>{default_wake_function+0} <ffffffff801333c8>{default_wake_function+0}
<ffffffffa023f4b1>{:gfs:glock_wait_internal+350} <ffffffffa023fce6>{:gfs:gfs_glock_nq+961}
<ffffffffa023ff11>{:gfs:gfs_glock_nq_init+20} <ffffffffa0258b7b>{:gfs:gfs_private_nopage+84}
<ffffffff80168211>{do_no_page+1003} <ffffffff80167b13>{do_wp_page+948}
<ffffffff8016858f>{handle_mm_fault+343} <ffffffff80142a06>{get_signal_to_deliver+1118}
<ffffffff801236d2>{do_page_fault+518} <ffffffff80304a85>{thread_return+0}
<ffffffff80304add>{thread_return+88} <ffffffff80110c61>{error_exit+0}
httpd D 0000000000006a36 0 5915 5892 5911 (NOTLB)
00000101180f7ad8 0000000000000006 0000000000002706 ffffffffa020c791
0000000000000000 0000000000000000 0000030348ac8c1c 0000000114a217f0
0000010114c997f0 000000000000076a
Call Trace:<ffffffffa020c791>{:dlm:lkb_swqueue+43} <ffffffff803053ef>{io_schedule+38}
<ffffffff80178c4c>{__wait_on_buffer+125} <ffffffff80178ad2>{bh_wake_function+0}
<ffffffff80178ad2>{bh_wake_function+0} <ffffffffa02352c6>{:gfs:gfs_dreread+154}
<ffffffffa0235332>{:gfs:gfs_dread+40} <ffffffffa02363b1>{:gfs:gfs_get_meta_buffer+201}
<ffffffffa0242999>{:gfs:gfs_copyin_dinode+23} <ffffffffa0242461>{:gfs:inode_go_lock+38}
<ffffffffa023f586>{:gfs:glock_wait_internal+563} <ffffffffa023fce6>{:gfs:gfs_glock_nq+961}
<ffffffffa023ff11>{:gfs:gfs_glock_nq_init+20} <ffffffffa0258b7b>{:gfs:gfs_private_nopage+84}
<ffffffff80168211>{do_no_page+1003} <ffffffff80167b13>{do_wp_page+948}
<ffffffff8016858f>{handle_mm_fault+343} <ffffffff80142a06>{get_signal_to_deliver+1118}
<ffffffff801236d2>{do_page_fault+518} <ffffffff80304a85>{thread_return+0}
<ffffffff80304add>{thread_return+88} <ffffffff80110c61>{error_exit+0}
sh D 000000000000001a 0 5930 2547 (NOTLB)
000001011090f8e8 0000000000000002 0000010111293d88 0000010110973d00
0000010111293d88 0000000000000000 00000100dfc02400 0000000000010000
00000101148557f0 0000000000002010
Call Trace:<ffffffff803053ef>{io_schedule+38} <ffffffff80178c4c>{__wait_on_buffer+125}
<ffffffff80178ad2>{bh_wake_function+0} <ffffffff80178ad2>{bh_wake_function+0}
<ffffffffa02352c6>{:gfs:gfs_dreread+154} <ffffffffa0235332>{:gfs:gfs_dread+40}
<ffffffffa02363b1>{:gfs:gfs_get_meta_buffer+201} <ffffffffa0242999>{:gfs:gfs_copyin_dinode+23}
<ffffffffa0242461>{:gfs:inode_go_lock+38} <ffffffffa023f586>{:gfs:glock_wait_internal+563}
<ffffffffa023fce6>{:gfs:gfs_glock_nq+961} <ffffffffa023ff11>{:gfs:gfs_glock_nq_init+20}
<ffffffff801ccb78>{dummy_inode_permission+0} <ffffffffa0257aca>{:gfs:gfs_permission+64}
<ffffffff8018d475>{dput+56} <ffffffff80183d32>{permission+51}
<ffffffff801844aa>{__link_path_walk+372} <ffffffff801851c2>{link_path_walk+82}
<ffffffff8012370b>{do_page_fault+575} <ffffffff801849b0>{__link_path_walk+1658}
<ffffffff801851c2>{link_path_walk+82} <ffffffff8012370b>{do_page_fault+575}
<ffffffff8018540f>{path_lookup+451} <ffffffff801856bb>{__user_walk+47}
<ffffffff8017ff1a>{vfs_stat+24} <ffffffff8012370b>{do_page_fault+575}
<ffffffff80180264>{sys_newstat+17} <ffffffff80110c61>{error_exit+0}
<ffffffff801101c6>{system_call+126}
Please, let me know if it gives you any clues.
On 6/15/06, David Teigland <teigland@xxxxxxxxxx> wrote:
On Thu, Jun 15, 2006 at 01:43:25AM +0300, Anton Kornev wrote:
> Is there any ideas of how to fix this? I mean either the reason ('D'
> state of killed httpd-s) or consequences (the GFS filesystem fully or
> partially become unavailable after this).
>
> I also appreciate any help with debugging the problem.
>
> I tried gfs_tool lockdump with decipher_lockstate_dump tool.
I don't see anything wrong in the lockdumps you gave, although I'm not an
expert at interpreting gfs lockdumps. Could you do a ps showing the wchan
for those processes? Using sysrq to get a stack dump would also be useful.
You might also do a dlm lock dump and pick out those locks:
echo "lockspace name" >> /proc/cluster/dlm_locks
cat /proc/cluster/dlm_locks
I/O stuck in gnbd could also be a problem, I'm not sure what the signs of
that might be apart from possibly the wchan.
Dave
--
Best Regards,
Anton Kornev.
-- Linux-cluster@xxxxxxxxxx https://www.redhat.com/mailman/listinfo/linux-cluster