I'm having an identical problem.
I have 2 nodes running a WordPress instance with a TCP load balancer in
front of them distributing HTTP requests between them.
In the last 2 days, I've had 10+ instances where the GFS2 volume hangs
with:
Sep 16 14:05:10 wordpress3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 16 14:05:10 wordpress3 kernel: delete_workqu D 00000272  2676  3687     19  3688  3686 (L-TLB)
Sep 16 14:05:10 wordpress3 kernel:        f7839e38 00000046 3f1c322e 00000272 00000000 f57ab400 f7839df8 0000000a
Sep 16 14:05:10 wordpress3 kernel:        c3217aa0 3f1dcca8 00000272 00019a7a 00000001 c3217bac c3019744 f57c5ac0
Sep 16 14:05:10 wordpress3 kernel:        f8afa21c 00000003 f26162f0 00000000 f2213df8 00000018 c3019c00 f7839e6c
Sep 16 14:05:10 wordpress3 kernel: Call Trace:
Sep 16 14:05:10 wordpress3 kernel:  [<f8afa21c>] gdlm_bast+0x0/0x78 [lock_dlm]
Sep 16 14:05:10 wordpress3 kernel:  [<f8c3910e>] just_schedule+0x5/0x8 [gfs2]
Sep 16 14:05:10 wordpress3 kernel:  [<c061d2f5>] __wait_on_bit+0x33/0x58
Sep 16 14:05:10 wordpress3 kernel:  [<f8c39109>] just_schedule+0x0/0x8 [gfs2]
Sep 16 14:05:10 wordpress3 kernel:  [<f8c39109>] just_schedule+0x0/0x8 [gfs2]
Sep 16 14:05:10 wordpress3 kernel:  [<c061d37c>] out_of_line_wait_on_bit+0x62/0x6a
Sep 16 14:05:10 wordpress3 kernel:  [<c0436098>] wake_bit_function+0x0/0x3c
Sep 16 14:05:10 wordpress3 kernel:  [<f8c39102>] gfs2_glock_wait+0x27/0x2e [gfs2]
Sep 16 14:05:10 wordpress3 kernel:  [<f8c4c667>] gfs2_check_blk_type+0xbc/0x18c [gfs2]
Sep 16 14:05:10 wordpress3 kernel:  [<c061d312>] __wait_on_bit+0x50/0x58
Sep 16 14:05:10 wordpress3 kernel:  [<f8c39109>] just_schedule+0x0/0x8 [gfs2]
Sep 16 14:05:10 wordpress3 kernel:  [<f8c4c660>] gfs2_check_blk_type+0xb5/0x18c [gfs2]
Sep 16 14:05:10 wordpress3 kernel:  [<f8c4c3c8>] gfs2_rindex_hold+0x2b/0x148 [gfs2]
Sep 16 14:05:10 wordpress3 kernel:  [<f8c48273>] gfs2_delete_inode+0x6f/0x1a1 [gfs2]
Sep 16 14:05:10 wordpress3 kernel:  [<f8c4823b>] gfs2_delete_inode+0x37/0x1a1 [gfs2]
Sep 16 14:05:10 wordpress3 kernel:  [<f8c48204>] gfs2_delete_inode+0x0/0x1a1 [gfs2]
Sep 16 14:05:10 wordpress3 kernel:  [<c048cb02>] generic_delete_inode+0xa5/0x10f
Sep 16 14:05:10 wordpress3 kernel:  [<c048c5a6>] iput+0x64/0x66
Sep 16 14:05:10 wordpress3 kernel:  [<f8c3a8bb>] delete_work_func+0x49/0x53 [gfs2]
Sep 16 14:05:10 wordpress3 kernel:  [<c04332da>] run_workqueue+0x78/0xb5
Sep 16 14:05:10 wordpress3 kernel:  [<f8c3a872>] delete_work_func+0x0/0x53 [gfs2]
Sep 16 14:05:10 wordpress3 kernel:  [<c0433b8e>] worker_thread+0xd9/0x10b
Sep 16 14:05:10 wordpress3 kernel:  [<c041f81b>] default_wake_function+0x0/0xc
Sep 16 14:05:10 wordpress3 kernel:  [<c0433ab5>] worker_thread+0x0/0x10b
Sep 16 14:05:10 wordpress3 kernel:  [<c0435fa7>] kthread+0xc0/0xed
Sep 16 14:05:10 wordpress3 kernel:  [<c0435ee7>] kthread+0x0/0xed
Sep 16 14:05:10 wordpress3 kernel:  [<c0405c53>] kernel_thread_helper+0x7/0x10
And then a bunch more of the same for the httpd processes. I can
reproduce this pretty consistently by untarring a large tarball on the
volume; it seems like anything I/O-intensive triggers it.
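For reference, this is roughly the reproducer (the paths below are only
placeholders, not my real layout):

    # Untar a large archive onto the GFS2 mount; the mount point and
    # archive name are just examples.
    cd /gfs2/wordpress
    tar -xf /tmp/big-archive.tar
    # Other I/O-heavy jobs on the volume seem to trigger the same hang.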
Running CentOS 5.5 with kernel 2.6.18-194.11.1.el5 #1 SMP Tue Aug 10
19:09:06 EDT 2010 i686 i686 i386 GNU/Linux
I tried the hangalizer program and it always came back with:
/bin/ls: /gfs2/: No such file or directory
hb.medianewsgroup.com "/bin/ls /gfs2/"
/bin/ls: /gfs2/: No such file or directory
hb.medianewsgroup.com "/bin/ls /gfs2/"
No waiting glocks found on any node.
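(For anyone who wants to double-check the glock state by hand, this is
roughly what I look at; the debugfs path is my assumption for this kernel
and may differ on other versions:)

    # Mount debugfs if it isn't already mounted, then dump glock state.
    mount -t debugfs none /sys/kernel/debug 2>/dev/null
    # Replace cluster:fsname with whatever `mount` reports for the volume.
    cat /sys/kernel/debug/gfs2/cluster:fsname/glocks
    # Holder (H:) lines whose flags include W should indicate a waiter.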
Any ideas?
On 08/03/2010 01:38 PM, Scooter Morris wrote:
Hi all,
We continue to have gfs2 crashes and hangs on our production
cluster, so I'm beginning to think that we've done something really
wrong. Here is our set-up:
* 4 node cluster, only 3 participate in gfs2 filesystems
* Running several services on multiple nodes using gfs2:
o IMAP (dovecot)
o Web (apache with lots of python)
o Samba (using ctdb)
* GFS2 partitions are multipathed on an HP EVA-based SAN (no LVM)
-- here is fstab from one node (the three nodes are all the same):
LABEL=/1                      /                      ext3    defaults        1 1
LABEL=/boot1                  /boot                  ext3    defaults        1 2
tmpfs                         /dev/shm               tmpfs   defaults        0 0
devpts                        /dev/pts               devpts  gid=5,mode=620  0 0
sysfs                         /sys                   sysfs   defaults        0 0
proc                          /proc                  proc    defaults        0 0
LABEL=SW-cciss/c0d0p2         swap                   swap    defaults        0 0
LABEL=plato:Mail              /var/spool/mail        gfs2    noatime,_netdev
LABEL=plato:VarTmp            /var/tmp               gfs2    _netdev
LABEL=plato:UsrLocal          /usr/local             gfs2    noatime,_netdev
LABEL=plato:UsrLocalProjects  /usr/local/projects    gfs2    noatime,_netdev
LABEL=plato:Home2             /home/socr             gfs2    noatime,_netdev
LABEL=plato:HomeNoBackup      /home/socr/nobackup    gfs2    _netdev
LABEL=plato:DbBackup          /databases/backups     gfs2    noatime,_netdev
LABEL=plato:DbMol             /databases/mol         gfs2    noatime,_netdev
LABEL=plato:MolDbBlast        /databases/mol/blast   gfs2    noatime,_netdev
LABEL=plato:MolDbEmboss       /databases/mol/emboss  gfs2    noatime,_netdev
* Kernel version is: 2.6.18-194.3.1.el5 and all nodes are x86_64.
* What's happening is that every so often we start seeing gfs2-related
task hangs in the logs. In the last instance (last Friday), we got this:
Node 0:
[2010-07-30 13:23:25] INFO: task imap:25716 blocked for more than 120 seconds.
[2010-07-30 13:23:25] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2010-07-30 13:23:25] imap          D ffff8100010825a0     0 25716   9217  24080 25667 (NOTLB)
[2010-07-30 13:23:25]  ffff810619b59bc8 0000000000000086 ffff810113233f10 ffffffff00000000
[2010-07-30 13:23:26]  ffff81000f8c5cd0 000000000000000a ffff810233416040 ffff81082fd05100
[2010-07-30 13:23:26]  00012196d153c88e 0000000000008b81 ffff810233416228 0000000f6a949180
[2010-07-30 13:23:26] Call Trace:
[2010-07-30 13:23:26]  [<ffffffff887d0be6>] :gfs2:gfs2_dirent_find+0x0/0x4e
[2010-07-30 13:23:26]  [<ffffffff887d0c18>] :gfs2:gfs2_dirent_find+0x32/0x4e
[2010-07-30 13:23:26]  [<ffffffff887d5ee7>] :gfs2:just_schedule+0x0/0xe
[2010-07-30 13:23:26]  [<ffffffff887d5ef0>] :gfs2:just_schedule+0x9/0xe
[2010-07-30 13:23:26]  [<ffffffff80063a16>] __wait_on_bit+0x40/0x6e
[2010-07-30 13:23:26]  [<ffffffff887d5ee7>] :gfs2:just_schedule+0x0/0xe
[2010-07-30 13:23:26]  [<ffffffff80063ab0>] out_of_line_wait_on_bit+0x6c/0x78
[2010-07-30 13:23:26]  [<ffffffff800a0aec>] wake_bit_function+0x0/0x23
[2010-07-30 13:23:26]  [<ffffffff887d5ee2>] :gfs2:gfs2_glock_wait+0x2b/0x30
[2010-07-30 13:23:26]  [<ffffffff887e579e>] :gfs2:gfs2_permission+0x83/0xd5
[2010-07-30 13:23:26]  [<ffffffff887e5796>] :gfs2:gfs2_permission+0x7b/0xd5
[2010-07-30 13:23:26]  [<ffffffff8000ce97>] do_lookup+0x65/0x1e6
[2010-07-30 13:23:26]  [<ffffffff8000d918>] permission+0x81/0xc8
[2010-07-30 13:23:26]  [<ffffffff8000997f>] __link_path_walk+0x173/0xf42
[2010-07-30 13:23:26]  [<ffffffff8000e9e2>] link_path_walk+0x42/0xb2
[2010-07-30 13:23:26]  [<ffffffff8000ccb2>] do_path_lookup+0x275/0x2f1
[2010-07-30 13:23:26]  [<ffffffff8001280e>] getname+0x15b/0x1c2
[2010-07-30 13:23:27]  [<ffffffff80023876>] __user_walk_fd+0x37/0x4c
[2010-07-30 13:23:27]  [<ffffffff80028846>] vfs_stat_fd+0x1b/0x4a
[2010-07-30 13:23:27]  [<ffffffff800638b3>] schedule_timeout+0x92/0xad
[2010-07-30 13:23:27]  [<ffffffff80097dab>] process_timeout+0x0/0x5
[2010-07-30 13:23:27]  [<ffffffff800f8435>] sys_epoll_wait+0x3b8/0x3f9
[2010-07-30 13:23:27]  [<ffffffff800235a8>] sys_newstat+0x19/0x31
[2010-07-30 13:23:27]  [<ffffffff8005d229>] tracesys+0x71/0xe0
[2010-07-30 13:23:27]  [<ffffffff8005d28d>] tracesys+0xd5/0xe0
Node 1:
[2010-07-30 13:23:59] INFO: task pdflush:623 blocked for more than 120 seconds.
[2010-07-30 13:23:59] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2010-07-30 13:23:59] pdflush       D ffff810407069aa0     0   623    291    624   622 (L-TLB)
[2010-07-30 13:23:59]  ffff8106073c1bd0 0000000000000046 0000000000000001 ffff8103fea899a8
[2010-07-30 13:23:59]  ffff8106073c1c30 000000000000000a ffff8105fff7c0c0 ffff8107fff4c820
[2010-07-30 13:24:00]  0000ed85d9d7a027 0000000000011b50 ffff8105fff7c2a8 00000006f0a9d0d0
[2010-07-30 13:24:00] Call Trace:
[2010-07-30 13:24:00]  [<ffffffff8001a927>] submit_bh+0x10a/0x111
[2010-07-30 13:24:00]  [<ffffffff88802ee7>] :gfs2:just_schedule+0x0/0xe
[2010-07-30 13:24:00]  [<ffffffff88802ef0>] :gfs2:just_schedule+0x9/0xe
[2010-07-30 13:24:00]  [<ffffffff80063a16>] __wait_on_bit+0x40/0x6e
[2010-07-30 13:24:00]  [<ffffffff88802ee7>] :gfs2:just_schedule+0x0/0xe
[2010-07-30 13:24:00]  [<ffffffff80063ab0>] out_of_line_wait_on_bit+0x6c/0x78
[2010-07-30 13:24:00]  [<ffffffff800a0aec>] wake_bit_function+0x0/0x23
[2010-07-30 13:24:00]  [<ffffffff88802ee2>] :gfs2:gfs2_glock_wait+0x2b/0x30
[2010-07-30 13:24:00]  [<ffffffff88813269>] :gfs2:gfs2_write_inode+0x5f/0x152
[2010-07-30 13:24:00]  [<ffffffff88813261>] :gfs2:gfs2_write_inode+0x57/0x152
[2010-07-30 13:24:00]  [<ffffffff8002fbf8>] __writeback_single_inode+0x1e9/0x328
[2010-07-30 13:24:00]  [<ffffffff80020ec9>] sync_sb_inodes+0x1b5/0x26f
[2010-07-30 13:24:00]  [<ffffffff800a08a6>] keventd_create_kthread+0x0/0xc4
[2010-07-30 13:24:00]  [<ffffffff8005123a>] writeback_inodes+0x82/0xd8
[2010-07-30 13:24:00]  [<ffffffff800c97b5>] wb_kupdate+0xd4/0x14e
[2010-07-30 13:24:00]  [<ffffffff80056879>] pdflush+0x0/0x1fb
[2010-07-30 13:24:00]  [<ffffffff800569ca>] pdflush+0x151/0x1fb
[2010-07-30 13:24:00]  [<ffffffff800c96e1>] wb_kupdate+0x0/0x14e
[2010-07-30 13:24:01]  [<ffffffff80032894>] kthread+0xfe/0x132
[2010-07-30 13:24:01]  [<ffffffff8009d734>] request_module+0x0/0x14d
[2010-07-30 13:24:01]  [<ffffffff8005dfb1>] child_rip+0xa/0x11
[2010-07-30 13:24:01]  [<ffffffff800a08a6>] keventd_create_kthread+0x0/0xc4
[2010-07-30 13:24:01]  [<ffffffff80032796>] kthread+0x0/0x132
[2010-07-30 13:24:01]  [<ffffffff8005dfa7>] child_rip+0x0/0x11
Node 2:
[2010-07-30 13:24:46] INFO: task delete_workqueu:7175 blocked for more than 120 seconds.
[2010-07-30 13:24:46] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2010-07-30 13:24:46] delete_workqu D ffff81082b5cf860     0  7175    329   7176  7174 (L-TLB)
[2010-07-30 13:24:46]  ffff81081ed6dbf0 0000000000000046 0000000000000018 ffffffff887a84f3
[2010-07-30 13:24:46]  0000000000000286 000000000000000a ffff81082dd477e0 ffff81082b5cf860
[2010-07-30 13:24:46]  00012166bf7ec21d 000000000002ed0b ffff81082dd479c8 00000007887a9e5a
[2010-07-30 13:24:46] Call Trace:
[2010-07-30 13:24:46]  [<ffffffff887a84f3>] :dlm:request_lock+0x93/0xa0
[2010-07-30 13:24:47]  [<ffffffff8884f556>] :lock_dlm:gdlm_ast+0x0/0x311
[2010-07-30 13:24:47]  [<ffffffff8884f2c1>] :lock_dlm:gdlm_bast+0x0/0x8d
[2010-07-30 13:24:47]  [<ffffffff887d3ee7>] :gfs2:just_schedule+0x0/0xe
[2010-07-30 13:24:47]  [<ffffffff887d3ef0>] :gfs2:just_schedule+0x9/0xe
[2010-07-30 13:24:47]  [<ffffffff80063a16>] __wait_on_bit+0x40/0x6e
[2010-07-30 13:24:47]  [<ffffffff887d3ee7>] :gfs2:just_schedule+0x0/0xe
[2010-07-30 13:24:47]  [<ffffffff80063ab0>] out_of_line_wait_on_bit+0x6c/0x78
[2010-07-30 13:24:47]  [<ffffffff800a0aec>] wake_bit_function+0x0/0x23
[2010-07-30 13:24:47]  [<ffffffff887d3ee2>] :gfs2:gfs2_glock_wait+0x2b/0x30
[2010-07-30 13:24:47]  [<ffffffff887e82cf>] :gfs2:gfs2_check_blk_type+0xd7/0x1c9
[2010-07-30 13:24:47]  [<ffffffff887e82c7>] :gfs2:gfs2_check_blk_type+0xcf/0x1c9
[2010-07-30 13:24:47]  [<ffffffff80063ab0>] out_of_line_wait_on_bit+0x6c/0x78
[2010-07-30 13:24:47]  [<ffffffff887e804f>] :gfs2:gfs2_rindex_hold+0x32/0x12b
[2010-07-30 13:24:47]  [<ffffffff887d5a29>] :gfs2:delete_work_func+0x0/0x65
[2010-07-30 13:24:47]  [<ffffffff887d5a29>] :gfs2:delete_work_func+0x0/0x65
[2010-07-30 13:24:47]  [<ffffffff887e3e3a>] :gfs2:gfs2_delete_inode+0x76/0x1b4
[2010-07-30 13:24:47]  [<ffffffff887e3e01>] :gfs2:gfs2_delete_inode+0x3d/0x1b4
[2010-07-30 13:24:47]  [<ffffffff8000d3ba>] dput+0x2c/0x114
[2010-07-30 13:24:48]  [<ffffffff887e3dc4>] :gfs2:gfs2_delete_inode+0x0/0x1b4
[2010-07-30 13:24:48]  [<ffffffff8002f35e>] generic_delete_inode+0xc6/0x143
[2010-07-30 13:24:48]  [<ffffffff887d5a83>] :gfs2:delete_work_func+0x5a/0x65
[2010-07-30 13:24:48]  [<ffffffff8004d8f0>] run_workqueue+0x94/0xe4
[2010-07-30 13:24:48]  [<ffffffff8004a12b>] worker_thread+0x0/0x122
[2010-07-30 13:24:48]  [<ffffffff800a08a6>] keventd_create_kthread+0x0/0xc4
[2010-07-30 13:24:48]  [<ffffffff8004a21b>] worker_thread+0xf0/0x122
[2010-07-30 13:24:48]  [<ffffffff8008d087>] default_wake_function+0x0/0xe
[2010-07-30 13:24:48]  [<ffffffff800a08a6>] keventd_create_kthread+0x0/0xc4
[2010-07-30 13:24:48]  [<ffffffff800a08a6>] keventd_create_kthread+0x0/0xc4
[2010-07-30 13:24:48]  [<ffffffff80032894>] kthread+0xfe/0x132
[2010-07-30 13:24:48]  [<ffffffff8005dfb1>] child_rip+0xa/0x11
[2010-07-30 13:24:48]  [<ffffffff800a08a6>] keventd_create_kthread+0x0/0xc4
[2010-07-30 13:24:48]  [<ffffffff80032796>] kthread+0x0/0x132
[2010-07-30 13:24:48]  [<ffffffff8005dfa7>] child_rip+0x0/0x11
* Various messages related to hung_task_timeouts repeated on each
node (usually related to imap).
* Within a minute or two, the cluster was completely hung. Root
could log into the console, but commands (like dmesg) would just
hang.
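(If it helps anyone suggest next steps: here's a sketch of what I assume
we could still capture from the console via sysrq when it gets to that
point:)

    # Assumes sysrq is enabled; 'w' dumps blocked (D-state) tasks to the
    # kernel log, 't' dumps all tasks if 'w' isn't enough.
    echo 1 > /proc/sys/kernel/sysrq
    echo w > /proc/sysrq-trigger
    # Read the result from the console or serial log, since dmesg itself
    # may hang at that point.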
So, my major question: is there something wrong with my
configuration? Have we done something really stupid? The initial
response from Red Hat was that we shouldn't run services on multiple
nodes that access gfs2, which seems a little confusing, since we would
just use ext3 or ext4 if we were going to node-lock (or fail over) the
partitions. Have we missed something somewhere?
Thanks in advance for any help anyone can give. We're getting pretty
desperate here since the downtime is starting to have a significant
impact on our credibility.
-- scooter
--
Jeff Howell
Sr. Linux Administrator
Media News Group interactive
303.563.6394 jhowell@xxxxxxxxxxxxxxxxxx
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster