Hello,
I've been having MAJOR issues with GFS2 faulting while doing extremely
simple operations. My test bed is a GFS2 filesystem on /home; the storage
servers are two iSCSI targets running DRBD across both to replicate the
data, and the client side uses open-iscsi to bring the storage in,
multipathing across both nodes. I've also tried this setup without DRBD
and multipath, going straight iSCSI, with no difference in the problem.
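For reference, the client side is brought in with plain open-iscsi plus
dm-multipath, roughly like this (the portal IPs below are placeholders,
not my real addresses):

    # discover and log in to both iSCSI portals
    iscsiadm -m discovery -t sendtargets -p 192.168.10.11
    iscsiadm -m discovery -t sendtargets -p 192.168.10.12
    iscsiadm -m node --login
    # dm-multipath then coalesces the two paths into one device
    multipath -ll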
What is being done to cause this error is simple: I have an Apache server
running my home page, which is just DokuWiki. DokuWiki uses flat files for
everything, so it relies on file locking on the wiki data files.
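I don't know exactly which lock calls DokuWiki makes internally, but the
load it puts on the GFS2 mount is essentially a lock/append/unlock cycle
on files under the wiki data directory, something like this crude sketch
(paths made up for illustration):

    # approximate the wiki's lock/write cycle on the shared mount
    touch /home/wiki/data/pages/test.txt
    while true; do
        flock /home/wiki/data/pages/test.txt \
            -c 'echo "edit $(date +%s)" >> /home/wiki/data/pages/test.txt'
        sleep 1
    done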
After editing documents a few times within a short period (about 30
minutes, for example), whether the same document or a mix of documents,
GFS2 dumps a stack trace, faults, and crowbars both nodes to a complete
halt.
The two client servers using the same GFS2 mountpoint are KVM guests on
two physical virtualization hosts; the storage servers are bare-metal
iSCSI targets with DRBD replication between them over 1 Gb Ethernet. The
cluster glue is the clustering PPA stack for Ubuntu 10.04.1: pacemaker,
dlm_controld.pcmk, and gfs_controld.pcmk.
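The controld daemons are wired into pacemaker as cloned resources, roughly
like this (resource names are illustrative and the exact parameters are
from memory, so treat it as a sketch):

    crm configure primitive p-dlm ocf:pacemaker:controld \
        params daemon="dlm_controld.pcmk"
    crm configure primitive p-gfs ocf:pacemaker:controld \
        params daemon="gfs_controld.pcmk" args="-g 0 -l 0 -o 0"
    crm configure group g-lock p-dlm p-gfs
    crm configure clone cl-lock g-lock meta interleave="true"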
The only recovery that gets even one node back online is to forcefully
take down ALL nodes at once, because gfs_controld.pcmk will not die,
dlm_controld.pcmk will not die, /home cannot be unmounted, and open-iscsi
cannot even be stopped; they all just throw more stack traces and
timeouts.
After taking them down, I bring one node back up, configured not to
re-activate the GFS2 mount but to load gfs_controld.pcmk, fsck the
filesystem, and finally re-enable the mount before bringing the secondary
node back online. Obviously, taking all nodes down means 100% downtime
during that recovery period, so this completely fails as a recovery
strategy.
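For clarity, the single-node recovery step is roughly the following, with
dlm_controld.pcmk and gfs_controld.pcmk already running and the device
path standing in as a placeholder for my multipath device:

    # GFS2 resource still disabled in the cluster at this point
    fsck.gfs2 -y /dev/mapper/mpath0
    mount -t gfs2 /dev/mapper/mpath0 /home
    # then bring the secondary node back online and let it rejoin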
gfs_controld.pcmk is set up to start with the command-line arguments:
-g 0 -l 0 -o 0
Here are the stack traces I'm getting when it faults:
Jan 13 03:31:27 cweb1 kernel: [1387920.160141] INFO: task flush-251:1:27497 blocked for more than 120 seconds.
Jan 13 03:31:27 cweb1 kernel: [1387920.160802] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 13 03:31:27 cweb1 kernel: [1387920.161474] flush-251:1 D 0000000000000002 0 27497 2 0x00000000
Jan 13 03:31:27 cweb1 kernel: [1387920.161479] ffff88004854fa10 0000000000000046 0000000000015bc0 0000000000015bc0
Jan 13 03:31:27 cweb1 kernel: [1387920.161483] ffff880060bf03b8 ffff88004854ffd8 0000000000015bc0 ffff880060bf0000
Jan 13 03:31:27 cweb1 kernel: [1387920.161485] 0000000000015bc0 ffff88004854ffd8 0000000000015bc0 ffff880060bf03b8
Jan 13 03:31:27 cweb1 kernel: [1387920.161488] Call Trace:
Jan 13 03:31:27 cweb1 kernel: [1387920.161508] [<ffffffffa022e730>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
Jan 13 03:31:27 cweb1 kernel: [1387920.161516] [<ffffffffa022e73e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
Jan 13 03:31:27 cweb1 kernel: [1387920.161540] [<ffffffff81558fcf>] __wait_on_bit+0x5f/0x90
Jan 13 03:31:27 cweb1 kernel: [1387920.161547] [<ffffffffa022fd9d>] ? do_promote+0xcd/0x290 [gfs2]
Jan 13 03:31:27 cweb1 kernel: [1387920.161555] [<ffffffffa022e730>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
Jan 13 03:31:27 cweb1 kernel: [1387920.161558] [<ffffffff81559078>] out_of_line_wait_on_bit+0x78/0x90
Jan 13 03:31:27 cweb1 kernel: [1387920.161575] [<ffffffff810843c0>] ? wake_bit_function+0x0/0x40
Jan 13 03:31:27 cweb1 kernel: [1387920.161582] [<ffffffffa022f971>] gfs2_glock_wait+0x31/0x40 [gfs2]
Jan 13 03:31:27 cweb1 kernel: [1387920.161590] [<ffffffffa0230975>] gfs2_glock_nq+0x2a5/0x360 [gfs2]
Jan 13 03:31:27 cweb1 kernel: [1387920.161597] [<ffffffffa022f064>] ? gfs2_glock_put+0x104/0x130 [gfs2]
Jan 13 03:31:27 cweb1 kernel: [1387920.161606] [<ffffffffa02497f2>] gfs2_write_inode+0x82/0x190 [gfs2]
Jan 13 03:31:27 cweb1 kernel: [1387920.161614] [<ffffffffa02497ea>] ? gfs2_write_inode+0x7a/0x190 [gfs2]
Jan 13 03:31:27 cweb1 kernel: [1387920.161629] [<ffffffff811661d4>] writeback_single_inode+0x2b4/0x3d0
Jan 13 03:31:27 cweb1 kernel: [1387920.161631] [<ffffffff81166745>] writeback_sb_inodes+0x195/0x280
Jan 13 03:31:27 cweb1 kernel: [1387920.161638] [<ffffffff81061671>] ? dequeue_entity+0x1a1/0x1e0
Jan 13 03:31:27 cweb1 kernel: [1387920.161641] [<ffffffff81166f60>] writeback_inodes_wb+0xa0/0x1b0
Jan 13 03:31:27 cweb1 kernel: [1387920.161643] [<ffffffff811672ab>] wb_writeback+0x23b/0x2a0
Jan 13 03:31:27 cweb1 kernel: [1387920.161648] [<ffffffff81075f3c>] ? lock_timer_base+0x3c/0x70
Jan 13 03:31:27 cweb1 kernel: [1387920.161651] [<ffffffff8116748c>] wb_do_writeback+0x17c/0x190
Jan 13 03:31:27 cweb1 kernel: [1387920.161653] [<ffffffff81076050>] ? process_timeout+0x0/0x10
Jan 13 03:31:27 cweb1 kernel: [1387920.161656] [<ffffffff811674f3>] bdi_writeback_task+0x53/0xf0
Jan 13 03:31:27 cweb1 kernel: [1387920.161667] [<ffffffff8110e9c6>] bdi_start_fn+0x86/0x100
Jan 13 03:31:27 cweb1 kernel: [1387920.161669] [<ffffffff8110e940>] ? bdi_start_fn+0x0/0x100
Jan 13 03:31:27 cweb1 kernel: [1387920.161671] [<ffffffff81084006>] kthread+0x96/0xa0
Jan 13 03:31:27 cweb1 kernel: [1387920.161680] [<ffffffff810131ea>] child_rip+0xa/0x20
Jan 13 03:31:27 cweb1 kernel: [1387920.161683] [<ffffffff81083f70>] ? kthread+0x0/0xa0
Jan 13 03:31:27 cweb1 kernel: [1387920.161685] [<ffffffff810131e0>] ? child_rip+0x0/0x20
--
Eric Renfro