Possible problem with gfs2

Juan Pablo Lorier <jplorier@xxxxxxxxx> · Thu, 24 Oct 2013 15:38:27 -0200

Hi,

I'm new to working with clusters and I've started setting up Samba with HA.
For that, I have a "storage server" from Supermicro with 16 2 TB drives and
dual server slots that access the drives directly.
I've set up an md RAID 6 array with the drives and created an LVM volume on
top of the RAID with one LV, formatted with gfs2, to provide common storage
for the servers. Only one of the servers is a member of the cluster right
now; it runs CentOS 6.4.
From time to time, some users get corrupted files when they copy them to the
file server. I asked the Samba list about it and they pointed me to this
list for help.
The error I've found is this:

Oct 22 17:50:04 nas kernel: INFO: task smbd:5994 blocked for more than 120 seconds.
Oct 22 17:50:04 nas kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 22 17:50:04 nas kernel: smbd          D 0000000000000000     0  5994   4870 0x00000084
Oct 22 17:50:04 nas kernel: ffff88017c61b948 0000000000000086 ffff8801bb44b500 0000000000000000
Oct 22 17:50:04 nas kernel: 00000000ffffffff 00000000ffffffff ffff88017c61b998 ffffffff8105b4d3
Oct 22 17:50:04 nas kernel: ffff880175b03058 ffff88017c61bfd8 000000000000fb88 ffff880175b03058
Oct 22 17:50:04 nas kernel: Call Trace:
Oct 22 17:50:04 nas kernel: [<ffffffff8105b4d3>] ? perf_event_task_sched_out+0x33/0x80
Oct 22 17:50:04 nas kernel: [<ffffffffa0557570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
Oct 22 17:50:04 nas kernel: [<ffffffffa055757e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
Oct 22 17:50:04 nas kernel: [<ffffffff814fef1f>] __wait_on_bit+0x5f/0x90
Oct 22 17:50:04 nas kernel: [<ffffffffa0557570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
Oct 22 17:50:04 nas kernel: [<ffffffff814fefc8>] out_of_line_wait_on_bit+0x78/0x90
Oct 22 17:50:04 nas kernel: [<ffffffff81092190>] ? wake_bit_function+0x0/0x50
Oct 22 17:50:04 nas kernel: [<ffffffffa05594f5>] gfs2_glock_wait+0x45/0x90 [gfs2]
Oct 22 17:50:04 nas kernel: [<ffffffffa055a8f7>] gfs2_glock_nq+0x237/0x3d0 [gfs2]
Oct 22 17:50:04 nas kernel: [<ffffffffa055c5c9>] gfs2_inode_lookup+0x129/0x300 [gfs2]
Oct 22 17:50:04 nas kernel: [<ffffffffa054ea4d>] ? gfs2_dirent_search+0x16d/0x1a0 [gfs2]
Oct 22 17:50:04 nas kernel: [<ffffffffa054efbe>] gfs2_dir_search+0x5e/0x80 [gfs2]
Oct 22 17:50:04 nas kernel: [<ffffffffa055c32e>] gfs2_lookupi+0xde/0x1e0 [gfs2]
Oct 22 17:50:04 nas kernel: [<ffffffffa0559ca8>] ? do_promote+0x208/0x330 [gfs2]
Oct 22 17:50:04 nas kernel: [<ffffffffa055c39d>] ? gfs2_lookupi+0x14d/0x1e0 [gfs2]
Oct 22 17:50:04 nas kernel: [<ffffffffa0569c76>] gfs2_lookup+0x36/0xd0 [gfs2]
Oct 22 17:50:04 nas kernel: [<ffffffff81193ed7>] ? d_alloc+0x137/0x1b0
Oct 22 17:50:04 nas kernel: [<ffffffff81189935>] do_lookup+0x1a5/0x230
Oct 22 17:50:04 nas kernel: [<ffffffff81189ccd>] __link_path_walk+0x20d/0x1030
Oct 22 17:50:04 nas kernel: [<ffffffff814a1d1a>] ? inet_recvmsg+0x5a/0x90
Oct 22 17:50:04 nas kernel: [<ffffffff8118ad7a>] path_walk+0x6a/0xe0
Oct 22 17:50:04 nas kernel: [<ffffffff8118af4b>] do_path_lookup+0x5b/0xa0
Oct 22 17:50:04 nas kernel: [<ffffffff8118bbb7>] user_path_at+0x57/0xa0
Oct 22 17:50:04 nas kernel: [<ffffffff81092150>] ? autoremove_wake_function+0x0/0x40
Oct 22 17:50:04 nas kernel: [<ffffffff811807ec>] vfs_fstatat+0x3c/0x80
Oct 22 17:50:04 nas kernel: [<ffffffff8118095b>] vfs_stat+0x1b/0x20
Oct 22 17:50:04 nas kernel: [<ffffffff81180984>] sys_newstat+0x24/0x50
Oct 22 17:50:04 nas kernel: [<ffffffff810d6ce2>] ? audit_syscall_entry+0x272/0x2a0
Oct 22 17:50:04 nas kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
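
A trace like the one above can be easier to read once it is reduced to the
blocked task and the frames that belong to gfs2. The following is only a
minimal sketch in Python, assuming the log text is available as a string.
The names parse_hung_task, HUNG_RE, FRAME_RE and SAMPLE are made up for this
example, and the regular expressions are a guess that fits this particular
CentOS 6 log layout, not a general kernel log parser. Frames marked with "?"
are addresses found on the stack that the unwinder could not confirm, so the
sketch flags them as unreliable.

```python
import re

# Header the kernel prints for a hung task (format taken from the log above).
HUNG_RE = re.compile(r"INFO: task (\S+):(\d+) blocked for more than (\d+) seconds")

# One stack frame: [<address>] [?] symbol+0xoff/0xsize [module]
FRAME_RE = re.compile(
    r"\[<([0-9a-f]+)>\]\s+(\?\s+)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(\w+)\])?"
)

# Short excerpt of the trace, still wrapped the way the mail client wrapped it.
SAMPLE = """\
Oct 22 17:50:04 nas kernel: INFO: task smbd:5994 blocked for more than
120 seconds.
Oct 22 17:50:04 nas kernel: [<ffffffffa0557570>] ?
gfs2_glock_holder_wait+0x0/0x20 [gfs2]
Oct 22 17:50:04 nas kernel: [<ffffffffa055757e>]
gfs2_glock_holder_wait+0xe/0x20 [gfs2]
Oct 22 17:50:04 nas kernel: [<ffffffff814fef1f>] __wait_on_bit+0x5f/0x90
"""


def parse_hung_task(log_text):
    """Return (task, frames) extracted from a hung-task report."""
    # Collapse all whitespace so frames split across wrapped lines rejoin.
    text = " ".join(log_text.split())

    task = None
    m = HUNG_RE.search(text)
    if m:
        task = {
            "comm": m.group(1),
            "pid": int(m.group(2)),
            "timeout_s": int(m.group(3)),
        }

    frames = []
    for fm in FRAME_RE.finditer(text):
        frames.append({
            "symbol": fm.group(3),
            # Frames without a [module] suffix belong to the core kernel.
            "module": fm.group(4) or "kernel",
            # A leading "?" means the unwinder could not verify the frame.
            "reliable": fm.group(2) is None,
        })
    return task, frames


if __name__ == "__main__":
    task, frames = parse_hung_task(SAMPLE)
    print(task)
    for f in frames:
        if f["module"] == "gfs2" and f["reliable"]:
            print(f["symbol"])
```

Read from the bottom up, the reliable frames of the full trace go from
sys_newstat through gfs2_lookup and gfs2_inode_lookup to gfs2_glock_nq and
gfs2_glock_wait, i.e. smbd is blocked waiting for a glock while doing a
stat() on a path.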

Can anyone help me debug this? I'm trying to find out whether this is a
gfs2 problem or whether it may be related to the hardware. I've started a
gfs2 fsck, but it takes too long, and I'd rather try to figure out the
possible cause before taking the server out of production for more than
2 days (that is my estimate, based on how long it took to get partway
through pass1).
Regards,

-- 
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster