Usually I see that message when the storage backing GFS2 fails, or when a
pending fence is taking a long time and DLM is blocking. Can you paste your
cluster.conf please, along with the log messages starting a little before the
GFS2 issue began?

On 24/10/13 13:38, Juan Pablo Lorier wrote:
> Hi,
>
> I'm new to working with clusters and I've started setting up Samba with HA.
> For that, I have a "storage server" from Supermicro with 16 2TB drives and
> dual server slots that access the drives directly.
> I've set up an md RAID 6 with the drives and created an LVM volume on top
> of the RAID, with one LV and a GFS2 filesystem, to create common storage
> for the servers. Only one of the servers is a member of the cluster right
> now, running CentOS 6.4.
> From time to time, some users get corrupted files when they copy to the
> file server. I asked the Samba list about it and they pointed me to this
> list for help.
> The error I've found is this:
>
> Oct 22 17:50:04 nas kernel: INFO: task smbd:5994 blocked for more than 120 seconds.
> Oct 22 17:50:04 nas kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Oct 22 17:50:04 nas kernel: smbd          D 0000000000000000     0  5994   4870 0x00000084
> Oct 22 17:50:04 nas kernel: ffff88017c61b948 0000000000000086 ffff8801bb44b500 0000000000000000
> Oct 22 17:50:04 nas kernel: 00000000ffffffff 00000000ffffffff ffff88017c61b998 ffffffff8105b4d3
> Oct 22 17:50:04 nas kernel: ffff880175b03058 ffff88017c61bfd8 000000000000fb88 ffff880175b03058
> Oct 22 17:50:04 nas kernel: Call Trace:
> Oct 22 17:50:04 nas kernel: [<ffffffff8105b4d3>] ? perf_event_task_sched_out+0x33/0x80
> Oct 22 17:50:04 nas kernel: [<ffffffffa0557570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
> Oct 22 17:50:04 nas kernel: [<ffffffffa055757e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
> Oct 22 17:50:04 nas kernel: [<ffffffff814fef1f>] __wait_on_bit+0x5f/0x90
> Oct 22 17:50:04 nas kernel: [<ffffffffa0557570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
> Oct 22 17:50:04 nas kernel: [<ffffffff814fefc8>] out_of_line_wait_on_bit+0x78/0x90
> Oct 22 17:50:04 nas kernel: [<ffffffff81092190>] ? wake_bit_function+0x0/0x50
> Oct 22 17:50:04 nas kernel: [<ffffffffa05594f5>] gfs2_glock_wait+0x45/0x90 [gfs2]
> Oct 22 17:50:04 nas kernel: [<ffffffffa055a8f7>] gfs2_glock_nq+0x237/0x3d0 [gfs2]
> Oct 22 17:50:04 nas kernel: [<ffffffffa055c5c9>] gfs2_inode_lookup+0x129/0x300 [gfs2]
> Oct 22 17:50:04 nas kernel: [<ffffffffa054ea4d>] ? gfs2_dirent_search+0x16d/0x1a0 [gfs2]
> Oct 22 17:50:04 nas kernel: [<ffffffffa054efbe>] gfs2_dir_search+0x5e/0x80 [gfs2]
> Oct 22 17:50:04 nas kernel: [<ffffffffa055c32e>] gfs2_lookupi+0xde/0x1e0 [gfs2]
> Oct 22 17:50:04 nas kernel: [<ffffffffa0559ca8>] ? do_promote+0x208/0x330 [gfs2]
> Oct 22 17:50:04 nas kernel: [<ffffffffa055c39d>] ? gfs2_lookupi+0x14d/0x1e0 [gfs2]
> Oct 22 17:50:04 nas kernel: [<ffffffffa0569c76>] gfs2_lookup+0x36/0xd0 [gfs2]
> Oct 22 17:50:04 nas kernel: [<ffffffff81193ed7>] ? d_alloc+0x137/0x1b0
> Oct 22 17:50:04 nas kernel: [<ffffffff81189935>] do_lookup+0x1a5/0x230
> Oct 22 17:50:04 nas kernel: [<ffffffff81189ccd>] __link_path_walk+0x20d/0x1030
> Oct 22 17:50:04 nas kernel: [<ffffffff814a1d1a>] ? inet_recvmsg+0x5a/0x90
> Oct 22 17:50:04 nas kernel: [<ffffffff8118ad7a>] path_walk+0x6a/0xe0
> Oct 22 17:50:04 nas kernel: [<ffffffff8118af4b>] do_path_lookup+0x5b/0xa0
> Oct 22 17:50:04 nas kernel: [<ffffffff8118bbb7>] user_path_at+0x57/0xa0
> Oct 22 17:50:04 nas kernel: [<ffffffff81092150>] ? autoremove_wake_function+0x0/0x40
> Oct 22 17:50:04 nas kernel: [<ffffffff811807ec>] vfs_fstatat+0x3c/0x80
> Oct 22 17:50:04 nas kernel: [<ffffffff8118095b>] vfs_stat+0x1b/0x20
> Oct 22 17:50:04 nas kernel: [<ffffffff81180984>] sys_newstat+0x24/0x50
> Oct 22 17:50:04 nas kernel: [<ffffffff810d6ce2>] ? audit_syscall_entry+0x272/0x2a0
> Oct 22 17:50:04 nas kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
>
>
> Can anyone help me debug this? I'm trying to find out whether this is a
> GFS2 problem or whether it may be related to the hardware. I've started a
> GFS2 fsck, but it takes too long and I'd rather try to figure out the
> possible cause before taking the server out of production for more than
> 2 days (that is what I estimate it may take, judging by how long it took
> to partially complete pass 1).
> Regards,

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?
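
P.S. -- while you gather those logs, it's worth checking whether a pending
fence is what DLM is waiting on. The commands below are just a sketch using
the stock cman/dlm/GFS2 userland on CentOS 6.4; the "yourcluster:yourfs"
debugfs path is a placeholder for whatever your cluster and filesystem are
actually named:

  # Is the cluster quorate, and are all expected nodes members?
  cman_tool status
  cman_tool nodes

  # Fence domain state; a listed victim, or a wait state other than "none",
  # means a fence is still outstanding and DLM/GFS2 recovery is blocked
  # behind it.
  fence_tool ls

  # DLM lockspaces and their current state.
  dlm_tool ls

  # Dump GFS2 glocks for the hung filesystem; waiting holders show which
  # processes are stuck behind a lock. Needs debugfs mounted.
  mount -t debugfs none /sys/kernel/debug 2>/dev/null
  cat /sys/kernel/debug/gfs2/yourcluster:yourfs/glocks

If fence_tool shows a victim that never gets fenced, the hang points at
fencing rather than GFS2 itself or the hardware; if everything there looks
clean, the glock dump and the log messages around the hang are the next
place to look.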