GFS2 hangs on one node of a two node cluster

Hello,

this is the second time this week that all my GFS2 filesystems have become inaccessible on one node of my two-node cluster. I am using CentOS 5.3 (the evil AS ;-) ) and the cluster otherwise works well. The setup: clvmd manages 40 TB of SAN storage in combination with GFS2. No third-party repositories are used except Fedora EPEL 5. The kernel is the latest (2.6.18-128.4.1.el5) and all provided software updates are in place.

The symptom is that all GFS2 filesystems become inaccessible without user-space applications reporting any error, except for bash complaining at login: "-bash: cd: /san/home/USERNAME: Input/output error". Trying to unmount the filesystems did not succeed either. Only a reboot seems to fix the issue.

The first time the problem occurred, the system load rose above 800 (!!), but only because all processes were waiting for I/O.

Attached is an excerpt of the messages that seem to be related to the issue. Any hints, ideas and comments are highly appreciated. Thank you for your good work and help!

Marko
server kernel: GFS2: fsid=: Trying to join cluster "lock_dlm", "dps:sb01vg01lv02"
server kernel: GFS2: fsid=dps:sb01vg01lv02.1: Joined cluster. Now mounting FS...
server kernel: GFS2: fsid=dps:sb01vg01lv02.1: jid=1, already locked for use
server kernel: GFS2: fsid=dps:sb01vg01lv02.1: jid=1: Looking at journal...
server kernel: GFS2: fsid=dps:sb01vg01lv02.1: jid=1: Done
[...other non-cluster/GFS-related messages removed...]
server kernel: GFS2: fsid=dps:sb01vg01lv01.0: fatal: filesystem consistency error
server kernel: GFS2: fsid=dps:sb01vg01lv01.0:   RG = 9635245
server kernel: GFS2: fsid=dps:sb01vg01lv01.0:   function = gfs2_free_uninit_di, file = fs/gfs2/rgrp.c, line = 1653
server kernel: GFS2: fsid=dps:sb01vg01lv01.0: about to withdraw this file system
server kernel: GFS2: fsid=dps:sb01vg01lv01.0: telling LM to withdraw
server qdiskd[10309]: <warning> qdisk cycle took more than 1 second to complete (13.760000) 
server kernel: dlm: sb01vg01lv01: group leave failed -512 0
server dlm_controld[10331]: open "/sys/kernel/dlm/sb01vg01lv01/event_done" error -1 2
server kernel: GFS2: fsid=dps:sb01vg01lv01.0: withdrawn
server kernel: 
server kernel: Call Trace:
server kernel:  [<ffffffff885e252a>] :gfs2:gfs2_lm_withdraw+0xc1/0xd0
server kernel:  [<ffffffff8002ce63>] wake_up_bit+0x11/0x22
server kernel:  [<ffffffff800646f6>] __down_read+0x12/0x92
server kernel:  [<ffffffff885de02c>] :gfs2:do_promote+0x108/0x137
server kernel:  [<ffffffff80063bb7>] mutex_lock+0xd/0x1d
server kernel:  [<ffffffff885e51a3>] :gfs2:buf_lo_add+0x71/0x106
server kernel:  [<ffffffff885f56bf>] :gfs2:gfs2_consist_rgrpd_i+0x34/0x39
server kernel:  [<ffffffff885f2368>] :gfs2:gfs2_free_di+0x7c/0xef
server kernel:  [<ffffffff885e1549>] :gfs2:gfs2_dinode_dealloc+0x141/0x1a7
server kernel:  [<ffffffff885edd7b>] :gfs2:gfs2_delete_inode+0xeb/0x191
server kernel:  [<ffffffff885edcd6>] :gfs2:gfs2_delete_inode+0x46/0x191
server kernel:  [<ffffffff885edc90>] :gfs2:gfs2_delete_inode+0x0/0x191
server kernel:  [<ffffffff8002f191>] generic_delete_inode+0xc6/0x143
server kernel:  [<ffffffff885f2c88>] :gfs2:gfs2_inplace_reserve_i+0x63b/0x691
server kernel:  [<ffffffff885de019>] :gfs2:do_promote+0xf5/0x137
server kernel:  [<ffffffff885e740d>] :gfs2:gfs2_write_begin+0x116/0x33f
server kernel:  [<ffffffff885e8e8f>] :gfs2:gfs2_file_buffered_write+0x14b/0x2e5
server kernel:  [<ffffffff8000cdf5>] file_read_actor+0x0/0x154
server kernel:  [<ffffffff885e92c5>] :gfs2:__gfs2_file_aio_write_nolock+0x29c/0x2d4
server kernel:  [<ffffffff885e9468>] :gfs2:gfs2_file_write_nolock+0xaa/0x10f
server kernel:  [<ffffffff800c3206>] generic_file_read+0xac/0xc5
server kernel:  [<ffffffff8009dbae>] autoremove_wake_function+0x0/0x2e
server kernel:  [<ffffffff8009dbae>] autoremove_wake_function+0x0/0x2e
server kernel:  [<ffffffff885e95b8>] :gfs2:gfs2_file_write+0x49/0xa7
server kernel:  [<ffffffff80016591>] vfs_write+0xce/0x174
server kernel:  [<ffffffff80016e5e>] sys_write+0x45/0x6e
server kernel:  [<ffffffff8005d28d>] tracesys+0xd5/0xe0
server kernel: 
server kernel: GFS2: fsid=dps:sb01vg01lv01.0: gfs2_delete_inode: -5
[...then no logs for more than two hours, until I started diagnosing the issue...]
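
For anyone who wants to check their own logs for the same pattern, here is a minimal sketch that collects the GFS2 withdraw details per filesystem from syslog lines. It assumes the messages have exactly the shape shown in the excerpt above ("GFS2: fsid=...: fatal: ...", followed by "function = ..., file = ..., line = ..." and "withdrawn"); the function name summarize_withdraws and the regular expressions are my own and not part of any GFS2 tool.

```python
import re

# Assumed shape, taken from the excerpt: "GFS2: fsid=<cluster>:<fs>.<jid>: <message>"
GFS2_LINE = re.compile(r"GFS2: fsid=(?P<fsid>\S*?): (?P<msg>.*)$")
# Assumed shape of the source-location line that follows a fatal error
LOCATION = re.compile(
    r"function = (?P<function>\w+), file = (?P<file>[^,\s]+), line = (?P<line>\d+)"
)


def summarize_withdraws(lines):
    """Return a dict keyed by fsid for every filesystem that logged a fatal error."""
    events = {}
    for raw in lines:
        match = GFS2_LINE.search(raw)
        if not match:
            continue  # not a GFS2 kernel message
        fsid = match.group("fsid")
        msg = match.group("msg").strip()
        if msg.startswith("fatal:"):
            # First sign of trouble for this filesystem
            entry = events.setdefault(fsid, {"withdrawn": False})
            entry["fatal"] = msg[len("fatal:"):].strip()
        elif fsid in events:
            loc = LOCATION.match(msg)
            if loc:
                events[fsid]["function"] = loc.group("function")
                events[fsid]["file"] = loc.group("file")
                events[fsid]["line"] = int(loc.group("line"))
            elif msg == "withdrawn":
                events[fsid]["withdrawn"] = True
    return events
```

Feeding it the lines of /var/log/messages (for example via open("/var/log/messages")) should give one entry per filesystem that hit a consistency error, with the reporting function, source file and line, and whether the withdraw completed.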
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
