8-node cluster, Fibre Channel HBAs, with disks accessed through a QLogic fabric.
I've been hit 3 times by this error, on different nodes:
GFS2: fsid=CyberCluster:GizServer.1: fatal: filesystem consistency error
GFS2: fsid=CyberCluster:GizServer.1: inode = 9582 6698267
GFS2: fsid=CyberCluster:GizServer.1: function = gfs2_dinode_dealloc, file = fs/gfs2/inode.c, line = 352
GFS2: fsid=CyberCluster:GizServer.1: about to withdraw this file system
GFS2: fsid=CyberCluster:GizServer.1: telling LM to unmount
GFS2: fsid=CyberCluster:GizServer.1: withdrawn
Pid: 2659, comm: delete_workqueu Tainted: G W ---------------- T 2.6.32-131.2.1.el6.x86_64 #1
Call Trace:
[<ffffffffa044ffd2>] ? gfs2_lm_withdraw+0x102/0x130 [gfs2]
[<ffffffffa0425209>] ? trunc_dealloc+0xa9/0x130 [gfs2]
[<ffffffffa04501dd>] ? gfs2_consist_inode_i+0x5d/0x60 [gfs2]
[<ffffffffa0435584>] ? gfs2_dinode_dealloc+0x64/0x210 [gfs2]
[<ffffffffa044e1da>] ? gfs2_delete_inode+0x1ba/0x280 [gfs2]
[<ffffffffa044e0ad>] ? gfs2_delete_inode+0x8d/0x280 [gfs2]
[<ffffffffa044e020>] ? gfs2_delete_inode+0x0/0x280 [gfs2]
[<ffffffff8118cfbe>] ? generic_delete_inode+0xde/0x1d0
[<ffffffffa0432940>] ? delete_work_func+0x0/0x80 [gfs2]
[<ffffffff8118d115>] ? generic_drop_inode+0x65/0x80
[<ffffffffa044cc4e>] ? gfs2_drop_inode+0x2e/0x30 [gfs2]
[<ffffffff8118bf82>] ? iput+0x62/0x70
[<ffffffffa0432994>] ? delete_work_func+0x54/0x80 [gfs2]
[<ffffffff810887d0>] ? worker_thread+0x170/0x2a0
[<ffffffff8108e100>] ? autoremove_wake_function+0x0/0x40
[<ffffffff81088660>] ? worker_thread+0x0/0x2a0
[<ffffffff8108dd96>] ? kthread+0x96/0xa0
[<ffffffff8100c1ca>] ? child_rip+0xa/0x20
[<ffffffff8108dd00>] ? kthread+0x0/0xa0
[<ffffffff8100c1c0>] ? child_rip+0x0/0x20
no_formal_ino = 9582
no_addr = 6698267
i_disksize = 6838
blocks = 0
i_goal = 6698304
i_diskflags = 0x00000000
i_height = 1
i_depth = 0
i_entries = 0
i_eattr = 0
GFS2: fsid=CyberCluster:GizServer.1: gfs2_delete_inode: -5
gdlm_unlock 5,66351b err=-22
The same each time, only with different inodes.
After that event, services running on that filesystem are marked failed and
are not moved over to another node. Any access to that fs yields an I/O error.
The server needed to be rebooted to work properly again.
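To compare the three occurrences side by side, the withdraw messages could be pulled out of each node's kernel log with something like the sketch below. It is only an illustration: the function name `summarize_withdraws` is made up here, and the patterns are derived solely from the messages quoted above, so other kernel versions may word them differently. The second pattern tolerates a line break after `file =`, since long lines are often wrapped in mail.

```python
import re

# Matches lines such as:
#   GFS2: fsid=CyberCluster:GizServer.1: inode = 9582 6698267
INODE_RE = re.compile(r"GFS2: fsid=(\S+?): inode = (\d+) (\d+)")

# Matches "function = gfs2_dinode_dealloc, file = fs/gfs2/inode.c, line = 352".
# \s* after "file =" allows for a wrapped line.
FUNC_RE = re.compile(r"function = (\w+), file =\s*(\S+), line = (\d+)")


def summarize_withdraws(log_text):
    """Return one dict per 'inode =' message found in log_text.

    Each dict holds the fsid and the two inode numbers, plus the
    function/file/line reported by the first 'function =' message
    that follows it (if any).
    """
    events = []
    for m in INODE_RE.finditer(log_text):
        event = {
            "fsid": m.group(1),
            "no_formal_ino": int(m.group(2)),
            "no_addr": int(m.group(3)),
        }
        f = FUNC_RE.search(log_text, m.end())
        if f:
            event["function"] = f.group(1)
            event["file"] = f.group(2)
            event["line"] = int(f.group(3))
        events.append(event)
    return events
```

Running this over the saved log of each affected node and comparing the `function` and `line` fields would show whether all three withdraws come from the same check, with only the inode numbers changing.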
I ran an fsck last night on that filesystem, and it did find some errors,
but nothing serious. Lots (really lots) of these:
Ondisk and fsck bitmaps differ at block 5771602 (0x581152)
Ondisk status is 1 (Data) but FSCK thinks it should be 0 (Free)
Metadata type is 0 (free)
Fix bitmap for block 5771602 (0x581152) ? (y/n)
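Since there are so many of these messages, a tally may be easier to read than the raw output. The sketch below assumes the fsck output was saved to a text file and that the messages are worded exactly as quoted above; the function name `tally_bitmap_diffs` is hypothetical.

```python
import re
from collections import Counter

# "Ondisk and fsck bitmaps differ at block 5771602 (0x581152)"
BLOCK_RE = re.compile(r"Ondisk and fsck bitmaps differ at block (\d+)")

# "Ondisk status is 1 (Data) but FSCK thinks it should be 0 (Free)"
STATUS_RE = re.compile(
    r"Ondisk status is (\d+) \((\w+)\) but FSCK thinks it should be (\d+) \((\w+)\)"
)


def tally_bitmap_diffs(fsck_text):
    """Return (blocks, transitions) for the bitmap mismatches in fsck_text.

    blocks      : list of block numbers reported as differing
    transitions : Counter keyed by (on-disk state, state fsck expects)
    """
    blocks = [int(b) for b in BLOCK_RE.findall(fsck_text)]
    transitions = Counter(
        (m.group(2), m.group(4)) for m in STATUS_RE.finditer(fsck_text)
    )
    return blocks, transitions
```

Calling `tally_bitmap_diffs(open("fsck.log").read())` (the file name is just an example) gives the list of affected blocks and a count per kind of mismatch, which makes it easier to see whether the mismatches are all of the Data-to-Free kind shown here and whether the blocks cluster in one region.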
And after completing the fsck, I started some services back up, and I got the
same error on another filesystem that is practically empty and used for small
utilities used throughout the cluster...
What should I do to find the source of this problem?
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster