Re: gfs2 filesystem crash with no recovery?

"Douglas O'Neal" <oneal@xxxxxxxxxxxx> · Thu, 18 Mar 2010 14:29:29 -0400

On 03/18/2010 10:04 AM, Steven Whitehouse wrote:
Hi,

On Thu, 2010-03-18 at 09:18 -0400, Douglas O'Neal wrote:

On 03/15/2010 09:55 AM, Douglas O'Neal wrote:

I have a problem with a gfs2 filesystem that is (was) being mounted 
from a single host.  The system appeared to have hung over the weekend 
so I unmounted and remounted the disk.  After a couple of minutes I 
received this in the kernel logs:

Mar 15 08:28:50 localhost kernel: GFS2: fsid=: Trying to join cluster "lock_nolock", "sde1"
Mar 15 08:28:50 localhost kernel: GFS2: fsid=sde1.0: Now mounting FS...
Mar 15 08:28:50 localhost kernel: GFS2: fsid=sde1.0: jid=0, already locked for use
Mar 15 08:28:50 localhost kernel: GFS2: fsid=sde1.0: jid=0: Looking at journal...
Mar 15 08:28:50 localhost kernel: GFS2: fsid=sde1.0: jid=0: Done
Mar 15 08:43:37 localhost kernel: GFS2: fsid=sde1.0: fatal: invalid metadata block
Mar 15 08:43:37 localhost kernel: GFS2: fsid=sde1.0:   bh = 4294972166 (type: exp=3, found=2)
Mar 15 08:43:37 localhost kernel: GFS2: fsid=sde1.0:   function = gfs2_rgrp_bh_get, file = fs/gfs2/rgrp.c, line = 759
Mar 15 08:43:37 localhost kernel: GFS2: fsid=sde1.0: about to withdraw this file system
Mar 15 08:43:37 localhost kernel: GFS2: fsid=sde1.0: withdrawn
Mar 15 08:43:37 localhost kernel: Pid: 3687, comm: cp Not tainted 2.6.32-gentoo-r7 #2
Mar 15 08:43:37 localhost kernel: Call Trace:
Mar 15 08:43:37 localhost kernel: [<ffffffffa03b285d>] ? gfs2_lm_withdraw+0x12d/0x160 [gfs2]
Mar 15 08:43:37 localhost kernel: [<ffffffff813bf22b>] ? io_schedule+0x4b/0x70
Mar 15 08:43:37 localhost kernel: [<ffffffff810cc560>] ? sync_buffer+0x0/0x50
Mar 15 08:43:37 localhost kernel: [<ffffffff813bf7a9>] ? out_of_line_wait_on_bit+0x79/0xa0
Mar 15 08:43:37 localhost kernel: [<ffffffff8104e740>] ? wake_bit_function+0x0/0x30
Mar 15 08:43:37 localhost kernel: [<ffffffff810cb162>] ? submit_bh+0x112/0x140
Mar 15 08:43:37 localhost kernel: [<ffffffffa03b2947>] ? gfs2_metatype_check_ii+0x47/0x60 [gfs2]
Mar 15 08:43:37 localhost kernel: [<ffffffffa03ae40b>] ? gfs2_rgrp_bh_get+0x1db/0x300 [gfs2]
Mar 15 08:43:37 localhost kernel: [<ffffffffa0397d86>] ? do_promote+0x116/0x200 [gfs2]
Mar 15 08:43:37 localhost kernel: [<ffffffffa03992a5>] ? finish_xmote+0x1a5/0x3a0 [gfs2]
Mar 15 08:43:37 localhost kernel: [<ffffffffa0398fcd>] ? do_xmote+0xfd/0x230 [gfs2]
Mar 15 08:43:37 localhost kernel: [<ffffffffa039986d>] ? gfs2_glock_nq+0x13d/0x320 [gfs2]
Mar 15 08:43:37 localhost kernel: [<ffffffffa03aea2d>] ? gfs2_inplace_reserve_i+0x1ed/0x7b0 [gfs2]
Mar 15 08:43:37 localhost kernel: [<ffffffffa0399581>] ? run_queue+0xe1/0x210 [gfs2]
Mar 15 08:43:37 localhost kernel: [<ffffffffa039986d>] ? gfs2_glock_nq+0x13d/0x320 [gfs2]
Mar 15 08:43:37 localhost kernel: [<ffffffffa03a1f92>] ? gfs2_write_begin+0x272/0x480 [gfs2]
Mar 15 08:43:37 localhost kernel: [<ffffffff8106df04>] ? generic_file_buffered_write+0x114/0x290
Mar 15 08:43:37 localhost kernel: [<ffffffff8106e4a8>] ? __generic_file_aio_write+0x278/0x450
Mar 15 08:43:37 localhost kernel: [<ffffffff8106e6d5>] ? generic_file_aio_write+0x55/0xb0
Mar 15 08:43:37 localhost kernel: [<ffffffff810a6a1b>] ? do_sync_write+0xdb/0x120
Mar 15 08:43:37 localhost kernel: [<ffffffff8104e710>] ? autoremove_wake_function+0x0/0x30
Mar 15 08:43:37 localhost kernel: [<ffffffff8108511f>] ? handle_mm_fault+0x1bf/0x850
Mar 15 08:43:37 localhost kernel: [<ffffffff8108b5cc>] ? mmap_region+0x23c/0x5d0
Mar 15 08:43:37 localhost kernel: [<ffffffff810a752b>] ? vfs_write+0xcb/0x160
Mar 15 08:43:37 localhost kernel: [<ffffffff810a76c3>] ? sys_write+0x53/0xa0
Mar 15 08:43:37 localhost kernel: [<ffffffff8100b2ab>] ? system_call_fastpath+0x16/0x1b
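
For anyone reading the withdraw message above, the "type: exp=3, found=2" pair can be decoded with the metadata type numbers from the kernel's gfs2_ondisk.h. The sketch below is only an illustration: the helper name is made up for this note, and the 4096-byte block size is the GFS2 default, not something stated in this thread.

```python
# Metadata type numbers as defined in the kernel's gfs2_ondisk.h.
GFS2_METATYPES = {
    0: "NONE",
    1: "SB (superblock)",
    2: "RG (resource group header)",
    3: "RB (resource group bitmap)",
    4: "DI (dinode)",
    5: "IN (indirect block)",
    6: "LF (directory leaf)",
    7: "JD (journaled data)",
    8: "LH (log header)",
    9: "LD (log descriptor)",
    10: "EA (extended attribute)",
    11: "ED (extended attribute data)",
    12: "LB (log buffer)",
    14: "QC (quota change)",
}


def describe_metatype_mismatch(expected, found):
    """Hypothetical helper: turn 'exp=N, found=M' into readable names."""
    exp_name = GFS2_METATYPES.get(expected, "unknown")
    found_name = GFS2_METATYPES.get(found, "unknown")
    return "expected %s, found %s" % (exp_name, found_name)


# Values copied from the log line "bh = 4294972166 (type: exp=3, found=2)".
BAD_BLOCK = 4294972166
ASSUMED_BLOCK_SIZE = 4096  # assumption: GFS2 default, not confirmed here

print(describe_metatype_mismatch(3, 2))
# How far the failing block sits above the 2**32 block boundary.
print("blocks above 2**32:", BAD_BLOCK - 2**32)
# Byte offset of that block on the device, under the assumed block size.
print("byte offset:", BAD_BLOCK * ASSUMED_BLOCK_SIZE)
```

Under those assumptions the failing block lies just past block 2**32, i.e. just past the 16 TiB mark of the volume. That may be a coincidence, but it could be worth checking whether anything in the storage path truncates block addresses to 32 bits.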

I again unmounted the disk but now when I try to fsck the filesystem I 
get:
urania# fsck.gfs2 -v /dev/sde1
Initializing fsck
Initializing lists...
Either the super block is corrupted, or this is not a GFS2 filesystem

The server is running kernel 2.6.32, 64-bit.  The array is a 
Jetstore 516iS with a single 28TB iSCSI volume defined.  The relevant 
line from the fstab is:
/dev/sde1        /illumina    gfs2    _netdev,rw,lockproto=lock_nolock

gfs2_tool isn't much help, nor is gfs2_edit:
urania# gfs2_tool sb /dev/sde1 all
/usr/src/cluster-3.0.7/gfs2/tool/../libgfs2/libgfs2.h: there isn't a GFS2 filesystem on /dev/sde1
urania# gfs2_edit -p sb /dev/sde1
bad seek: Invalid argument from gfs2_load_inode:416: block 3747350044811107074 (0x34014302ee029b02)
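
The block number in the gfs2_edit "bad seek" message can be sanity-checked against the size of the volume. This is a rough sketch only: it assumes the default 4096-byte GFS2 block size and reads "28TB" as 28 * 10**12 bytes, neither of which is confirmed in this thread, and the function name is invented for the example.

```python
ASSUMED_BLOCK_SIZE = 4096          # assumption: GFS2 default block size
ASSUMED_VOLUME_BYTES = 28 * 10**12  # assumption: "28TB" taken as decimal TB


def block_in_range(block, volume_bytes=ASSUMED_VOLUME_BYTES,
                   block_size=ASSUMED_BLOCK_SIZE):
    """Return True if the block number could exist on a volume this size."""
    return 0 <= block < volume_bytes // block_size


# Block reported by "gfs2_edit -p sb": decimal and hex forms from the message.
EDIT_BLOCK = 3747350044811107074
assert EDIT_BLOCK == int("34014302ee029b02", 16)

# Block reported by the kernel when the filesystem withdrew.
KERNEL_BLOCK = 4294972166

for name, block in (("gfs2_edit", EDIT_BLOCK), ("kernel", KERNEL_BLOCK)):
    print(name, block, "in range" if block_in_range(block) else "out of range")
```

With these assumptions the kernel's block number is plausible for the volume, while the one gfs2_edit tried to seek to is many orders of magnitude beyond the end of the device, which would be consistent with the tool following a pointer read from a damaged superblock.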

Is there an alternate superblock that I can use to mount the disk to 
at least get the last couple of days of data off of it?

Anybody?

What version of the userland tools are you using? There has been an
update recently to fsck designed to solve a number of problems. I have
never before seen a filesystem so badly corrupted that the super block
is unrecognisable. The super block is never altered during normal fs
usage.

Are you 100% certain that this volume was not being accessed by another
node on the network?

If you can save off the metadata then we can take a look at it. That
might not be possible with a corrupt superblock though, so an
alternative is to make it available somehow for us to look at.

Steve.

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster

Userland tools 3.0.7. The iSCSI array is on a closed network and is 
protected by a CHAP login. No other system has been configured to access 
the array. I have the first 1MB of the disk available at 
http://urania.dbi.udel.edu/sde.block.bz2 if you want to see the actual 
data. gfs2_edit will not pull the metadata off:

urania ~ # gfs2_edit savemeta /dev/sde /tmp/metasave
Segmentation fault
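
In case it helps anyone looking at the 1MB dump, here is a minimal sketch of how the superblock location could be checked by hand. It relies on the on-disk constants from gfs2_ondisk.h (magic 0x01161970, superblock at basic block 128 of 512 bytes, i.e. byte offset 65536 from the start of the filesystem, big-endian fields). The function names are made up for this example, and fs_start is an assumption to account for a partition offset if the dump was taken from the whole disk rather than the partition.

```python
import struct

GFS2_MAGIC = 0x01161970        # from gfs2_ondisk.h
GFS2_SB_OFFSET = 128 * 512     # GFS2_SB_ADDR basic blocks of 512 bytes
GFS2_METATYPE_SB = 1           # metadata type of the superblock


def read_meta_header(data, offset):
    """Return (mh_magic, mh_type) at offset, or None if out of bounds."""
    if offset < 0 or offset + 8 > len(data):
        return None
    # Both fields are big-endian 32-bit integers.
    return struct.unpack_from(">II", data, offset)


def has_gfs2_superblock(data, fs_start=0):
    """Check for a GFS2 superblock header 64 KiB past fs_start."""
    header = read_meta_header(data, fs_start + GFS2_SB_OFFSET)
    return header == (GFS2_MAGIC, GFS2_METATYPE_SB)


def find_gfs2_headers(data, step=512):
    """List (offset, mh_type) for every aligned position holding the magic."""
    hits = []
    for offset in range(0, len(data) - 7, step):
        magic, mh_type = struct.unpack_from(">II", data, offset)
        if magic == GFS2_MAGIC:
            hits.append((offset, mh_type))
    return hits
```

To try it on the dump, something like bz2.open("sde.block.bz2").read() would give the bytes to pass in (the local filename is taken from the URL above). If has_gfs2_superblock() is false at fs_start=0, find_gfs2_headers() shows whether any GFS2 metadata headers survive elsewhere in the first megabyte, for example at a partition offset.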

Doug
