correct procedure for mismatched UUIDs (error 117)

Vincent McIntyre <vincent.mcintyre@xxxxxxxx> · Tue, 8 Mar 2011 13:24:51 +1100

Hi,

I had a problem with an xfs filesystem that somehow ended up with
a mismatch between the UUID recorded in the superblock and the log.

My question is - what would have been the correct procedure here?
I know this should "never happen". But it has, in an extreme corner
case, and I'd be interested to know if there was anything different
we could have done. (Besides mounting by UUID in the first place...)

Here's what we did.

The platform is Debian Lenny, 64-bit.
% uname -a
Linux debian 2.6.26-2-amd64 #1 SMP Tue Jan 25 05:59:43 UTC 2011 x86_64 GNU/Linux
% dpkg -l|grep xfs
ii  xfsdump                              2.2.48-1                    Administrative utilities for the XFS filesystem
ii  xfsprogs                             2.9.8-1lenny1               Utilities for managing the XFS filesystem

We are using multipath-tools to address the storage.
% dpkg -l |grep multipath
ii  multipath-tools                      0.4.8-14+lenny2             maintain multipath block device access
ii  multipath-tools-boot                 0.4.8-14+lenny2             Support booting from multipath devices

We've used this successfully before, with the same combination
of storage (Promise Vtrak E610f) and fibre channel switch (QLogic SB5202).
Both filesystems were on whole-disk partitions on 9.6 TB disks.

What we think caused the problem was:
 * we are using the user-friendly names feature of multipath-tools
 * we changed the binding between user-friendly name and WWN
   for two filesystems - we just swapped the two mappings
 * we omitted to also change the mount paths in /etc/fstab.
Silly us.
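
In hindsight, a check along the following lines might have flagged
the risky fstab entries before we touched the bindings. This is only
a sketch: the function name is made up, and it assumes the friendly
names all look like /dev/mapper/mpathN as they do on our system.

```shell
#!/bin/sh
# Hypothetical helper (not part of multipath-tools or xfsprogs):
# list fstab entries that mount via a multipath user-friendly name
# (/dev/mapper/mpath...) instead of UUID= or LABEL=.
# The fstab path is an argument so the check can be tried on a copy.
friendly_name_mounts() {
    # Skip comment lines; print device and mount point of matching entries.
    awk '$1 !~ /^#/ && $1 ~ /^\/dev\/mapper\/mpath/ { print $1, $2 }' "$1"
}

# Example use:
#   friendly_name_mounts /etc/fstab
```

Any entry it prints depends on the name-to-WWN binding staying put,
so those are the ones to revisit whenever the bindings change.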

Things seemed ok until we tried to 'ls' one of the filesystems;
then we got a stack trace:
 Filesystem "dm-20": XFS internal error xfs_da_do_buf(2) at line 2085 of file fs/xfs/xfs_da_btree.c.  Caller 0xffffffffa027c48b
 Pid: 8687, comm: ls Not tainted 2.6.26-2-amd64 #1

 Call Trace:
  [<ffffffffa027c48b>] :xfs:xfs_da_read_buf+0x24/0x29
  [<ffffffffa027c339>] :xfs:xfs_da_do_buf+0x54e/0x636
  [<ffffffffa027c48b>] :xfs:xfs_da_read_buf+0x24/0x29
  [<ffffffff80276543>] get_page_from_freelist+0x45a/0x606
  [<ffffffffa027c48b>] :xfs:xfs_da_read_buf+0x24/0x29
  [<ffffffffa027f471>] :xfs:xfs_dir2_block_getdents+0x77/0x1b6
  [<ffffffffa027f471>] :xfs:xfs_dir2_block_getdents+0x77/0x1b6
  [<ffffffffa02abf88>] :xfs:xfs_hack_filldir+0x0/0x5b
  [<ffffffffa02abf88>] :xfs:xfs_hack_filldir+0x0/0x5b
  [<ffffffffa027e5ae>] :xfs:xfs_readdir+0x90/0xb5
  [<ffffffff802a6ed4>] filldir+0x0/0xb7
  [<ffffffffa02abf3b>] :xfs:xfs_file_readdir+0xff/0x14c
  [<ffffffff802a6ed4>] filldir+0x0/0xb7
  [<ffffffff802a6ed4>] filldir+0x0/0xb7
  [<ffffffff802a7000>] vfs_readdir+0x75/0xa7
  [<ffffffff802a7250>] sys_getdents+0x75/0xbd
  [<ffffffff8042ab79>] error_exit+0x0/0x60
  [<ffffffff8020beda>] system_call_after_swapgs+0x8a/0x8f

Syslog shows that, before the trace, the device had mounted cleanly:
 Filesystem "dm-20": Disabling barriers, not supported by the underlying device
 XFS mounting filesystem dm-20
 Ending clean XFS mount for filesystem: dm-20
We only saw a problem when we tried to access it.

Once we saw the ls failure we stopped and changed the mount paths for
the affected filesystems in fstab, then rebooted.
During boot, we got:
 XFS mounting filesystem dm-13
 XFS: log has mismatched uuid - can't recover
 XFS: failed to find log head
 XFS: log mount/recovery failed: error 117
 XFS: log mount failed

for both of the filesystems.
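
To pick the affected devices out of a long boot log, something like
the following could be used. It is a rough sketch that assumes the
"XFS mounting filesystem" line for a device always precedes its
"mismatched uuid" line, as in the excerpt above; the function name
is invented for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log text on stdin and print the
# device named on the most recent "XFS mounting filesystem <dev>" line
# each time a "log has mismatched uuid" message follows it.
# Assumes messages for different mounts are not interleaved.
mismatched_uuid_devices() {
    awk '/XFS mounting filesystem/ { dev = $NF }
         /log has mismatched uuid/ { print dev }'
}

# Example use:
#   dmesg | mismatched_uuid_devices
```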

We tried to revert the binding change but that didn't get us out of jail.
First we commented out the affected filesystems in /etc/fstab and rebooted.
When we tried to mount manually after checking the /dev/mapper paths
were what we thought they should be, we still got complaints about
mismatching UUIDs.

We ran xfs_check on both filesystems in turn.

We ran xfs_metadump, which ran without errors but did not seem to help us much.

Then we ran xfs_repair in -n mode on each filesystem.
The output looked a bit scary, so we deferred using it.

We ran xfs_admin -u on each filesystem, which told us what we already knew:
 # xfs_admin -u /dev/mapper/mpath0-part1
 warning: UUID in AG 1 differs to the primary SB
 UUID = bd57b07f-2f07-4cb3-a641-9f3ecf72ce26
 # xfs_admin -u /dev/mapper/mpath1-part1
 warning: UUID in AG 1 differs to the primary SB
 UUID = 118e731c-aca8-4c78-99d4-df297258dd63
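
To see which UUID each allocation group's superblock copy actually
holds, the per-AG superblocks can be read with xfs_db in read-only
mode. The sketch below is untested on these filesystems; the function
names are made up, and it assumes xfs_db prints the field as
"uuid = <value>".

```shell
#!/bin/sh
# Hypothetical helpers for comparing superblock UUIDs across AGs.

# Pull the value out of a "UUID = ..." or "uuid = ..." line on stdin;
# other lines (such as the xfs_admin warning) are ignored.
uuid_field() {
    sed -n 's/^[[:space:]]*[Uu][Uu][Ii][Dd] = //p'
}

# Print the UUID in the superblock copy of allocation group $2 on
# device $1. -r opens the device read-only, so nothing is modified.
# Assumes the "uuid = ..." output format noted above.
sb_uuid() {
    xfs_db -r -c "sb $2" -c "p uuid" "$1" | uuid_field
}

# Example use (device name from this report):
#   sb_uuid /dev/mapper/mpath0-part1 0
#   sb_uuid /dev/mapper/mpath0-part1 1
```

If the two values differ, that matches the warning xfs_admin printed.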

We tried mounting with -oro,nouuid,norecovery, but that didn't help:
 # mount -oro,nouuid,norecovery /dev/mapper/mpath0-part1 /recover
 # ls /recover/
 ls: reading directory /recover/: Structure needs cleaning
 # umount /recover

We tried xfs_logprint - the log had the same uuid in all the entries
that were printed out. This did not match the uuid of the SB.

By now we were running low on time, so we tried xfs_repair.
We tried one filesystem with -L and one without.
The former produced the expected jumble of inode-numbered files,
which we are in the process of piecing together.
The latter seemed to preserve the directory structure a bit better,
though there was still some jumbling-up.
I won't tax you with the full logs.

That's the story. Opinions?
Vince

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs