RE: Possible XFS bug encountered in 3.14.0-rc3+

"Mears, Morgan" <Morgan.Mears@xxxxxxxxxx> · Thu, 13 Mar 2014 16:56:37 +0000

On Thu, Mar 13, 2014 at 11:32:22PM -0400, Ben Myers wrote:
> On Thu, Mar 13, 2014 at 02:47:58PM +0000, Mears, Morgan wrote:
>> On Wed, Mar 12, 2014 at 07:03:14PM -0400, Dave Chinner wrote:
>> > On Wed, Mar 12, 2014 at 08:14:32PM +0000, Mears, Morgan wrote:
>> >> Hi,
>> >> 
>> >> Please CC me on any responses; I don't subscribe to this list.
>> >> 
>> >> I ran into a possible XFS bug while doing some Oracle benchmarking.  My test
>> >> system is running a 3.14.0-rc3+ kernel built from the for-next branch of
>> >> git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git
>> >> on 2014-02-19 (last commit 1342f11e713792e53e4b7aa21167fe9caca81c4a).
>> >> 
>> >> The XFS instance in question is 200 GB and should have all default
>> >> parameters (mkfs.xfs /dev/mapper/<my_lun_partition>).  It contains Oracle
>> >> binaries and trace files.  At the time the issue occurred I had been
>> >> running Oracle with SQL*NET server tracing enabled.  The affected XFS
>> >> had filled up 100% with trace files several times; I was periodically
>> >> executing rm -f * in the trace file directory, which would reduce the
>> >> file system occupancy from 100% to 3%.  I had an Oracle load generating
>> >> tool running, so new log files were being created with some frequency.
>> >> 
>> >> The issue occurred during one of my rm -f * executions; afterwards the
>> >> file system would only produce errors.  Here is the traceback:
>> >> 
>> >> [1552067.297192] XFS: Internal error XFS_WANT_CORRUPTED_GOTO at line 1602 of file fs/xfs/xfs_alloc.c.  Caller 0xffffffffa04c4905
>> > 
>> > So, freeing a range that is already partially free. The problem
>> > appears to be in AG 15, according to the repair output.
>> > 
>> >> https://dl.dropboxusercontent.com/u/31522929/xfs-double-free-xfs_metadump-before-repair.gz
>> > 
>> > AGF 15 is full:
>> > 
>> > xfs_db> agf 15
>> > xfs_db> p
>> > magicnum = 0x58414746
>> > versionnum = 1
>> > seqno = 15
>> > length = 3276783
>> > bnoroot = 1
>> > cntroot = 2
>> > bnolevel = 1
>> > cntlevel = 1
>> > flfirst = 0
>> > fllast = 3
>> > flcount = 4
>> > freeblks = 1
>> > longest = 1
>> > btreeblks = 0
>> > uuid = 00000000-0000-0000-0000-000000000000
>> > lsn = 0
>> > crc = 0
>> > 
>> > And the one free block (other than the minimum 4 on teh AGFL) is:
>> > 
>> > xfs_db> p
>> > magic = 0x41425442
>> > level = 0
>> > numrecs = 1
>> > leftsib = null
>> > rightsib = null
>> > recs[1] = [startblock,blockcount] 1:[3119876,1]
>> > 
>> > But:
>> > 
>> > data fork in ino 940862056 claims dup extent, off - 11, start - 58836692, cnt 1
>> > correcting nextents for inode 940862056
>> > bad data fork in inode 940862056
>> > would have cleared inode 940862056
>> > 
>> > the block number here is in AG 14, which has much more free space:
>> > 
>> > xfs_db> p
>> > magicnum = 0x58414746
>> > versionnum = 1
>> > seqno = 14
>> > length = 3276783
>> > bnoroot = 1
>> > cntroot = 2
>> > bnolevel = 1
>> > cntlevel = 1
>> > flfirst = 42
>> > fllast = 45
>> > flcount = 4
>> > freeblks = 2092022
>> > longest = 2078090
>> > btreeblks = 0
>> > 
>> > which is in 2 extents:
>> > 
>> > xfs_db> a bnoroot
>> > xfs_db> p
>> > magic = 0x41425442
>> > level = 0
>> > numrecs = 2
>> > leftsib = null
>> > rightsib = null
>> > recs[1-2] = [startblock,blockcount] 1:[102466,13932] 2:[116476,2078090]
>> > xfs_db> convert agno 14 agbno 102466 fsb
>> > 0x3819042 (58822722)
>> > xfs_db> convert agno 14 agbno 116476 fsb
>> > 0x381c6fc (58836732)
>> > 
>> > and so 58836692 is just short of the second free space. Looking at
>> > all the other dup extent claims, they a remostly adjacent to the
>> > left edge of these two free spaces. No surprise - that's the way
>> > allocation occurs.
>> > 
>> > So, we've got a state where the allocation btree contains a
>> > corruption, so a shutdown occurs. The log has captured that
>> > corruption when it was made, so log recovery reintroduces that
>> > corruption. And so when the extent is freed after log recovery, the
>> > corruption is tripped over again.
>> > 
>> > There's two checkpoints in the log, both very small. The last
>> > modification to AGI 14 is there before it:
>> > 
>> > BUF: cnt:2 total:2 a:0x669350 len:24 a:0x6693d0 len:128 
>> >	     BUF:  #regs:2   start blkno:0x15dff891   len:1   bmap size:1   flags:0x2800
>> >	     AGF Buffer: (XAGF)
>> >		 ver:1  seq#:14  len:3276783  
>> >		 root BNO:1  CNT:2
>> >		 level BNO:1  CNT:1
>> >		 1st:42  last:45  cnt:4  freeblks:2092020  longest:2078090
>> > 
>> > As is the freespace btree buffer modification:
>> > 
>> > BUF: cnt:2 total:2 a:0x669460 len:24 a:0x6694e0 len:128 
>> >	     BUF:  #regs:2   start blkno:0x15dff898   len:8   bmap size:1   flags:0x2000
>> >	     BUF DATA
>> >  0 42544241  4000000 ffffffff ffffffff cb620100  1000000 fc7c0100  2000000 
>> >	magic = BTBA
>> >	level = 0
>> >	numrecs = 4
>> >	leftsib = NULLFSBLOCK
>> >	rightsib = NULLFSBLOCK
>> >	rec[0] = 0x162cb, 1 (90827,1)
>> >	rec[1] = 0x17cfc, 2 (97532,1)
>> >  8 47900100 67360000 fcc60100 8ab51f00 71ff3100  5000000 71ff3100  5000000 
>> >	rec[2] = 0x19047, 0x3667 (102471,13927)
>> >	rec[3] = 0x1c6fc, 0x1fb58a (116476,2078090)
>> > 
>> > so, from the pre-recovery case above, we've got two new freespaces
>> > in rec[0-1], rec[2] has 5 blocks removed from the left edge, and
>> > rec[3] is unchanged.
>> > 
>> > Confirming the ABTC buffer contains the same extents:
>> > 
>> > BUF: cnt:2 total:2 a:0x669570 len:24 a:0x6695f0 len:128 
>> >	     BUF:  #regs:2   start blkno:0x15dff8a0   len:8   bmap size:1   flags:0x2000
>> >	     BUF DATA
>> >  0 43544241  4000000 ffffffff ffffffff cb620100  1000000 fc7c0100  2000000 
>> >	magic = CTBA
>> >	level = 0
>> >	numrecs = 4
>> >	leftsib = NULLFSBLOCK
>> >	rightsib = NULLFSBLOCK
>> >	rec[0] = 0x162cb, 1 (90827,1)
>> >	rec[1] = 0x17cfc, 2 (97532,1)
>> >  8 47900100 67360000 fcc60100 8ab51f00 71ff3100  5000000 71ff3100  5000000 
>> >	rec[2] = 0x19047, 0x3667 (102471,13927)
>> >	rec[3] = 0x1c6fc, 0x1fb58a (116476,2078090)
>> > 10  8ca0100  8000000 71ff3100  5000000 71ff3100  5000000 71ff3100  5000000 
>> > 18 71ff3100  5000000 71ff3100  5000000 71ff3100  5000000 71ff3100  5000000 
>> > 
>> > It does. So the btrees contain consistent information, and so it's
>> > unlikely that we have a free space btree corruption in the log. So
>> > let's look at what was freed:
>> > 
>> > The EFI/EFDs in the log are:
>> > 
>> > EFI: cnt:1 total:1 a:0x668670 len:32 
>> >	     EFI:  #regs:1    num_extents:1  id:0xffff881496024af0
>> >	     (s: 0x38162cb, l: 1) 
>> > EFD: cnt:1 total:1 a:0x6684d0 len:32 
>> >	     EFD:  #regs: 1    num_extents: 1  id: 0xffff881496024af0
>> > ....
>> > EFI: cnt:1 total:1 a:0x667da0 len:32 
>> >	     EFI:  #regs:1    num_extents:1  id:0xffff8814960244b0
>> >	     (s: 0x3817cfc, l: 1) 
>> > EFD: cnt:1 total:1 a:0x65fbd0 len:32 
>> >	     EFD:  #regs: 1    num_extents: 1  id: 0xffff8814960244b0
>> > ....
>> > EFI: cnt:1 total:1 a:0x669250 len:32 
>> >	     EFI:  #regs:1    num_extents:1  id:0xffff881496024000
>> >	     (s: 0x3817cfd, l: 1) 
>> > EFD: cnt:1 total:1 a:0x6692d0 len:32 
>> >	     EFD:  #regs: 1    num_extents: 1  id: 0xffff881496024000
>> > 
>> > These are all in ag#14, and have agbnos of 0x162cb, 0x17cfc and
>> > 0x17cfd. This corresponds exactly to the two new records in the AG
>> > btree. The one remaining EFI doesn't have an EFD in th elog, so this
>> > is what is being replayed:
>> > 
>> > EFI: cnt:1 total:1 a:0x669f40 len:32 
>> >	     EFI:  #regs:1    num_extents:1  id:0xffff881496024640
>> >   (s: 0x3869ff9, l: 1) 
>> > 
>> > Which is at agbno 0x69ff9 (434169). That's off to the right of the
>> > start of the last freespace range in rec[3]. rec[3] starts at
>> > 0x1c6fc and ends at 0x1fb58a. That means this efi clearly lands
>> > inside that freespace range - it's not an edge case, it's deep in
>> > the interior of the freespace range.
>> > 
>> > The inode logged just before the EFI - likely it's owner:
>> > 
>> > INO: cnt:2 total:2 a:0x668040 len:56 a:0x667fb0 len:96 
>> >	     INODE: #regs:2   ino:0x3862d698  flags:0x1   dsize:0
>> >	     CORE inode:
>> >		 magic:IN  mode:0x81b0  ver:2  format:2  onlink:0
>> >		 uid:1001  gid:1001  nlink:0 projid:0
>> >		 atime:1394495104  mtime:1394495104  ctime:1394554526
>> >		 flushiter:1
>> >		 size:0x0  nblks:0x0  exsize:0  nextents:0  anextents:0
>> >		 forkoff:13  dmevmask:0x0  dmstate:0  flags:0x0  gen:-121998876
>> > 
>> > Is an unlinked inode that has had all it's block removed. Yup - it's
>> > on the unlinked list:
>> > 
>> > agi unlinked bucket 24 is 6477464 in ag 14 (inode=946001560)
>> > 
>> > So, prior to recovery, what did it contain? it's got 287 bytes of
>> > date, and a single extent:
>> > 
>> > u.bmx[0] = [startoff,startblock,blockcount,extentflag] 0:[0,59154425,1,0]
>> > 
>> > xfs_db> convert fsb 59154425 agno
>> > 0xe (14)
>> > xfs_db> convert fsb 59154425 agbno
>> > 0x69ff9 (434169)
>> > 
>> > Ok, so the corruption, whatever it was, happened a long time ago,
>> > and it's only when removing the file that it was tripped over.
>> > There's nothing more I can really get from this - the root cause of
>> > the corruption is long gone.
>> > 
>> > Cheers,
>> > 
>> > Dave.
>> > -- 
>> > Dave Chinner
>> > david@xxxxxxxxxxxxx
>> 
>> Thanks Dave.
>> 
>> Upon restarting my testing I immediately hit this error again (or a very
>> similar one in any case).  I suspect that the corruption you've noted was
>> not properly repaired by xfs_repair.
> 
> There are some kinds of corruption that xfs_repair was finding but not
> repairing until commit ea4a8de1e1.  I suggest you upgrade if you don't have
> this commit.  How long ago did you make this filesystem?
> 
> Thanks,
> 	Ben

I'm using the xfs_repair that came with RHEL 7.0 post-beta snapshot 4; -V
says "xfs_repair version 3.2.0-alpha2".  I don't think I have access to the
commit at which it was built.  I can certainly try the latest version,
though.

I made this file system on February 17th, 2014.  I happen to remember
because I was disappointed to be working on Presidents' Day :)

--Morgan

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs