Re: XFS journal write ordering constraints?

Apologies for the delay; reproducing took much longer than expected.
Here are full metadumps from before and after recovery:
       https://s3.amazonaws.com/permabit-development/20170615-xfsUpload/metadump-post-recovery
       https://s3.amazonaws.com/permabit-development/20170615-xfsUpload/metadump-pre-recovery

I also hit a different, less common problem, which may or may not
have the same origin; I'm hopeful it shares the same root cause and
helps with diagnosis. The symptom is that the XFS mount fails with
"Structure needs cleaning". The backtrace in the kernel log looks
like:

Nov 12 02:43:03 localhost kernel: [  112.897334] XFS (dm-4): Internal
error xlog_valid_rec_header(1) at line 4104 of file
fs/xfs/xfs_log_recover.c.  Caller xlog_do_recovery_pass
+0x4fc/0x5a0 [xfs]
Nov 12 02:43:03 localhost kernel: [  112.897339] CPU: 0 PID: 2435
Comm: mount Tainted: P           OE  ------------
3.10.0-327.36.2.el7.x86_64 #1
Nov 12 02:43:03 localhost kernel: [  112.897342]  ffff880067cdc000
00000000fa6da3c8 ffff88003c9fbb20 ffffffff81636401
Nov 12 02:43:03 localhost kernel: [  112.897347]  ffff88003c9fbb38
ffffffffa0360e6b ffffffffa03891fc ffff88003c9fbb48
Nov 12 02:43:03 localhost kernel: [  112.897350]  ffffffffa0385458
ffff88003c9fbc50 ffffffffa03891fc ffffffff81cae082
Nov 12 02:43:03 localhost kernel: [  112.897353] Call Trace:
Nov 12 02:43:03 localhost kernel: [  112.897366]  [<ffffffff81636401>]
dump_stack+0x19/0x1b
Nov 12 02:43:03 localhost kernel: [  112.897387]  [<ffffffffa0360e6b>]
xfs_error_report+0x3b/0x40 [xfs]
Nov 12 02:43:03 localhost kernel: [  112.897409]  [<ffffffffa03891fc>]
? xlog_do_recovery_pass+0x4fc/0x5a0 [xfs]
Nov 12 02:43:03 localhost kernel: [  112.897430]  [<ffffffffa0385458>]
xlog_valid_rec_header.isra.12+0xd8/0xe0 [xfs]
Nov 12 02:43:03 localhost kernel: [  112.897451]  [<ffffffffa03891fc>]
xlog_do_recovery_pass+0x4fc/0x5a0 [xfs]
Nov 12 02:43:03 localhost kernel: [  112.897458]  [<ffffffff8107e719>]
? vprintk_default+0x29/0x40
Nov 12 02:43:03 localhost kernel: [  112.897462]  [<ffffffff8162fe31>]
? printk+0x5e/0x75
Nov 12 02:43:03 localhost kernel: [  112.897467]  [<ffffffff811c25d3>]
? __kmalloc+0x1f3/0x230
Nov 12 02:43:03 localhost kernel: [  112.897488]  [<ffffffffa037bf57>]
? kmem_alloc+0x77/0xf0 [xfs]
Nov 12 02:43:03 localhost kernel: [  112.897510]  [<ffffffffa0389309>]
xlog_do_log_recovery+0x69/0xd0 [xfs]
Nov 12 02:43:03 localhost kernel: [  112.897530]  [<ffffffffa0389386>]
xlog_do_recover+0x16/0xe0 [xfs]
Nov 12 02:43:03 localhost kernel: [  112.897549]  [<ffffffffa038a19f>]
xlog_recover+0x9f/0x130 [xfs]
Nov 12 02:43:03 localhost kernel: [  112.897569]  [<ffffffffa037e01c>]
xfs_log_mount+0x24c/0x2c0 [xfs]
Nov 12 02:43:03 localhost kernel: [  112.897589]  [<ffffffffa037577c>]
xfs_mountfs+0x42c/0x750 [xfs]
Nov 12 02:43:03 localhost kernel: [  112.897609]  [<ffffffffa03789a5>]
xfs_fs_fill_super+0x355/0x3d0 [xfs]
Nov 12 02:43:03 localhost kernel: [  112.897615]  [<ffffffff811e26c0>]
mount_bdev+0x1b0/0x1f0
Nov 12 02:43:03 localhost kernel: [  112.897634]  [<ffffffffa0378650>]
? xfs_parseargs+0xbf0/0xbf0 [xfs]
Nov 12 02:43:03 localhost kernel: [  112.897653]  [<ffffffffa03768c5>]
xfs_fs_mount+0x15/0x20 [xfs]
Nov 12 02:43:03 localhost kernel: [  112.897657]  [<ffffffff811e2e59>]
mount_fs+0x39/0x1b0

The full metadumps for _this_ failure are at:
       https://s3.amazonaws.com/permabit-development/20170613-xfsUpload/metadump-pre-recovery
       https://s3.amazonaws.com/permabit-development/20170613-xfsUpload/metadump-post-recovery


> P.S., I've also been assuming based on the log state that you haven't
> used the logbufs/logbsize mount options to change from the defaults, but
> please do confirm that you aren't changing either of those defaults.
> Thanks.

Correct. The mkfs invocation is "mkfs -t xfs -f -q -K -b size=1024
$device" and the mount invocation is "mount -t xfs -o discard $device
$mountpoint".

Thanks again!

On Mon, Jun 12, 2017 at 12:18 PM, Brian Foster <bfoster@xxxxxxxxxx> wrote:
> On Mon, Jun 12, 2017 at 10:55:26AM -0400, Brian Foster wrote:
>> It looks like you've replied to one mail and copy/pasted the text from
>> another. FWIW, it's usually best to reply to each mail independently to
>> preserve threading and whatnot.
>>
>> On Fri, Jun 09, 2017 at 10:06:26PM -0400, Sweet Tea Dorminy wrote:
>> > >What is the xfs_info for this filesystem?
>> >        meta-data=/dev/mapper/tracer-vdo0 isize=256    agcount=4, agsize=5242880 blks
>> >                 =                        sectsz=512   attr=2, projid32bit=0
>> >        data     =                        bsize=1024   blocks=20971520, imaxpct=25
>> >                 =                        sunit=0      swidth=0 blks
>> >        naming   =version 2               bsize=4096   ascii-ci=0
>> >        log      =internal                bsize=1024   blocks=10240, version=2
>> >                 =                        sectsz=512   sunit=0 blks, lazy-count=1
>> >        realtime =none                    extsz=4096   blocks=0, rtextents=0
>> >
>> > > What granularity are these A and B regions (sectors or larger)?
>> > A is 1k, B is 3k.
>> >
>> > >Are you running on some kind of special block device that reproduces this?
>> > It's an asynchronous device we are developing, which we believe
>> > obeys FLUSH and FUA correctly but may have missed some case; we
>> > encountered this issue when testing XFS on it, and other filesystems
>> > appear to work fine (although obviously we could have merely gotten
>> > lucky). Currently, when a flush returns from the device, we
>> > guarantee that the data from all bios completed before the flush was
>> > issued is stably on disk; when a write+FUA bio returns from the
>> > device, the data in that bio (only) is guaranteed to be stable on
>> > disk. The device may, however, commit sequentially issued write+FUA
>> > bios to disk in an arbitrary order.
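
(For concreteness, our guarantee can be modeled with a toy sketch like
the one below. It is purely illustrative -- the names are invented, not
from our actual driver -- and it models bio completion as immediate and
a crash as losing every unflushed, non-FUA write.)

    /* Toy model of the device's ordering guarantees (illustration only). */
    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    #define NSECTORS 8

    static char stable[NSECTORS];   /* what survives a crash         */
    static char cache[NSECTORS];    /* volatile write cache          */
    static bool dirty[NSECTORS];    /* written but not yet stable    */

    static void write_sector(int s, char data, bool fua)
    {
        cache[s] = data;
        if (fua)
            stable[s] = data;       /* FUA: this bio alone is stable */
        else
            dirty[s] = true;        /* may reach media in any order  */
    }

    static void flush(void)
    {
        /* FLUSH: every bio completed before it becomes stable. */
        for (int s = 0; s < NSECTORS; s++) {
            if (dirty[s]) {
                stable[s] = cache[s];
                dirty[s] = false;
            }
        }
    }

    static void crash(void)
    {
        /* Worst case: no volatile write made it to media. */
        memcpy(cache, stable, sizeof(cache));
        memset(dirty, 0, sizeof(dirty));
    }

    int main(void)
    {
        write_sector(0, 'A', false);    /* A1, plain write           */
        flush();                        /* A1 now stable             */
        write_sector(0, 'a', false);    /* A2 overwrites, volatile   */
        write_sector(1, 'B', true);     /* B2, FUA: stable on return */
        crash();

        printf("%c %c\n", stable[0], stable[1]);    /* "A B": torn   */
        return 0;
    }

Running it prints "A B": old A1 next to new B2, which is exactly the
torn state we see on disk.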
>> >
>> > > Do you have a consistent reproducer and/or have you
>> > > reproduced on an upstream kernel?
>> > Our reproducer fails about 20% of the time. We have not tried on an
>> > upstream kernel.
>> >
>>
>> Please do try on something upstream if possible. Also, what exactly
>> triggers the failure that ultimately leads to log recovery on the next
>> mount? Are you doing an XFS shutdown or shutting down the block device
>> underneath the fs?
>>
>> > >Could you provide an xfs_metadump image of the filesystem that fails log recovery with CRC errors?
>> > I can capture such on Monday.
>> > For now, just the journal (gathered with xfs_logprint -C fsLog) can be
>> > found at (10M)
>> > https://s3.amazonaws.com/permabit-development/20170609-xfsUpload/fsLog
>> > .
>>
>> Thanks. This certainly looks a bit strange. The log is populated
>> with cycle 2 LSNs up through sector 19692. After that comes the
>> previous cycle 1, which suggests that is the head of the log. The
>> tail then points to 1:20105, but from that point forward a strange
>> mix of cycle 1 and cycle 2 LSNs is seen.
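
(For anyone following along: the N:M values here are LSNs, i.e.
cycle:block pairs packed into a 64-bit value, roughly as in the sketch
below. This is paraphrased, so treat the macro spellings as
approximate rather than the kernel's exact definitions.)

    /* LSN = cycle:block packed into 64 bits (paraphrased). */
    #include <stdint.h>
    #include <stdio.h>

    typedef int64_t xfs_lsn_t;

    #define CYCLE_LSN(lsn)  ((uint32_t)((uint64_t)(lsn) >> 32))
    #define BLOCK_LSN(lsn)  ((uint32_t)(lsn))

    static xfs_lsn_t make_lsn(uint32_t cycle, uint32_t block)
    {
        return ((xfs_lsn_t)cycle << 32) | block;
    }

    int main(void)
    {
        xfs_lsn_t tail = make_lsn(1, 20105);    /* the "1:020105" above */

        printf("tail = %u:%06u\n", CYCLE_LSN(tail), BLOCK_LSN(tail));
        return 0;
    }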
>>
>> It's not totally clear to me from the physical log state whether this
>> was the in-core state at the time of the crash, but a first guess may be
>> that at least this is what log recovery thinks and thus complains about
>> the tail of the log being borked.
>>
>
> Another observation and possible theory... the last two cycle 2 headers
> in the log are:
>
>  20116 HEADER Cycle 2 tail 2:010284 len  32256 ops 288
>  ...
>  20180 HEADER Cycle 2 tail 2:010284 len  32256 ops 261
>
> ... which appear sanely spaced apart. They are mixed with cycle 1/2
> changes, which seems to correspond with your observation that parts of
> the 32k log buffers are succeeding and parts are failing. IOW, these log
> records are effectively incomplete on disk.
>
> That accounts for 2 of 8 log buffers. If we assume these are the "last"
> two log buffers, each log buffer is full and also spaced 64 sectors
> apart, then the first log buffer would be targeting log sector (20116 -
> 6 * 64) = 19732. That refers to the head of the on-disk log noted above:
>
>  ...
>  19628 HEADER Cycle 2 tail 1:020105 len  32256 ops 327
>  19692 HEADER Cycle 2 tail 1:020105 len  19968 ops 89
>  19732 HEADER Cycle 1 tail 1:020105 len      0 ops 0
> [00000 - 19732] Cycle 0x00000002 New Cycle 0x00000001
>  19733 HEADER Cycle 1 tail 1:020105 len      0 ops 0
>  ...
>
> So I'm wondering whether it's possible that all of the log buffers
> were dirty at the time of the crash, the last two were the only ones
> that made it to disk, and they only made it partially at that. Further,
> in partially writing those last two bufs at 20116 and beyond, we've
> begun to overwrite the tail pointed to by the last record on disk before
> the head (tail 1:020105 -> cycle 1 sector 20105), corrupting that
> reference as well.
>
> There are also a bunch of the len == 0 headers at 19733 and beyond,
> which I'm not quite clear on. I'm wondering if log recovery might
> actually do that after it settles on a log head. That could mean there
> actually was more cycle 2 data from other log bufs in that range but it
> was all detected as invalid and cleared out during log recovery.
>
> When you grab the metadump to provide, can you make sure you do so
> _before_ you attempt log recovery on the filesystem for the first time?
> E.g., reproduce crash, umount & create metadump, mount and test for
> corruption, if reproduce -> send metadump, else toss metadump & repeat.
> This is because the first attempt at log recovery may modify the
> physical log and cause some information loss if we take the metadump
> after you've already attempted recovery.
>
> Brian
>
> P.S., I've also been assuming based on the log state that you haven't
> used the logbufs/logbsize mount options to change from the defaults, but
> please do confirm that you aren't changing either of those defaults.
> Thanks.
>
>> > A log of the journal writes can be found at (17M)
>> > https://s3.amazonaws.com/permabit-development/20170609-xfsUpload/log_writes_only.blkparse.
>> > It is in a blkparse-like format. For each 512-byte sector of a bio,
>> > at both submission and completion, it records the hash of the data,
>> > the sector number, the sector's index within the bio, and the total
>> > number of sectors in the bio. Bios recorded as "FAILED" indicate
>> > that the device has crashed / become disconnected and the bio has
>> > returned with an error.
>> >
>> > >From there, it searches a previous number of blocks
>> > based on the maximum log buffer concurrency allowed by the fs to
>> > determine whether any such "holes" exist in that range. If so, the head
>> > is walked back to the first instance of such a "hole," effectively
>> > working around out of order buffer completion at the time of a
>> > filesystem crash.
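
(To check my understanding of that walk-back, here is how I picture
it -- my own toy C, not the actual recovery code:)

    /* Toy head walk-back: move the head behind the first "hole". */
    #include <stdint.h>
    #include <stdio.h>

    #define LOG_BLOCKS 64

    static uint32_t cycle_of[LOG_BLOCKS];   /* cycle stamped per block */

    static uint64_t walk_back_head(uint64_t head, uint64_t range,
                                   uint32_t expected)
    {
        uint64_t start = head > range ? head - range : 0;

        /* Scan the window before the candidate head for blocks still
         * carrying a stale cycle: a log buffer write that never made
         * it to disk leaves such a hole. */
        for (uint64_t b = start; b < head; b++) {
            if (cycle_of[b] != expected)
                return b;           /* head moves back to first hole */
        }
        return head;                /* no holes: head stands */
    }

    int main(void)
    {
        for (int b = 0; b < 40; b++)
            cycle_of[b] = 2;        /* current cycle's writes        */
        for (int b = 40; b < LOG_BLOCKS; b++)
            cycle_of[b] = 1;        /* not yet overwritten           */
        cycle_of[30] = 1;           /* torn: one buffer never landed */

        /* Prints 30: nothing past the hole is trusted. */
        printf("head: %llu\n",
               (unsigned long long)walk_back_head(40, 16, 2));
        return 0;
    }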
>> >
>> > In the case logged and linked above, there are 256k of outstanding
>> > log write bytes at once; 187k of these fail and 69k succeed. The
>> > 69k that succeed are always the first 1k of the 4k block to which
>> > they belong. Is this within the permitted amount of outstanding log
>> > buffers?
>> >
>>
>> Generally, yes. Log recovery uses the maximum in-core log buffer count
>> (8) and the maximum supported log buffer size (256k) to determine how
>> far to look back (2M). The default log buffer size is 32k, so I suspect
>> 256k means every log buffer was in-flight at this time.
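
(So the window works out as below; the constant names here are mine,
not the kernel's.)

    /* Back-of-envelope for the 2M look-back distance. */
    #define MAX_LOG_BUFS      8                  /* max in-core log buffers */
    #define MAX_LOG_BUF_SIZE  (256 * 1024)       /* max logbsize            */
    #define LOOKBACK_BYTES    (MAX_LOG_BUFS * MAX_LOG_BUF_SIZE)  /* 2 MiB   */
    #define LOOKBACK_SECTORS  (LOOKBACK_BYTES / 512)             /* 4096    */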
>>
>> When you reproduce this and presumably provide a metadump, could you
>> also collect the /sys/fs/xfs/<dev>/log/* state information at the time
>> of the crash of that particular metadump? Note that this directory
>> disappears after unmount so you'll have to collect it before.
>>
>> Brian
>>
>> > Thanks!
>> >
>> > Sweet Tea
>> >
>> > On Fri, Jun 9, 2017 at 7:44 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>> > > On Thu, Jun 08, 2017 at 11:42:11AM -0400, Sweet Tea Dorminy wrote:
>> > >> Greetings;
>> > >>
>> > >> When using XFS with a 1k block size atop our device, we regularly get
>> > >> "log record CRC mismatch"es when mounting XFS after a crash, and we
>> > >> are attempting to understand why. We are using RHEL7.3 with its kernel
>> > >> 3.10.0-514.10.2.el7.x86_64, xfsprogs version 4.5.0.
>> > >>
>> > >> Tracing indicates the following situation occurs:
>> > >>        Some pair of consecutive locations contains data A1 and B1, respectively.
>> > >>        The XFS journal issues new writes to those locations,
>> > >> containing data A2 and B2.
>> > >> The write of B2 finishes, but A2 is still outstanding at the
>> > >> time of the crash.
>> > >>        Crash occurs. The data on disk is A1 and B2, respectively.
>> > >>        XFS fails to mount, complaining that the checksum mismatches.
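
(That checksum failure is what I'd expect if the record CRC covers the
whole record: with stale A1 sitting next to new B2, a check along the
lines below necessarily fails. XFS uses crc32c; the bitwise version
here is only for illustration.)

    /* A torn record (old A1 + new B2) fails the stored record CRC. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Minimal bitwise CRC32C (Castagnoli, reflected polynomial). */
    static uint32_t crc32c(uint32_t crc, const void *buf, size_t len)
    {
        const uint8_t *p = buf;

        crc = ~crc;
        while (len--) {
            crc ^= *p++;
            for (int k = 0; k < 8; k++)
                crc = (crc >> 1) ^ (0x82F63B78 & -(crc & 1));
        }
        return ~crc;
    }

    int main(void)
    {
        char written[4096], ondisk[4096];

        memset(written, 'A', 1024);             /* A2: first 1k      */
        memset(written + 1024, 'B', 3072);      /* B2: last 3k       */
        uint32_t stored = crc32c(0, written, sizeof(written));

        memset(ondisk, 'a', 1024);              /* torn: old A1 ...  */
        memset(ondisk + 1024, 'B', 3072);       /* ... plus new B2   */
        uint32_t found = crc32c(0, ondisk, sizeof(ondisk));

        printf("stored 0x%08x found 0x%08x -> %s\n", stored, found,
               stored == found ? "ok" : "CRC mismatch");
        return 0;
    }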
>> > >>
>> > >> Does XFS expect sequentially issued journal IO to be committed to disk
>> > >> in the order of issuance due to the use of FUA?
>> > >
>> > > Journal IO is not sequentially issued. It's an async process. At
>> > > runtime, ordering is handled by journal IO completion processing
>> > > being queued and run in order, so IOs can both be issued and
>> > > physically complete out of order.
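
(To make sure I follow: something like this toy model, where physical
completion can land in any order but the completion work runs strictly
in sequence? My own sketch, not the xlog code.)

    /* Toy: IOs complete out of order; callbacks run strictly in order. */
    #include <stdbool.h>
    #include <stdio.h>

    #define NBUFS 8

    static bool io_done[NBUFS]; /* physical completion, any order */
    static int next_cb;         /* completion work runs in order  */

    static void on_io_complete(int buf)
    {
        io_done[buf] = true;

        /* Only process the contiguous completed prefix; a buffer that
         * finished "early" waits for all of its predecessors. */
        while (next_cb < NBUFS && io_done[next_cb]) {
            printf("completion processing for buf %d\n", next_cb);
            next_cb++;
        }
    }

    int main(void)
    {
        on_io_complete(1);      /* physically done first: held back  */
        on_io_complete(0);      /* now buf 0, then buf 1, in order   */
        return 0;
    }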
>> > >
>> > > Log recovery is supposed to handle this. It searches and finds the
>> > > latest contiguous journal entry and does not replay past holes that
>> > > may arise from out of order journal writes.
>> > >
>> > > CRC errors like this in recovery imply that journal writes are being
>> > > torn or not completed fully, which may mean that your storage does
>> > > not correctly implement flush/FUA ordering semantics....
>> > >
>> > > Cheers,
>> > >
>> > > Dave.
>> > > --
>> > > Dave Chinner
>> > > david@xxxxxxxxxxxxx