Apologies for the delay; reproducing took much longer than expected. Here is a full metadump pre- and post-recovery:
https://s3.amazonaws.com/permabit-development/20170615-xfsUpload/metadump-post-recovery
https://s3.amazonaws.com/permabit-development/20170615-xfsUpload/metadump-pre-recovery

I also hit a different, less common problem, which may or may not have the same origin; I'm hopeful it shares the same root cause and helps with diagnosis. The symptom is that the xfs mount fails with "Structure needs cleaning". The backtrace in the kernel log is of the form:

Nov 12 02:43:03 localhost kernel: [ 112.897334] XFS (dm-4): Internal error xlog_valid_rec_header(1) at line 4104 of file fs/xfs/xfs_log_recover.c. Caller xlog_do_recovery_pass+0x4fc/0x5a0 [xfs]
Nov 12 02:43:03 localhost kernel: [ 112.897339] CPU: 0 PID: 2435 Comm: mount Tainted: P OE ------------ 3.10.0-327.36.2.el7.x86_64 #1
Nov 12 02:43:03 localhost kernel: [ 112.897342] ffff880067cdc000 00000000fa6da3c8 ffff88003c9fbb20 ffffffff81636401
Nov 12 02:43:03 localhost kernel: [ 112.897347] ffff88003c9fbb38 ffffffffa0360e6b ffffffffa03891fc ffff88003c9fbb48
Nov 12 02:43:03 localhost kernel: [ 112.897350] ffffffffa0385458 ffff88003c9fbc50 ffffffffa03891fc ffffffff81cae082
Nov 12 02:43:03 localhost kernel: [ 112.897353] Call Trace:
Nov 12 02:43:03 localhost kernel: [ 112.897366] [<ffffffff81636401>] dump_stack+0x19/0x1b
Nov 12 02:43:03 localhost kernel: [ 112.897387] [<ffffffffa0360e6b>] xfs_error_report+0x3b/0x40 [xfs]
Nov 12 02:43:03 localhost kernel: [ 112.897409] [<ffffffffa03891fc>] ? xlog_do_recovery_pass+0x4fc/0x5a0 [xfs]
Nov 12 02:43:03 localhost kernel: [ 112.897430] [<ffffffffa0385458>] xlog_valid_rec_header.isra.12+0xd8/0xe0 [xfs]
Nov 12 02:43:03 localhost kernel: [ 112.897451] [<ffffffffa03891fc>] xlog_do_recovery_pass+0x4fc/0x5a0 [xfs]
Nov 12 02:43:03 localhost kernel: [ 112.897458] [<ffffffff8107e719>] ? vprintk_default+0x29/0x40
Nov 12 02:43:03 localhost kernel: [ 112.897462] [<ffffffff8162fe31>] ? printk+0x5e/0x75
Nov 12 02:43:03 localhost kernel: [ 112.897467] [<ffffffff811c25d3>] ? __kmalloc+0x1f3/0x230
Nov 12 02:43:03 localhost kernel: [ 112.897488] [<ffffffffa037bf57>] ? kmem_alloc+0x77/0xf0 [xfs]
Nov 12 02:43:03 localhost kernel: [ 112.897510] [<ffffffffa0389309>] xlog_do_log_recovery+0x69/0xd0 [xfs]
Nov 12 02:43:03 localhost kernel: [ 112.897530] [<ffffffffa0389386>] xlog_do_recover+0x16/0xe0 [xfs]
Nov 12 02:43:03 localhost kernel: [ 112.897549] [<ffffffffa038a19f>] xlog_recover+0x9f/0x130 [xfs]
Nov 12 02:43:03 localhost kernel: [ 112.897569] [<ffffffffa037e01c>] xfs_log_mount+0x24c/0x2c0 [xfs]
Nov 12 02:43:03 localhost kernel: [ 112.897589] [<ffffffffa037577c>] xfs_mountfs+0x42c/0x750 [xfs]
Nov 12 02:43:03 localhost kernel: [ 112.897609] [<ffffffffa03789a5>] xfs_fs_fill_super+0x355/0x3d0 [xfs]
Nov 12 02:43:03 localhost kernel: [ 112.897615] [<ffffffff811e26c0>] mount_bdev+0x1b0/0x1f0
Nov 12 02:43:03 localhost kernel: [ 112.897634] [<ffffffffa0378650>] ? xfs_parseargs+0xbf0/0xbf0 [xfs]
Nov 12 02:43:03 localhost kernel: [ 112.897653] [<ffffffffa03768c5>] xfs_fs_mount+0x15/0x20 [xfs]
Nov 12 02:43:03 localhost kernel: [ 112.897657] [<ffffffff811e2e59>] mount_fs+0x39/0x1b0

The full dump of _this_ failure is at:
https://s3.amazonaws.com/permabit-development/20170613-xfsUpload/metadump-pre-recovery
https://s3.amazonaws.com/permabit-development/20170613-xfsUpload/metadump-post-recovery

> P.S., I've also been assuming based on the log state that you haven't
> used the logbufs/logbsize mount options to change from the defaults, but
> please do confirm that you aren't changing either of those defaults.
> Thanks.

Correct. The mkfs invocation is 'mkfs -t xfs -f -q -K -b size=1024 $device' and the mount invocation is "mount -t xfs -o discard $device $mountpoint".

Thanks again!
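For completeness, here is roughly the capture loop we intend to use, following the metadump-before-first-recovery and sysfs collection steps you describe below. The device path, mount point, and output file names are placeholders for our test rig, and the readlink step is just our guess at mapping the dm device to its sysfs directory name (dm-4 in the dmesg above):

  # rough sketch of the repro/capture loop; paths are placeholders
  dev=/dev/mapper/tracer-vdo0
  mnt=/mnt/xfs-test

  mkfs -t xfs -f -q -K -b size=1024 $dev
  mount -t xfs -o discard $dev $mnt

  # ... run the workload, then crash/disconnect the device ...

  # grab the sysfs log state while still mounted (the directory goes
  # away at unmount); dm devices appear under their dm-N name
  grep . /sys/fs/xfs/$(basename $(readlink -f $dev))/log/* > log-state.txt

  umount $mnt

  # take the metadump BEFORE the first post-crash mount attempt, so log
  # recovery has not yet modified the physical log
  xfs_metadump $dev metadump-pre-recovery

  # then attempt the mount; keep the metadump only if recovery fails,
  # otherwise toss it and repeat
  mount -t xfs -o discard $dev $mnt || echo "reproduced; keep metadump"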
On Mon, Jun 12, 2017 at 12:18 PM, Brian Foster <bfoster@xxxxxxxxxx> wrote:
> On Mon, Jun 12, 2017 at 10:55:26AM -0400, Brian Foster wrote:
>> It looks like you've replied to one mail and copy/pasted the text from
>> another. FWIW, it's usually best to reply to each mail independently to
>> preserve threading and whatnot.
>>
>> On Fri, Jun 09, 2017 at 10:06:26PM -0400, Sweet Tea Dorminy wrote:
>> > > What is the xfs_info for this filesystem?
>> > meta-data=/dev/mapper/tracer-vdo0 isize=256    agcount=4, agsize=5242880 blks
>> >          =                        sectsz=512   attr=2, projid32bit=0
>> > data     =                        bsize=1024   blocks=20971520, imaxpct=25
>> >          =                        sunit=0      swidth=0 blks
>> > naming   =version 2               bsize=4096   ascii-ci=0
>> > log      =internal                bsize=1024   blocks=10240, version=2
>> >          =                        sectsz=512   sunit=0 blks, lazy-count=1
>> > realtime =none                    extsz=4096   blocks=0, rtextents=0
>> >
>> > > What granularity are these A and B regions (sectors or larger)?
>> > A is 1k, B is 3k.
>> >
>> > > Are you running on some kind of special block device that reproduces this?
>> > It's a device we are developing: it is asynchronous, and we believe it
>> > obeys FLUSH and FUA correctly, but we may have missed some case. We
>> > encountered this issue when testing an XFS filesystem on it; other
>> > filesystems appear to work fine (although obviously we could have
>> > merely gotten lucky). Currently, when a flush returns from the device,
>> > we guarantee the data from all bios completed before the flush was
>> > issued is stably on disk; when a write+FUA bio returns from the
>> > device, the data in that bio (only) is guaranteed to be stable on disk.
>> > The device may, however, commit sequentially issued write+FUA bios to
>> > disk in an arbitrary order.
>> >
>> > > Do you have a consistent reproducer and/or have you reproduced on an upstream kernel?
>> > Our reproducer fails about 20% of the time. We have not tried on an
>> > upstream kernel.
>> >
>>
>> Please do try on something upstream if possible. Also, what exactly
>> triggers the failure that ultimately leads to log recovery on the next
>> mount? Are you doing an XFS shutdown or shutting down the block device
>> underneath the fs?
>>
>> > > Could you provide an xfs_metadump image of the filesystem that fails log recovery with CRC errors?
>> > I can capture such on Monday. For now, just the journal (gathered
>> > with xfs_logprint -C fsLog) can be found at (10M):
>> > https://s3.amazonaws.com/permabit-development/20170609-xfsUpload/fsLog
>> >
>>
>> Thanks. This certainly looks a bit strange. Up through sector 19692 the
>> log is populated with cycle 2 LSNs. After that appears to be the
>> previous cycle 1, which suggests that is the head of the log. The tail
>> then points to 1:20105, but from that point forward a strange mix of
>> cycle 1 and cycle 2 LSNs are seen.
>>
>> It's not totally clear to me from the physical log state whether this
>> was the in-core state at the time of the crash, but a first guess may be
>> that at least this is what log recovery thinks and thus complains about
>> the tail of the log being borked.
>>
>
> Another observation and possible theory... the last two cycle 2 headers
> in the log are:
>
> 20116 HEADER Cycle 2 tail 2:010284 len 32256 ops 288
> ...
> 20180 HEADER Cycle 2 tail 2:010284 len 32256 ops 261
>
> ... which appear sanely spaced apart. They are mixed with cycle 1/2
> changes, which seems to correspond with your observation that parts of
> the 32k log buffers are succeeding and parts are failing. IOW, these log
> records are effectively incomplete on disk.
>
> That accounts for 2 of 8 log buffers. If we assume these are the "last"
> two log buffers, each log buffer is full, and they are spaced 64 sectors
> apart, then the first log buffer would be targeting log sector (20116 -
> 6 * 64) = 19732. That refers to the head of the on-disk log noted above:
>
> ...
> 19628 HEADER Cycle 2 tail 1:020105 len 32256 ops 327
> 19692 HEADER Cycle 2 tail 1:020105 len 19968 ops 89
> 19732 HEADER Cycle 1 tail 1:020105 len 0 ops 0
> [00000 - 19732] Cycle 0x00000002 New Cycle 0x00000001
> 19733 HEADER Cycle 1 tail 1:020105 len 0 ops 0
> ...
>
> So I'm wondering whether it's possible that all of the log buffers
> were dirty at the time of the crash, the last two were the only ones
> that made it to disk, and they only made it partially at that. Further,
> in partially writing those last two bufs at 20116 and beyond, we've
> begun to overwrite the tail pointed to by the last record on disk before
> the head (tail 1:020105 -> cycle 1 sector 20105), corrupting that
> reference as well.
>
> There are also a bunch of the len == 0 headers at 19733 and beyond,
> which I'm not quite clear on. I'm wondering if log recovery might
> actually do that after it settles on a log head. That could mean there
> actually was more cycle 2 data from other log bufs in that range but it
> was all detected as invalid and cleared out during log recovery.
>
> When you grab the metadump to provide, can you make sure you do so
> _before_ you attempt log recovery on the filesystem for the first time?
> E.g.: reproduce crash, umount & create metadump, mount and test for
> corruption; if it reproduces -> send metadump, else toss metadump &
> repeat. This is because the first attempt at log recovery may modify the
> physical log and cause some information loss if we take the metadump
> after you've already attempted recovery.
>
> Brian
>
> P.S., I've also been assuming based on the log state that you haven't
> used the logbufs/logbsize mount options to change from the defaults, but
> please do confirm that you aren't changing either of those defaults.
> Thanks.
>
>> > A log of the journal writes can be found at (17M):
>> > https://s3.amazonaws.com/permabit-development/20170609-xfsUpload/log_writes_only.blkparse
>> > It is in a blkparse-like format. For each 512-byte sector of a bio,
>> > either starting or finishing, the data hash, the sector, and the index
>> > of this sector and the number of sectors within the current bio are
>> > recorded. Bios recorded as "FAILED" indicate that the device has
>> > crashed / become disconnected and the bio has returned with an error.
>> >
>> > > From there, it searches a previous number of blocks based on the
>> > > maximum log buffer concurrency allowed by the fs to determine whether
>> > > any such "holes" exist in that range. If so, the head is walked back
>> > > to the first instance of such a "hole," effectively working around out
>> > > of order buffer completion at the time of a filesystem crash.
>> >
>> > In the case logged and linked above, there are 256k of outstanding log
>> > write bytes at once; 187k of these fail and 69k succeed. The 69k that
>> > succeed are always the first 1k of the 4k block to which they belong.
>> > Is this within the permitted amount of outstanding log buffers?
>> >
>>
>> Generally, yes. Log recovery uses the maximum in-core log buffer count
>> (8) and the maximum supported log buffer size (256k) to determine how
>> far to look back (2M). The default log buffer size is 32k, so I suspect
>> 256k means every log buffer was in flight at this time.
>>
>> When you reproduce this and presumably provide a metadump, could you
>> also collect the /sys/fs/xfs/<dev>/log/* state information at the time
>> of the crash of that particular metadump? Note that this directory
>> disappears after unmount, so you'll have to collect it before.
>>
>> Brian
>>
>> > Thanks!
>> >
>> > Sweet Tea
>> >
>> > On Fri, Jun 9, 2017 at 7:44 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>> > > On Thu, Jun 08, 2017 at 11:42:11AM -0400, Sweet Tea Dorminy wrote:
>> > >> Greetings;
>> > >>
>> > >> When using XFS with a 1k block size atop our device, we regularly get
>> > >> "log record CRC mismatch"es when mounting XFS after a crash, and we
>> > >> are attempting to understand why. We are using RHEL7.3 with its kernel
>> > >> 3.10.0-514.10.2.el7.x86_64, xfsprogs version 4.5.0.
>> > >>
>> > >> Tracing indicates the following situation occurs:
>> > >>     Some pair of consecutive locations contains data A1 and B1, respectively.
>> > >>     The XFS journal issues new writes to those locations, containing data A2 and B2.
>> > >>     The write of B2 finishes, but A2 is still outstanding at the time of the crash.
>> > >>     Crash occurs. The data on disk is A1 and B2, respectively.
>> > >>     XFS fails to mount, complaining that the checksum mismatches.
>> > >>
>> > >> Does XFS expect sequentially issued journal IO to be committed to disk
>> > >> in the order of issuance due to the use of FUA?
>> > >
>> > > Journal IO is not sequentially issued. It's an async process. At
>> > > runtime, ordering is handled by journal IO completion processing
>> > > being queued and run in order, so IOs can both be issued and
>> > > physically complete out of order.
>> > >
>> > > Log recovery is supposed to handle this. It searches and finds the
>> > > latest contiguous journal entry and does not replay past holes that
>> > > may arise from out of order journal writes.
>> > >
>> > > CRC errors like this in recovery imply that journal writes are being
>> > > torn or not completed fully, which may mean that your storage does
>> > > not correctly implement flush/FUA ordering semantics....
>> > >
>> > > Cheers,
>> > >
>> > > Dave.
>> > > --
>> > > Dave Chinner
>> > > david@xxxxxxxxxxxxx
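To double-check the spacing and lookback arithmetic quoted above (this only restates Brian's numbers, assuming the default 32k log buffers, i.e. 64 512-byte sectors each, and the 8 in-core buffers):

  # six default-sized (64-sector) buffers back from the record at 20116
  echo $(( 20116 - 6 * 64 ))    # 19732 -- the on-disk head noted above
  # maximum lookback window used by log recovery: 8 buffers x 256k max size
  echo $(( 8 * 256 * 1024 ))    # 2097152 bytes, i.e. the 2M mentioned above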
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html