On Tue, Aug 12, 2014 at 03:39:02PM +0300, Alex Lyakas wrote:
> Hello Dave, Brian,
> I will describe a generic reproduction that you asked for.
>
> It was performed on pristine XFS code from 3.8.13, taken from here:
> git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
>

This seems generic enough to me. Could you try it on a more recent
kernel? Dave had mentioned there were fixes in this area of log
recovery, so a bisect might be all that is necessary to track down the
patch you need. Otherwise, we can pick up debugging from something more
recent.

Brian

> top commit being:
> commit dbf932a9b316d5b29b3e220e5a30e7a165ad2992
> Author: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx>
> Date:   Sat May 11 13:57:46 2013 -0700
>
>     Linux 3.8.13
>
> I made a single (I swear!) code change in XFS:
>
> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> index 96fcbb8..d756bf6 100644
> --- a/fs/xfs/xfs_log_recover.c
> +++ b/fs/xfs/xfs_log_recover.c
> @@ -3736,6 +3736,9 @@ xlog_recover(
>  	if ((error = xlog_find_tail(log, &head_blk, &tail_blk)))
>  		return error;
>
> +	xfs_notice(log->l_mp, "Sleep 10s before xlog_do_recover");
> +	msleep(10000);
> +
>  	if (tail_blk != head_blk) {
>  		/* There used to be a comment here:
>  		 *
>
> A fresh XFS filesystem was formatted on a 20 GB block device within a
> VM, using:
> mkfs.xfs -f -K /dev/vde -p /etc/zadara/xfs.protofile
> where:
> root@vc-00-00-1383-dev:~# cat /etc/zadara/xfs.protofile
> dummy		: bootfilename, not used, backward compatibility
> 0 0		: numbers of blocks and inodes, not used, backward compatibility
> d--777 0 0	: set 777 perms for the root dir
> $
> $
>
> I mounted XFS with the following options:
> rw,sync,noatime,wsync,attr2,inode64,noquota 0 0
>
> I started a couple of processes writing files sequentially onto this
> mount point, and after a few seconds crashed the VM.
> When the VM came up, I took the metadump file and placed it in:
> https://drive.google.com/file/d/0ByBy89zr3kJNa0ZpdmZFS242RVU/edit?usp=sharing
>
> Then I set up the following device-mapper target on top of /dev/vde:
> dmsetup create VDE --table "0 41943040 linear-custom /dev/vde 0"
> I am attaching the code (and Makefile) of the dm-linear-custom target.
> It is an exact copy of dm-linear, except that it has a module
> parameter. With the parameter set to 0, it is an identity mapping onto
> /dev/vde. If the parameter is set to non-0, all WRITE bios are failed
> with ENOSPC. There is a workqueue to fail them in a different context
> (not sure if that is really needed, but that's what our "real" custom
> block device does).
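For reference, the interesting part of such a target is small. Here is
a minimal sketch of the map hook (not Alex's actual attachment),
written against the 3.8-era device-mapper API; struct linear_c is
assumed to match dm-linear's private data, the ctr/dtr and
dm_register_target() boilerplate is omitted, and the bio is failed
inline rather than from a workqueue as Alex describes:

#include <linux/module.h>
#include <linux/bio.h>
#include <linux/device-mapper.h>

static int fail_writes;		/* 0: identity mapping; non-0: fail WRITEs */
module_param(fail_writes, int, 0644);

struct linear_c {		/* dm-linear's private data (assumed) */
	struct dm_dev	*dev;
	sector_t	start;
};

static int linear_custom_map(struct dm_target *ti, struct bio *bio)
{
	struct linear_c *lc = ti->private;

	if (fail_writes && bio_data_dir(bio) == WRITE) {
		/* complete the bio with an error instead of remapping it */
		bio_endio(bio, -ENOSPC);
		return DM_MAPIO_SUBMITTED;
	}

	/* otherwise behave exactly like dm-linear */
	bio->bi_bdev = lc->dev->bdev;
	if (bio_sectors(bio))
		bio->bi_sector = lc->start +
				 dm_target_offset(ti, bio->bi_sector);
	return DM_MAPIO_REMAPPED;
}

With fail_writes left at its default of 0, the table above is an
identity mapping onto /dev/vde, so it can be loaded before the mount
and the failure mode flipped at any time through
/sys/module/dm_linear_custom/parameters/fail_writes.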
>
> Now I did:
> mount -o noatime,sync /dev/mapper/VDE /mnt/xfs
>
> The log recovery flow went into the sleep that I added, and then I did:
> echo 1 > /sys/module/dm_linear_custom/parameters/fail_writes
>
> Problem reproduced:
> Aug 12 14:23:04 vc-00-00-1383-dev kernel: [ 175.000657] XFS (dm-0): Mounting Filesystem
> Aug 12 14:23:04 vc-00-00-1383-dev kernel: [ 175.026991] XFS (dm-0): Sleep 10s before xlog_do_recover
> Aug 12 14:23:14 vc-00-00-1383-dev kernel: [ 185.028113] XFS (dm-0): Starting recovery (logdev: internal)
> Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556622] XFS (dm-0): metadata I/O error: block 0x2 ("xlog_recover_iodone") error 28 numblks 1
> Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556675] XFS (dm-0): metadata I/O error: block 0x40 ("xlog_recover_iodone") error 28 numblks 16
> Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556680] XFS (dm-0): xfs_do_force_shutdown(0x1) called from line 377 of file /mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c. Return address = 0xffffffffa0349f68
> Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556683] XFS (dm-0): I/O Error Detected. Shutting down filesystem
> Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556684] XFS (dm-0): Please umount the filesystem and rectify the problem(s)
> Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556766] XFS (dm-0): metadata I/O error: block 0xa00002 ("xlog_recover_iodone") error 5 numblks 1
> Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556769] XFS (dm-0): xfs_do_force_shutdown(0x1) called from line 377 of file /mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c. Return address = 0xffffffffa0349f68
> Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556771] XFS (dm-0): metadata I/O error: block 0xa00008 ("xlog_recover_iodone") error 5 numblks 8
> Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556774] XFS (dm-0): xfs_do_force_shutdown(0x1) called from line 377 of file /mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c. Return address = 0xffffffffa0349f68
> Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556776] XFS (dm-0): metadata I/O error: block 0xa00010 ("xlog_recover_iodone") error 5 numblks 8
> Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556779] XFS (dm-0): xfs_do_force_shutdown(0x1) called from line 377 of file /mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c. Return address = 0xffffffffa0349f68
> Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556781] XFS (dm-0): metadata I/O error: block 0xa00018 ("xlog_recover_iodone") error 5 numblks 8
> Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556783] XFS (dm-0): xfs_do_force_shutdown(0x1) called from line 377 of file /mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c. Return address = 0xffffffffa0349f68
> Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556785] XFS (dm-0): metadata I/O error: block 0xa00040 ("xlog_recover_iodone") error 5 numblks 16
> Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556788] XFS (dm-0): xfs_do_force_shutdown(0x1) called from line 377 of file /mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c. Return address = 0xffffffffa0349f68
> Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556790] XFS (dm-0): metadata I/O error: block 0xa00050 ("xlog_recover_iodone") error 5 numblks 16
> Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556793] XFS (dm-0): xfs_do_force_shutdown(0x1) called from line 377 of file /mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c. Return address = 0xffffffffa0349f68
> Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556880] XFS (dm-0): metadata I/O error: block 0xa00001 ("xlog_recover_iodone") error 28 numblks 1
> Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556884] XFS (dm-0): xfs_do_force_shutdown(0x1) called from line 377 of file /mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c. Return address = 0xffffffffa0349f68
> Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556977] XFS (dm-0): log mount/recovery failed: error 28
> Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.557215] XFS (dm-0): log mount failed
> Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.573194] XFS (): xfs_do_force_shutdown(0x1) called from line 377 of file /mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c. Return address = 0xffffffffa0349f68
> Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.573214] XFS (): metadata I/O error: block 0x18 ("xlog_recover_iodone") error 28 numblks 8
> Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.574685] XFS (): xfs_do_force_shutdown(0x1) called from line 377 of file /mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c. Return address = 0xffffffffa0349f68
>
> As you can see, after the mount completes, I/O callbacks are still
> arriving and printing junk (an empty string in this case). Immediately
> after that the kernel dies.
>
> Is this description generic enough?
>
> Thanks,
> Alex.
>
> -----Original Message-----
> From: Brian Foster
> Sent: 12 August, 2014 3:03 PM
> To: Dave Chinner
> Cc: Alex Lyakas; xfs@xxxxxxxxxxx
> Subject: Re: use-after-free on log replay failure
>
> On Tue, Aug 12, 2014 at 07:52:07AM +1000, Dave Chinner wrote:
> >On Mon, Aug 11, 2014 at 09:20:57AM -0400, Brian Foster wrote:
> >> On Sun, Aug 10, 2014 at 03:20:50PM +0300, Alex Lyakas wrote:
> >> > On Wed, Aug 6, 2014 at 6:20 PM, Brian Foster <bfoster@xxxxxxxxxx> wrote:
> >> > > On Wed, Aug 06, 2014 at 03:52:03PM +0300, Alex Lyakas wrote:
> >.....
> >> > >> But I believe, my analysis shows that during the mount sequence
> >> > >> XFS does not wait properly for all the bios to complete, before
> >> > >> failing the mount sequence back to the caller.
> >> > >>
> >> > >
> >> > > As an experiment, what about the following? Compile tested only
> >> > > and not safe for general use.
> >...
> >> > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> >> > > index cd7b8ca..fbcf524 100644
> >> > > --- a/fs/xfs/xfs_buf.c
> >> > > +++ b/fs/xfs/xfs_buf.c
> >> > > @@ -1409,19 +1409,27 @@ xfs_buf_iorequest(
> >> > >   * case nothing will ever complete. It returns the I/O error code, if any, or
> >> > >   * 0 if there was no error.
> >> > >   */
> >> > > -int
> >> > > -xfs_buf_iowait(
> >> > > -	xfs_buf_t		*bp)
> >> > > +static int
> >> > > +__xfs_buf_iowait(
> >> > > +	struct xfs_buf		*bp,
> >> > > +	bool			skip_error)
> >> > >  {
> >> > >  	trace_xfs_buf_iowait(bp, _RET_IP_);
> >> > >
> >> > > -	if (!bp->b_error)
> >> > > +	if (skip_error || !bp->b_error)
> >> > >  		wait_for_completion(&bp->b_iowait);
> >> > >
> >> > >  	trace_xfs_buf_iowait_done(bp, _RET_IP_);
> >> > >  	return bp->b_error;
> >> > >  }
> >> > >
> >> > > +int
> >> > > +xfs_buf_iowait(
> >> > > +	struct xfs_buf		*bp)
> >> > > +{
> >> > > +	return __xfs_buf_iowait(bp, false);
> >> > > +}
> >> > > +
> >> > >  xfs_caddr_t
> >> > >  xfs_buf_offset(
> >> > >  	xfs_buf_t		*bp,
> >> > > @@ -1866,7 +1874,7 @@ xfs_buf_delwri_submit(
> >> > >  		bp = list_first_entry(&io_list, struct xfs_buf, b_list);
> >> > >
> >> > >  		list_del_init(&bp->b_list);
> >> > > -		error2 = xfs_buf_iowait(bp);
> >> > > +		error2 = __xfs_buf_iowait(bp, true);
> >> > >  		xfs_buf_relse(bp);
> >> > >  		if (!error)
> >> > >  			error = error2;
>
> >Not waiting here on buffer error should not matter. Any buffer that
> >is under IO and requires completion should be referenced, and that
> >means it should be caught and waited on by xfs_wait_buftarg() in the
> >mount failure path after log recovery fails.
> >
>
> I think that assumes the I/O is successful. Looking through
> xlog_recover_buffer_pass2() as an example, we read the buffer, which
> should return with b_hold == 1. The delwri queue increments the hold
> and we xfs_buf_relse() in the return path (i.e., the buffer is now
> held by the delwri queue awaiting submission).
>
> Some time later we delwri_submit()... xfs_buf_iorequest() does an
> xfs_buf_hold() and xfs_buf_rele() within that single function. The
> delwri_submit() releases its hold after xfs_buf_iowait(), at which
> point, I guess, bp should go onto the LRU (b_hold back to 1 in
> xfs_buf_rele()). Indeed, the caller has lost scope of the buffer at
> this point.
>
> So unless I missed something or got the lifecycle wrong here, which is
> easily possible ;), this all hinges on xfs_buf_iowait(). That's where
> the last hold forcing the buffer to stay around goes away.
> xfs_buftarg_wait_rele() will dispose of the buffer if b_hold == 1. If
> xfs_buf_iowait() is racy in the event of I/O errors via the bio
> callback, I think this path is susceptible just the same.
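To make the suspected sequence concrete, here is a deliberately buggy
userspace model of the race described above. This is an illustration
only, not XFS code; the names merely echo the kernel ones. The waiter
sees b_error already set, skips the wait, and drops the last hold while
the "workqueue" iodone callback is still pending; the callback then
reads freed memory, producing exactly the kind of empty-name "XFS ():"
junk seen in the log above. Building with cc -pthread
-fsanitize=address makes the stale access explicit:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Stand-in for struct xfs_buf: only the fields the race needs. */
struct fake_buf {
	int	b_error;	/* error recorded when the bio failed */
	char	name[16];	/* what the late iodone callback prints */
};

/* Models the iodone callback running from the buffer workqueue. */
static void *late_iodone(void *arg)
{
	struct fake_buf *bp = arg;

	usleep(100 * 1000);	/* runs after the waiter has returned */

	/* Use-after-free: the last hold is gone and bp has been freed. */
	printf("XFS (%s): metadata I/O error %d\n", bp->name, bp->b_error);
	return NULL;
}

int main(void)
{
	struct fake_buf *bp = calloc(1, sizeof(*bp));
	pthread_t wq;

	strcpy(bp->name, "dm-0");
	bp->b_error = 28;	/* ENOSPC recorded by the failed write */
	pthread_create(&wq, NULL, late_iodone, bp);

	/*
	 * The xfs_buf_iowait() step: b_error is already set, so the
	 * wait_for_completion() is skipped and the caller returns
	 * before the callback above has run.
	 */
	if (bp->b_error)
		fprintf(stderr, "iowait: b_error=%d, not waiting\n",
			bp->b_error);

	/* The last rele/dispose: free the buffer with I/O in flight. */
	memset(bp, 0, sizeof(*bp));	/* reused memory: name is now "" */
	free(bp);

	pthread_join(wq, NULL);	/* only so the demo can print; the kernel
				   has no such join, hence the crash */
	return 0;
}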
>
> >> > > ---
> >> > I think that this patch fixes the problem. I tried reproducing it
> >> > like 30 times, and it doesn't happen with this patch. Dropping
> >> > this patch reproduces the problem within 1 or 2 tries. Thanks!
> >> > What are the next steps? How to make it "safe for general use"?
> >> >
> >>
> >> Ok, thanks for testing. I think that implicates the caller bypassing
> >> the expected blocking with the right sequence of log recovery I/Os
> >> and device failure. TBH, I'd still like to see the specifics, if
> >> possible. Could you come up with a generic reproducer for this? I
> >> think a metadump of the fs with the dirty log plus whatever device
> >> failure simulation hack you're using would suffice.
> >
> >The real issue is we don't know exactly what code is being tested
> >(it's 3.8 + random bug fix backports + custom code). Even if we have
> >a reproducer there's no guarantee it will reproduce on a current
> >kernel. IOWs, we are stumbling around in the dark bashing our heads
> >against everything in the room, and that just wastes everyone's
> >time.
> >
> >We need a reproducer that triggers on a current, unmodified kernel
> >release. You can use dm-flakey to error out all writes, just like you
> >are doing with your custom code. See xfstests::tests/generic/321 and
> >common/dmflakey for how to do this. Ideally the reproducer is in a
> >form that xfstests can use....
> >
> >If you can't reproduce it on an upstream kernel, then git bisect is
> >your friend. It will find the commit that fixed the problem you are
> >seeing....
> >
>
> Ugh, yeah. The fact that this was customized as such apparently went
> over my head. I agree completely. This needs to be genericized to a
> pristine, preferably current, kernel. The experiment patch could be
> papering over something completely different.
>
> >> The ideal fix is not yet clear to me.
> >
> >We are not even that far along - the root cause of the bug is not at
> >all clear to me. :/
> >
>
> Yeah.. the above was just the theory that motivated the experiment in
> the previously posted patch. It of course remains a theory until we
> can see the race in action. I was referring to the potential fix for
> the raciness of xfs_buf_iowait() with regard to bio errors and the wq
> iodone handling, while still asking for a reproducer to confirm the
> actual problem. FWIW, I'm not too high on changes in the buf
> management code, even a smallish behavior change, without a real
> trace of some sort that documents the problem and justifies the
> change.
>
> Brian
>
> >Cheers,
> >
> >Dave.
> >--
> >Dave Chinner
> >david@xxxxxxxxxxxxx

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs