Re: [PATCH v2] xfs: byte range buffer dirty region tracking

Brian Foster <bfoster@xxxxxxxxxx> · Wed, 14 Feb 2018 13:08:07 -0500

On Wed, Feb 14, 2018 at 08:49:12AM -0800, Darrick J. Wong wrote:
> On Wed, Feb 14, 2018 at 08:09:39AM -0500, Brian Foster wrote:
> > On Wed, Feb 14, 2018 at 09:02:20AM +1100, Dave Chinner wrote:
> > > On Tue, Feb 13, 2018 at 08:15:26AM -0500, Brian Foster wrote:
> > > > On Tue, Feb 13, 2018 at 08:18:24AM +1100, Dave Chinner wrote:
> > > > > On Mon, Feb 12, 2018 at 09:26:19AM -0500, Brian Foster wrote:
> > > > > > :/ So it seems to
> > > > > > me this breaks a technically valid case in weird/subtle ways. For
> > > > > > example, why assert about last == 0, but then go on to add the range
> > > > > > anyways, explicitly not size it correctly, but then format it as if
> > > > > > nothing is wrong? If it were really wrong/invalid (which I don't think
> > > > > > it is), why not put the check in the log side and skip adding the range
> > > > > > rather than add it, skip sizing it, and then format it.
> > > > > 
> > > > > So what you're really concerned about is that I put asserts into the
> > > > > code to catch broken development code, but then allow production
> > > > > systems through without caring whether it works correctly because
> > > > > that boundary condition will never occur during runtime on
> > > > > production systems?
> > > > 
> > > > No. As already mentioned in my previous mail, I care little about the
> > > > asserts. Asserts can easily be removed if they turn out to be bogus.
> > > > Wrong asserts tend to have little negative effect on production users
> > > > because along with only affecting debug kernels, they'd have to be
> > > > fairly rare to slip through our testing. So I'm perfectly _happy_ to be
> > > > cautious with regard to asserts.
> > > > 
> > > > What I care much more about is not leaving latent bugs around in the
> > > > code. IMO, there is very rarely good enough justification to knowingly
> > > > commit buggy/fragile code to the kernel,
> > > 
> > > Hold on a minute!
> > > 
> > > I'm not asking anyone to commit buggy or fragile code. I've already
> > > fixed the off-by-one problems you've pointed out, and all I was
> > > trying to do was understand what you saw wrong with the asserts to
> > > catch a "should never happen" condition so I could change it in a
> > > way that you'd find acceptible.
> > > 
> > > There's no need to shout and rant at me....
> > > 
> > 
> > I pointed out the first-byte logging case looked broken. Rather than
> > indicate you're fixing that one way or another (as you had done for the
> > other issues we've found), you argued that "nobody does that" or "should
> > never happen."
> > 
> > Sorry for the rant and/or whether I've misinterpreted your comment, but
> > I have no way to read that other than an attempt to justify the problem,
> > particularly when in comparison every other issue was clearly noted that
> > it would be fixed. Also note that I think you've slightly misinterpreted
> > my fragile code comment to be harsher than intended (or I didn't express
> > my position clearly...). I'm simply trying to explain why I'll likely
> > not ack this patch unless that problem is fixed (e.g., because I
> > consider it fragile, independent of current behavior of outside
> > contexts).
> > 
> > I'll try to restate my position more distinctly/clearly... I consider
> > this patch broken so long as it doesn't handle the set of valid inputs
> > defined by xfs_trans_log_buf() correctly, as the current code does. I
> > expect any first <= last range to DTRT and ensure the associated chunk
> > of the buffer is logged.
> > 
> > > > ... having said all that and having already wasted more time on this
> > > > than it would have taken for you to just fix the patch, I'll end my rant
> > > > with this splat[1]. It demonstrates the "boundary condition" that "will
> > > > never occur during runtime on production systems" (production system
> > > > level output included for extra fun ;P).
> > > 
> > > This is a pre-existing bug in xlog_cil_insert_format_items()
> > > that my change has exposed:
> > > 
> > >                 /* Skip items that do not have any vectors for writing */
> > > 		if (!shadow->lv_niovecs && !ordered)
> > > 			continue;
> > > 
> > > The code I added triggers this (niovecs == 0), and that now gives
> > > us the case where we have a dirty log item descriptor
> > > (XFS_LID_DIRTY) without a log vector attached to item->li_lv.
> > > Then in xlog_cil_insert_items():
> > > 
> > >                 /* Skip items which aren't dirty in this transaction. */
> > >                 if (!(lidp->lid_flags & XFS_LID_DIRTY))
> > >                         continue;
> > > 
> > >                 /*
> > >                  * Only move the item if it isn't already at the tail. This is
> > >                  * to prevent a transient list_empty() state when reinserting
> > >                  * an item that is already the only item in the CIL.
> > >                  */
> > >                 if (!list_is_last(&lip->li_cil, &cil->xc_cil))
> > >                         list_move_tail(&lip->li_cil, &cil->xc_cil);
> > > 
> > > 
> > > We put that "clean" log item on the CIL because XFS_LID_DIRTY is
> > > set, and then when we push the CIL in xlog_cil_push(), we trip over
> > > a dirty log item without a log vector when chaining log vectors to
> > > pass to the log writing code here:
> > > 
> > >         while (!list_empty(&cil->xc_cil)) {
> > >                 struct xfs_log_item     *item;
> > > 
> > >                 item = list_first_entry(&cil->xc_cil,
> > >                                         struct xfs_log_item, li_cil);
> > >                 list_del_init(&item->li_cil);
> > >                 if (!ctx->lv_chain)
> > >                         ctx->lv_chain = item->li_lv;
> > >                 else
> > >                         lv->lv_next = item->li_lv;       <<<<<<<<<
> > >  >>>>>>>>       lv = item->li_lv;
> > >                 item->li_lv = NULL;
> > >                 num_iovecs += lv->lv_niovecs;
> > >         }
> > > 
> > > i.e. lv ends up null part way through the log item chain we are
> > > processing and the next loop iteration fails.
> > > 
> > 
> > Ok, I hadn't traced through the actual crash. The above all sounds sane
> > to me...
> > 
> > > IOWs, the bug isn't in the patch I wrote - it has uncovered a
> > > latent bug added years ago for a condition that had never, ever been
> > > exercised until now.
> > > 
> > 
> > Note that I agree you're probably right that there is a bug worth fixing
> > in the CIL code to not crash in this case. The crash is just a symptom,
> > however. There's still a bug in this patch because the buffer needs to
> > be logged.
> > 
> > IOW, the purpose of this test is to demonstrate that the "should never
> > happen" case argued above "actually can happen."
> > 
> > > Brian, can you now give me all the details of what you were doing to
> > > produce this and turn on CONFIG_XFS_DEBUG so that it catches the
> > > zero length buffer that was logged when it happens?  That way I can
> > > test a fix for this bug and that the buffer range logging exercises
> > > this case properly...
> > > 
> > 
> > Yep. It's a specially crafted symlink creation on a small FSB, v4
> > filesystem with fragmented free space. We log symlink buffers on v4
> > filesystems without any header, so the buffer content is not dictated by
> > any internal fs metadata format. If the link target is large enough to
> > span multiple blocks and free space is fragmented such that those blocks
> > are discontiguous, we can end up logging (solely) the first byte of the
> > last buffer of the link target.
> > 
> > This is actually reproducible on demand so I'll just append a basic
> > recipe rather than collect the debug data and whatnot..
> > 
> > Brian
> > 
> > --- 8< ---
> > 
> > dev=<dev>
> > mnt=/mnt
> > 
> > sym=`for i in $(seq 0 512); do echo -n a; done`
> > 
> > mkfs.xfs -f -mcrc=0 -bsize=512 -dsize=25m $dev
> > mount $dev $mnt
> > 
> > dd if=/dev/zero of=$mnt/spc1
> > ~/xfstests-dev/src/punch-alternating $mnt/spc1
> > dd if=/dev/zero of=$mnt/spc2
> > xfs_io -c "fpunch 5m 25m" $mnt/spc2
> > 
> > for i in $(seq 0 2); do
> >         ln -s $sym $mnt/link.$i
> >         xfs_io -c fsync $mnt
> > done
> > 
> > umount $mnt
> 
> Did one of the "fragment free space, do stuff" xfstests hit this?  If
> not, would it be worth turning into a test?
> 

This was just an experiment on this patch. I haven't run xfstests so I
can't say for sure whether some existing test would have caught it
(though I suspect Dave would have hit the problem by now, if so). I'm
not sure it's worth an independent test since it really just exercises a
bug in a patch that is still under development (as opposed to a
regression). This sequence won't trigger any problems that I'm aware of
on upstream XFS.

Brian

> --D
> 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html