On Fri, Aug 18, 2017 at 07:39:34AM -0400, Brian Foster wrote:
> On Thu, Aug 17, 2017 at 03:31:21PM -0700, Darrick J. Wong wrote:
> > On Thu, Aug 17, 2017 at 04:38:26PM -0400, Brian Foster wrote:
> > > On Mon, Aug 14, 2017 at 10:28:09AM +1000, Dave Chinner wrote:
> > > > On Sat, Aug 12, 2017 at 10:04:34AM -0400, Brian Foster wrote:
> > > > > On Sat, Aug 12, 2017 at 10:16:37AM +1000, Dave Chinner wrote:
> > > > > > On Fri, Aug 11, 2017 at 10:27:43AM -0400, Brian Foster wrote:
> > > > > > > On Fri, Aug 11, 2017 at 12:22:04PM +1000, Dave Chinner wrote:

> > > > > > Using XFS_BLI_ORDERED allows us to log the buffer without recording a new dirty range on the buffer. IOWs, it retains whatever dirty range it already had, and so after joining, marking it ordered and then logging the buffer, we have a XFS_BLI_DIRTY | XFS_BLI_ORDERED buffer in the transaction.

> > > > > > The question is this: what happens when a XFS_BLI_ORDERED buffer with a pre-existing dirty region is formatted for the CIL? We haven't done that before, so I'm betting that we don't relog the dirty region like we should be doing....

> > > > > > ... and we don't relog the existing dirty range because the ordered flag takes precedence.
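[The precedence described above is the ordered-flag short-circuit at the top of xfs_buf_item_size(); roughly, as of the kernel this thread discusses (a simplified sketch, not verbatim source):

	if (bip->bli_flags & XFS_BLI_ORDERED) {
		/*
		 * The buffer has been logged just to order it: no log
		 * vectors are emitted for its contents, so whatever
		 * dirty region the bli already carries is never
		 * formatted for the CIL.
		 */
		*nvecs = XFS_LOG_VEC_ORDERED;
		return;
	}

xfs_buf_item_format() similarly writes only the buf log format header for ordered items, which is why a pre-existing dirty range silently goes unlogged.]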
> > > > > Right.. so it seems that the current implementation for ordered buffers assumes a buffer is only ever used in one mode or the other. Additionally, the AIL assumes that any reinserted item has been fully relogged and so it moves the LSN forward unconditionally. Current ordered buffer processing violates this constraint for an already logged buffer.

> > > > Right, but it's not been a concern until now because we've only ever used ordered buffers on newly allocated buffers that haven't been previously logged.

> > > Shortly after starting to test the latest refactoring with some asserts that enforce "strict" buffer ordering, I (sure enough ;)) ran into an instance of the above.

> > > The issue is with the bmbt owner change operation that is part of the swap extent sequence. The owner change code ends up setting a buffer ordered that is currently dirty due to an unrelated previous transaction. The potential side effects of this are basically what we've already laid out above. Note that this is only relevant on crc=1,rmapbt=0 filesystems.

> > > I'm currently testing out something that uses the rmapbt based swap implementation for all filesystems. IIUC, that algorithm doesn't actually require rmapbt support, it is just unfortunately named as such because we only use it on rmapbt filesystems. I suspect that is because rmapbt management requires the unmap/remapping of space to keep the tree consistent (and I suspect this is limited to rmapbt fs' for performance reasons, Darrick?) vs. it being something that actually depends on the

> > The rmapbt extent swapping function requires rmapbt because it writes rmap and (more importantly) bmap intent items to the log. An older kernel cannot be allowed to recover such log items, which the rocompat feature check in xfs_mount_validate_sb prevents.

> Hmm, that's not what I've observed. I've tested recovery of both the rmapbt algorithm (with and without rmapbt actually enabled) as well as

But did you try recovery of the rmapbt algorithm on a pre-4.8 kernel? That's what I meant by 'older kernel cannot be allowed...'. IOWs, it's ok for log recovery on a 4.9+ kernel to process bmap intent items on a non-rmap/non-reflink fs.

> the bmbt owner change algorithm. bmbt owner change recovery is actually currently broken due to the inode owner check in xfs_btree_lookup_get_block(), but I've been waiting to determine how we end up resolving this higher level issue before digging any further into that (I suspect it just needs to be bypassed during recovery). I hadn't hit any issues with the former cases.

> Looking at the code, I see the high level algorithm basically queues bmap unmap/map deferred ops across the file address space. That eventually resolves to __xfs_bunmapi()/xfs_bmapi_remap() calls that migrate the mappings. __xfs_bunmapi() clearly has to work in general for !rmapbt filesystems. AFAICT, all the XFS_BMAPI_REMAP flag does here is prevent us from actually freeing the blocks.

> Moving along, we next end up in xfs_bmapi_remap() to map said blocks back into the other file... where I also don't see anything feature specific in the implementation. It uses xfs_bmap_add_extent_hole_real(), which calls xfs_rmap_map_extent(), which calls xfs_rmap_update_is_needed(), which checks xfs_sb_version_hasrmapbt().

> Looking back at the unmap side, it looks like we have the same pattern down in xfs_bmap_del_extent() -> xfs_rmap_unmap_extent() -> ...

> So AFAICT we have feature checks in the appropriate places. Am I missing something else?

Nope.

--D
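[For reference, the check at the bottom of both call chains traced above is just a feature-bit test; simplified from the xfs_rmap.c of this era:

	static bool
	xfs_rmap_update_is_needed(
		struct xfs_mount	*mp,
		int			whichfork)
	{
		/*
		 * rmap updates are skipped entirely on !rmapbt
		 * filesystems (and for the COW fork, which never
		 * gets rmap records).
		 */
		return xfs_sb_version_hasrmapbt(&mp->m_sb) &&
		       whichfork != XFS_COW_FORK;
	}
]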
> > I've never tried to see what happens when log recovery hits an item with an unknown magic number -- I think we just bail out with EIO and leave the log for someone else to recover? In theory you could use the rmap swapextents function all the time, but anyone trying to recover after a crash with an old kernel will faceplant hard. We could also implement a new log-incompat flag if !rmapbt and someone tries to log a bmap item... though iirc log_incompat doesn't exist for v4 filesystems, so that still doesn't help us.

> > Conceptually, at least, the rmap swapext function should be capable of swapping /any/ two inodes, not just the particular set of circumstances required by xfs_swap_extents.

> Indeed, thanks. If there is some dependency here it seems like it shouldn't be too difficult to break.

> > > rmapbt.

> > > FWIW, I've so far at least considered some of the following alternatives to try and preserve the existing bmbt owner change algorithm before moving on to testing the rmapbt approach:

> > > 1.) Don't use ordered buffers for the bmbt owner change. My understanding is that this is a non-starter as we'd need to log every buffer in the tree in a single transaction.

> > Yep.

> > > 2.) Update the extent swap operation to push the log and AIL based on the most recent LSN of any buffer in the bmbt before the bmbt owner change takes place. I've experimented with this enough to hack together an xfs_ail_push_sync() prototype and also observe that using the ili_item.li_lsn of the inode is not sufficient.

> > > IOW, this basically requires a second bmbt walk per-inode to discover the max lsn of all bmbt blocks. That may or may not be a performance issue since we have to walk the bmbt anyways. Otherwise I think this is a viable option from a correctness perspective.

> > > A simplified variant of this (that just crossed my mind while writing this up) may be to do an in-core check for dirty bmbt buffers on an inode and just do an xfs_ail_push_all_sync() if any are found.

> > I imagine you have to keep the inodes locked throughout all this to prevent something else from wandering in and re-logging the bmbt block?

> Yeah, or retry or fail with -EAGAIN or something if we hit the problematic state, and/or see if there's something we can do to effectively quiesce the bmbt while under iolock. I hadn't really got that far tbh because this didn't seem like a great first option.

> Brian

> > > 3.) Synchronously write the affected buffers at xfs_trans_ordered_buf() time. This is a bit ugly because it requires us to xfs_trans_brelse() the buf from the current transaction, _bwrite() it and then _bjoin() it back to the transaction all within xfs_trans_ordered_buf(). The advantage over #2 is that this only occurs for buffers with logged segments in the bli. It also effectively implements a form of generic support for ordering previously logged buffers.

> > > 4.) Implement the ordered buffer relogging logic discussed below: detect ordered buffers with previously logged segments and relog them at ->iop_format() time. I suppose I could alleviate my aforementioned concerns with an assert that verifies an ordered buffer isn't already dirtied by the _current_ transaction (i.e., to catch the case of the "screwed up optimization" I was concerned about).

> > > That aside, I think there is another problem here that we missed previously: transactions that use ordered buffers don't reserve log space for them, but relogged items require log reservation. If I follow the code correctly, a relogged item that is presently in the CIL may not use reservation in a subsequent tx. If the item resides in the AIL at relog commit time, however, the relogging transaction must have reservation for the item (the physical log space for the item is freed and the space used for the relog is accounted out of the reservation).

> > > So I think the only way we can support the ability to relog previously dirty buffers that have been marked ordered is to require log reservation for them as for normal physically logged buffers, which kind of defeats the purpose.

> > > 5.) A deferred op variant of the bmbt owner change algorithm. I haven't fully thought this one through yet so it may not be viable, but the general idea is to use deferred ops to conditionally physically log previously dirty buffers where required. For example, use a transaction with log reservation for one full buffer, log a new bmbt owner change intent and start running through the bmbt scan attempting to order buffers. Update xfs_trans_ordered_buf() to explicitly fail on buffers with dirty segments. When said failure occurs during the bmbt scan, physically log the buffer and terminate the scan with -EAGAIN. The deferred infra rolls the tx, relogs the intent and has to restart the bmbt scan over again. This repeats until we've processed the entire tree.
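[To make 5.) concrete, a rough skeleton of the core loop (entirely hypothetical; it assumes a variant of xfs_trans_ordered_buf() that can refuse buffers whose bli already has dirty segments, per the text above):

	/* tp carries log reservation for exactly one full buffer */
	/* for each bmbt block bp in the scan: */
	xfs_trans_bjoin(tp, bp);
	if (!xfs_trans_ordered_buf(tp, bp)) {
		/*
		 * The buffer carries dirty segments from an earlier
		 * transaction, so ordering it is not safe.  Physically
		 * log it, consuming the reserved space, and return
		 * -EAGAIN so the deferred op infrastructure rolls the
		 * transaction, relogs the owner change intent and
		 * restarts the scan.
		 */
		xfs_trans_log_buf(tp, bp, 0, BBTOB(bp->b_length) - 1);
		return -EAGAIN;
	}
]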
> > > I had a couple more thoughts that are similarly not yet thought out and thus not worth rambling about. Thoughts on any of the above? On using the rmapbt algorithm? Other ideas?

> > > Brian

> > > > > > Ok, the ordered buffer checks in xfs_buf_item_size() and xfs_buf_item_format() need to also check for dirty regions. If dirty regions exist, then we treat it like a normal buffer rather than an ordered buffer. We can factor the dirty region check out of xfs_buf_item_unlock() for this...

> > > > > > Actually, check the case in xfs_buf_item_size() and remove the ordered flag if there are dirty regions. Then xfs_buf_item_format() will do the right thing without needing a duplicate check...

> > > > > I think that would work, assuming we actually check the xfs_buf_log_format for dirty-ness rather than just the log item. As it is, note that ordered buffers are still "logged" in the transaction because otherwise the transaction infrastructure will assume it made no change to the buf and toss the log item at commit time (we also need to set up I/O completion on the buf and whatnot).

> > > > *nod*
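[That suggestion would look something like this in xfs_buf_item_size() (sketch only; buf_item_has_dirty_regions() is a hypothetical stand-in for the dirty-region check factored out of xfs_buf_item_unlock()):

	if ((bip->bli_flags & XFS_BLI_ORDERED) &&
	    buf_item_has_dirty_regions(bip)) {
		/*
		 * Hypothetical helper: walks bli_formats[] looking for
		 * set bits in each blf_data_map.  If the buffer was
		 * dirtied by a previous transaction, drop the ordered
		 * flag so we fall through and size/format it as a
		 * normal physically logged buffer, relogging the
		 * existing dirty regions.
		 */
		bip->bli_flags &= ~XFS_BLI_ORDERED;
	}
]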
> > > > > What concerns me about this approach is that I think we introduce the possibility for subtle bugs. Existing ordered buffer code does this:

> > > > >	xfs_trans_ordered_buf(tp, fbuf);
> > > > >	xfs_trans_log_buf(tp, fbuf, 0, BBTOB(fbuf->b_length) - 1);

> > > > > ... which should continue to work fine. Allowing ordered buffers to physically log means that something like this:

> > > > >	xfs_trans_log_buf(tp, fbuf, 0, BBTOB(fbuf->b_length) - 1);
> > > > >	xfs_trans_ordered_buf(tp, fbuf);

> > > > > ... is now a bug that is only apparent after scrutiny of xfs_trans_*() and logging internals. Granted, the above already is incorrect, but it technically still works as expected. I don't see the need to turn that into a real problem by actually logging the buffer when we might not expect to.

> > > > Well, it's not a "things go bad" bug. It's a "we screwed up an optimisation" bug, because logging the buffer contents unnecessarily only increases the required log bandwidth. It shouldn't affect replay because the buffer is still correctly ordered in the log. Hence both the transient and end states of the buffer during replay will still be the same...

> > > > > So while I agree that this could probably be made to work, and I think it is preferable to doing any kind of logged range tracking in the deferred ops code, it still seems more tricky than it needs to be. To relog a held buffer in a new transaction, why not just mark the lidp dirty in the new transaction so it inherits all existing dirty segments? AFAICT, all we really need to do is:

> > > > >	tp->t_flags |= XFS_TRANS_DIRTY;
> > > > >	lidp->lid_flags |= XFS_LID_DIRTY;

> > > > > ... on the new transaction and everything should just work as designed (for a buffer that has been previously logged, held, rolled and rejoined).

> > > > We would also need to set:

> > > >	bip->bli_flags |= XFS_BLI_DIRTY | XFS_BLI_LOGGED;

> > > > which means we should....

> > > > > To elaborate a bit, I think we could refactor xfs_trans_log_buf() into a new xfs_trans_dirty_buf() helper that covers all of the relevant bits not related to actually dirtying the bli. xfs_trans_log_buf() would call xfs_trans_dirty_buf() and thus would not change functionally. xfs_trans_ordered_buf() could now call xfs_trans_dirty_buf() and thus the existing ordered buf users would no longer need to log a range of the buffer (which doesn't make much sense anyways).

> > > > ... do this. :)

> > > > > Finally, the deferred infrastructure could join/dirty/hold the buffer to the new transaction after each roll without needing to track and relog specific regions of the buffer. Thoughts?

> > > > Yup, that's exactly what I was thinking should be possible by using ordered buffers.... :)

> > > > And Christoph's rework of the transaction roll and deferred inode handling that he just posted should make adding buffer handling quite a bit neater and cleaner.

> > > > > Unless I'm missing something as to why this is busted, I'll take a closer look at the code and float an rfc next week since otherwise it sounds like this is something we could actually fix up in the ordered buffer code today.

> > > > Cool.

> > > > > Nothing in XFS is ever simple, is it? :P

> > > > > There used to be a level of satisfaction at feeling I understood some new corner of XFS. Nowadays I know that just means I'm not yet aware of whatever dragons remain in that corner (is that paranoia? not if it's true!). :P

> > > > Ah, the true signs of expertise: developing a knowledge base and insight deep enough to understand that there is always another hidden dragon poised to bite your head off. :)

> > > > Cheers,

> > > > Dave.
> > > > --
> > > > Dave Chinner
> > > > david@xxxxxxxxxxxxx
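[To make the proposed split concrete, a rough sketch of the shape it might take (simplified; the sanity checks and iodone setup currently in xfs_trans_log_buf() would move into the helper unchanged, and field names follow the codebase of this era):

	void
	xfs_trans_dirty_buf(
		struct xfs_trans	*tp,
		struct xfs_buf		*bp)
	{
		struct xfs_buf_log_item	*bip = bp->b_fspriv;

		/* ... existing checks and iodone setup ... */

		bip->bli_flags |= XFS_BLI_DIRTY | XFS_BLI_LOGGED;
		tp->t_flags |= XFS_TRANS_DIRTY;
		bip->bli_item.li_desc->lid_flags |= XFS_LID_DIRTY;
	}

	void
	xfs_trans_log_buf(
		struct xfs_trans	*tp,
		struct xfs_buf		*bp,
		uint			first,
		uint			last)
	{
		xfs_trans_dirty_buf(tp, bp);

		/* only the physical logging path records a dirty range */
		xfs_buf_item_log(bp->b_fspriv, first, last);
	}

xfs_trans_ordered_buf() would then call xfs_trans_dirty_buf() directly and skip xfs_buf_item_log() entirely, and the deferred ops code could simply join, dirty and hold the buffer across each roll.]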