Re: [PATCH] xfs: hold xfs_buf locked between shortform->leaf conversion and the addition of an attribute

Brian Foster <bfoster@xxxxxxxxxx> · Fri, 18 Aug 2017 11:48:04 -0400

On Fri, Aug 18, 2017 at 08:37:34AM -0700, Darrick J. Wong wrote:
> On Fri, Aug 18, 2017 at 07:39:34AM -0400, Brian Foster wrote:
> > On Thu, Aug 17, 2017 at 03:31:21PM -0700, Darrick J. Wong wrote:
> > > On Thu, Aug 17, 2017 at 04:38:26PM -0400, Brian Foster wrote:
> > > > On Mon, Aug 14, 2017 at 10:28:09AM +1000, Dave Chinner wrote:
> > > > > On Sat, Aug 12, 2017 at 10:04:34AM -0400, Brian Foster wrote:
> > > > > > On Sat, Aug 12, 2017 at 10:16:37AM +1000, Dave Chinner wrote:
> > > > > > > On Fri, Aug 11, 2017 at 10:27:43AM -0400, Brian Foster wrote:
> > > > > > > > On Fri, Aug 11, 2017 at 12:22:04PM +1000, Dave Chinner wrote:
> > > > > > > Using XFS_BLI_ORDERED allows us to log the buffer without recording
> > > > > > > a new dirty range on the buffer. IOWs, it retains whatever dirty range
> > > > > > > it already had, and so after joining, marking it ordered and then
> > > > > > > logging the buffer, we have a XFS_BLI_DIRTY | XFS_BLI_ORDERED buffer
> > > > > > > in the transaction.
> > > > > > > 
> > > > > > > The question is this: what happens when a XFS_BLI_ORDERED buffer
> > > > > > > with a pre-existing dirty region is formatted for the CIL? We
> > > > > > > haven't done that before, so I'm betting that we don't relog the
> > > > > > > dirty region like we should be doing....
> > > > > > > 
> > > > > > > ... and we don't relog the existing dirty range because the
> > > > > > > ordered flag takes precedence.
> > > > > > > 
> > > > > > 
> > > > > > Right.. so it seems that the current implementation for ordered buffers
> > > > > > assumes a buffer is only ever used in one mode or the other.
> > > > > > Additionally, the AIL assumes that any reinserted item has been fully
> > > > > > relogged and so it moves the LSN forward unconditionally. Current
> > > > > > ordered buffer processing violates this constraint for an already logged
> > > > > > buffer.
> > > > > 
> > > > > Right, but it's not been a concern until now because we've only ever
> > > > > used ordered buffers on newly allocated buffers that haven't been
> > > > > previously logged.
> > > > > 
> > > > 
> > > > Shortly after starting to test the latest refactoring with some asserts
> > > > that enforce "strict" buffer ordering, I (sure enough ;)) ran into an
> > > > instance of the above.
> > > > 
> > > > The issue is with the bmbt owner change operation that is part of the
> > > > swap extent sequence. The owner change code ends up setting a buffer
> > > > ordered that is currently dirty due to an unrelated previous
> > > > transaction. The potential side effects of this are basically what we've
> > > > already laid out above. Note that this is only relevant on
> > > > crc=1,rmapbt=0 filesystems.
> > > > 
> > > > I'm currently testing out something that uses the rmapbt based swap
> > > > implementation for all filesystems. IIUC, that algorithm doesn't
> > > > actually require rmapbt support, it is just unfortunately named as such
> > > > because we only use it on rmapbt filesystems. I suspect that is because
> > > > rmapbt management requires the unmap/remapping of space to keep the tree
> > > > consistent (and I suspect this is limited to rmapbt fs' for performance
> > > > reasons, Darrick?) vs. it being something that actually depends on the
> > > 
> > > The rmapbt extent swapping function requires rmapbt because it writes
> > > rmap and (more importantly) bmap intent items to the log.  An older
> > > kernel cannot be allowed to recover such log items, which the rocompat
> > > feature check in xfs_mount_validate_sb prevents.
> > > 
> > 
> > Hmm, that's not what I've observed. I've tested recovery of both the
> > rmapbt algorithm (with and without rmapbt actually enabled) as well as
> 
> But did you try recovery of the rmapbt algorithm on a pre-4.8 kernel?
> That's what I meant by 'older kernel cannot be allowed...'.
> 
> IOWs, it's ok for log recovery on a 4.9+ kernel to process bmap intent
> items on a non-rmap/non-reflink fs.
> 

Ah, Ok... I misunderstood. The issue here then is that the bmap log
intents (and recovery ability) are implicitly tied to the rmapbt feature
bit. So this could work going forward, but we have the same backwards
compatibility issue as noted in the previous mail with regard to using a
new intent. Thanks.

Brian

> > the bmbt owner change algorithm. bmbt owner change recovery is actually
> > currently broken due to the inode owner check in
> > xfs_btree_lookup_get_block(), but I've been waiting to determine how we
> > end up resolving this higher level issue before digging any further into
> > that (I suspect it just needs to be bypassed during recovery). I hadn't
> > hit any issues with the former cases.
> >
> > Looking at the code, I see the high level algorithm basically queues
> > bmap unmap/map deferred ops across the file address space. That
> > eventually resolves to __xfs_bunmapi()/xfs_bmapi_remap() calls that
> > migrate the mappings. __xfs_bunmapi() clearly has to work in general for
> > !rmapbt filesystems. AFAICT, all the XFS_BMAPI_REMAP flag does here is
> > prevent us from actually freeing the blocks.
> > 
> > Moving along, we next end up in xfs_bmapi_remap() to map said blocks
> > back into the other file... where I also don't see anything feature
> > specific in the implementation. It uses xfs_bmap_add_extent_hole_real(),
> > which calls xfs_rmap_map_extent(), which calls
> > xfs_rmap_update_is_needed(), which checks xfs_sb_version_hasrmapbt().
> > 
> > Looking back at the unmap side, it looks like we have the same pattern
> > down in xfs_bmap_del_extent() -> xfs_rmap_unmap_extent() -> ...
> > 
> > So AFAICT we have feature checks in the appropriate places. Am I missing
> > something else?
> 
> Nope.
> 
> --D
> 
> > > I've never tried to see what happens when log recovery hits an item with
> > > an unknown magic number -- I think we just bail out with EIO and leave
> > > the log for someone else to recover?  In theory you could use the
> > > rmap swapextents function all the time, but anyone trying to recover
> > > after a crash with an old kernel will faceplant hard.  We could also
> > > implement a new log-incompat flag if !rmapbt and someone tries to log a
> > > bmap item... though iirc log_incompat doesn't exist for v4 filesystems,
> > > so that still doesn't help us.
> > > 
> > > Conceptually, at least, the rmap swapext function should be capable of
> > > swapping /any/ two inodes, not just the particular set of circumstances
> > > required by xfs_swap_extents.
> > > 
> > 
> > Indeed, thanks. If there is some dependency here it seems like it
> > shouldn't be too difficult to break.
> > 
> > > > rmapbt.
> > > > 
> > > > FWIW, I've so far at least considered some of the following alternatives
> > > > to try and preserve the existing bmbt owner change algorithm before
> > > > moving on to testing the rmapbt approach:
> > > > 
> > > > 1.) Don't use ordered buffers for the bmbt owner change. My
> > > > understanding is that this is a non-starter as we'd need to log every
> > > > buffer in the tree in a single transaction.
> > > 
> > > Yep.
> > > 
> > > > 2.) Update the extent swap operation to push the log and AIL based on
> > > > the most recent LSN of any buffer in the bmbt before the bmbt owner
> > > > change takes place. I've experimented with this enough to hack together
> > > > an xfs_ail_push_sync() prototype and also observe that using the
> > > > ili_item.li_lsn of the inode is not sufficient.
> > > > 
> > > > IOW, this basically requires a second bmbt walk per-inode to discover
> > > > the max lsn of all bmbt blocks. That may or may not be a performance
> > > > issue since we have to walk the bmbt anyways. Otherwise I think this is
> > > > a viable option from a correctness perspective.
> > > > 
> > > > A simplified variant of this (that just crossed my mind while writing
> > > > this up) may be to do an in-core check for dirty bmbt buffers on an
> > > > inode and just do an xfs_ail_push_all_sync() if any are found.
> > > 
> > > I imagine you have to keep the inodes locked throughout all this to
> > > prevent something else from wandering in and re-logging the bmbt block?
> > > 
> > 
> > Yeah, or retry or fail with -EAGAIN or something if we hit the
> > problematic state, and/or see if there's something we can do to
> > effectively quiesce the bmbt while under iolock. I hadn't really got
> > that far tbh because this didn't seem like a great first option.
> > 
> > Brian
> > 
> > > > 3.) Synchronous write the affected buffers at xfs_trans_ordered_buf()
> > > > time. This is a bit ugly because it requires to xfs_trans_brelse() the
> > > > buf from the current transaction, _bwrite() it and then _bjoin() it back
> > > > to the transaction all within xfs_trans_ordered_buf(). The advantage
> > > > over #2 is that this only occurs for buffers with logged segments in the
> > > > bli. It also effectively implements a form of generic support for
> > > > ordering previously logged buffers.
> > > > 
> > > > 4.) Implement the ordered buffer relogging logic discussed below: detect
> > > > ordered buffers with previously logged segments and relog them at
> > > > ->iop_format() time. I suppose I could alleviate my aforementioned
> > > > concerns with an assert that verifies an ordered buffer isn't already
> > > > dirtied by the _current_ transaction (i.e., to catch the case of the
> > > > "screwed up optimization" I was concerned about).
> > > > 
> > > > That aside, I think there is another problem here that we missed
> > > > previously: transactions that use ordered buffers don't reserve log
> > > > space for them, but relogged items require log reservation. If I follow
> > > > the code correctly, a relogged item that is presently in the CIL may not
> > > > use reservation in a subsequent tx. If the item resides in the AIL at
> > > > relog commit time, however, the relogging transaction must have
> > > > reservation for the item (the physical log space for the item is freed
> > > > and the space used for the relog is accounted out of the reservation).
> > > > 
> > > > So I think the only way we can support the ability to relog previously
> > > > dirty buffers that have been marked ordered is to require log
> > > > reservation for them as for normal physically logged buffers, which kind
> > > > of defeats the purpose.
> > > > 
> > > > 5.) A deferred op variant of the bmbt owner change algorithm. I haven't
> > > > fully thought this one through yet so it may not be viable, but the
> > > > general idea is to use deferred ops to conditionally physically log
> > > > previously dirty buffers where required. For example, use a transaction
> > > > with log reservation for one full buffer, log a new bmbt owner change
> > > > intent and start running through the bmbt scan attempting to order
> > > > buffers. Update xfs_trans_ordered_buf() to explicitly fail on buffers
> > > > with dirty segments. When said failure occurs during the bmbt scan,
> > > > physically log the buffer and terminate the scan with -EAGAIN. The
> > > > deferred infra rolls the tx, relogs the intent and has to restart the
> > > > bmbt scan over again. This repeats until we've processed the entire
> > > > tree.
> > > > 
> > > > I had a couple more thoughts that are similarly not yet thought out and
> > > > thus not worth rambling about. Thoughts on any of the above? On using
> > > > the rmapbt algorithm? Other ideas?
> > > > 
> > > > Brian
> > > > 
> > > > > > > Ok, the ordered buffer checks in xfs_buf_item_size() and
> > > > > > > xfs_buf_item_format() need to also check for dirty regions. If dirty
> > > > > > > regions exist, then we treat it like a normal buffer rather than an
> > > > > > > ordered buffer. We can factor the dirty region check out of
> > > > > > > xfs_buf_item_unlock() for this...
> > > > > > > 
> > > > > > > Actually, check the case in xfs_buf_item_size() and remove the
> > > > > > > ordered flag if there are dirty regions. Then xfs_buf_item_format()
> > > > > > > will do the right thing without needing a duplicate check...
> > > > > > > 
> > > > > > 
> > > > > > I think that would work, assuming we actually check the
> > > > > > xfs_buf_log_format for dirty-ness rather than just the log item. As it
> > > > > > is, note that ordered buffers are still "logged" in the transaction
> > > > > > because otherwise the transaction infrastructure will assume it made no
> > > > > > change to the buf and toss the log item at commit time (we also need to
> > > > > > set up I/O completion on the buf and whatnot).
> > > > > 
> > > > > *nod*
> > > > > 
> > > > > > What concerns me about this approach is that I think we introduce the
> > > > > > possibility for subtle bugs. Existing ordered buffer code does this:
> > > > > > 
> > > > > >         xfs_trans_ordered_buf(tp, fbuf);
> > > > > >         xfs_trans_log_buf(tp, fbuf, 0,
> > > > > >                           BBTOB(fbuf->b_length) - 1);
> > > > > > 
> > > > > > ... which should continue to work fine. Allowing ordered buffers to
> > > > > > physically log means that something like this:
> > > > > > 
> > > > > >         xfs_trans_log_buf(tp, fbuf, 0,
> > > > > >                           BBTOB(fbuf->b_length) - 1);
> > > > > >         xfs_trans_ordered_buf(tp, fbuf);
> > > > > > 
> > > > > > ... is now a bug that is only apparent after scrutiny of xfs_trans_*()
> > > > > > and logging internals. Granted, the above already is incorrect, but it
> > > > > > technically still works as expected. I don't see the need to turn that
> > > > > > into a real problem by actually logging the buffer when we might not
> > > > > > expect to.
> > > > > 
> > > > > Well, it's not a "things go bad" bug. It's a "we screwed up an
> > > > > optimisation" bug, because logging the buffer contents unnecessarily
> > > > > only increases the required log bandwidth. It shouldn't affect
> > > > > replay because the buffer is still correctly ordered in the log.
> > > > > Hence both the transient and end states of the buffer during replay
> > > > > will still be the same...
> > > > > 
> > > > > > So while I agree that this could probably be made to work and I think it
> > > > > > is ideal to doing any kind of logged range tracking in the deferred ops
> > > > > > code, it still seems more tricky than it needs to be. To relog a held
> > > > > > buffer in a new transaction, why not just mark the lidp dirty in the new
> > > > > > transaction so it inherits all existing dirty segments? AFAICT, all we
> > > > > > really need to do is:
> > > > > > 
> > > > > >         tp->t_flags |= XFS_TRANS_DIRTY;
> > > > > >         lidp->lid_flags |= XFS_LID_DIRTY;
> > > > > > 
> > > > > > ... on the new transaction and everything should just work as designed
> > > > > > (for a buffer that has been previously logged, held, rolled and
> > > > > > rejoined).
> > > > > 
> > > > > We would also need to set:
> > > > > 
> > > > > 	bip->bli_flags |= XFS_BLI_DIRTY | XFS_BLI_LOGGED;
> > > > > 
> > > > > which means we should....
> > > > > 
> > > > > > To elaborate a bit, I think we could refactor xfs_trans_log_buf() into a
> > > > > > new xfs_trans_dirty_buf() helper that covers all of the relevant bits
> > > > > > not related to actually dirtying the bli. xfs_trans_log_buf() would call
> > > > > > xfs_trans_dirty_buf() and thus would not change functionally.
> > > > > > xfs_trans_ordered_buf() could now call xfs_trans_dirty_buf() and thus
> > > > > > the existing ordered buf users would no longer need to log a range of
> > > > > > the buffer (which doesn't make much sense anyways).
> > > > > 
> > > > > ... do this. :)
> > > > > 
> > > > > > Finally, the
> > > > > > deferred infrastructure could join/dirty/hold the buffer to the new
> > > > > > transaction after each roll without needing to track and relog specific
> > > > > > regions of the buffer. Thoughts?
> > > > > 
> > > > > Yup, that's exactly what I was thinking should be possible by using
> > > > > ordered buffers.... :)
> > > > > 
> > > > > And Christoph's rework of the transaction roll and deferred inode
> > > > > handling that he just posted should make adding buffer handling
> > > > > quite a bit neater and cleaner.
> > > > > 
> > > > > > Unless I'm missing something as to why this is busted, I'll take a
> > > > > > closer look at the code and float an rfc next week since otherwise it
> > > > > > sounds like this is something we could actually fix up in the ordered
> > > > > > buffer code today.
> > > > > 
> > > > > Cool.
> > > > > 
> > > > > > > Nothing in XFS is ever simple, is it? :P
> > > > > > 
> > > > > > There used to be a level of satisfaction at feeling I understood some
> > > > > > new corner of XFS. Nowadays I know that just means I'm not yet aware of
> > > > > > whatever dragons remain in that corner (is that paranoia? not if it's
> > > > > > true!). :P
> > > > > 
> > > > > Ah, the true signs of expertise: developing a knowledge base and
> > > > > insight deep enough to understand that there is always another
> > > > > hidden dragon poised to bite your head off. :)
> > > > > 
> > > > > Cheers,
> > > > > 
> > > > > Dave.
> > > > > -- 
> > > > > Dave Chinner
> > > > > david@xxxxxxxxxxxxx
> > > > > --
> > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html