On Fri, Apr 24, 2020 at 07:38:39AM +1000, Dave Chinner wrote:
> On Thu, Apr 23, 2020 at 10:36:37AM -0400, Brian Foster wrote:
> > On Thu, Apr 23, 2020 at 03:59:19PM +1000, Dave Chinner wrote:
> > > On Wed, Apr 22, 2020 at 01:54:24PM -0400, Brian Foster wrote:
> > > > The log item failed flag is used to indicate a log item was flushed but the underlying buffer was not successfully written to disk. If the error configuration allows for writeback retries, xfsaild uses the flag to resubmit the underlying buffer without acquiring the flush lock of the item.
> > > >
> > > > The flag is currently set and cleared under the AIL lock and buffer lock in the ->iop_error() callback, invoked via ->b_iodone() at I/O completion time. The flag is checked at xfsaild push time under AIL lock and cleared under buffer lock before resubmission. If I/O eventually succeeds, the flag is cleared in the _done() handler for the associated item type, again under AIL lock and buffer lock.
> > >
> > > Actually, I think you've missed the relevant lock here: the item's flush lock. The XFS_LI_FAILED flag is item flush state, and flush state is protected by the flush lock. When the item has been flushed and attached to the buffer for completion callbacks, the flush lock context gets handed to the buffer.
> > >
> > > i.e. the buffer owns the flush lock and so while that buffer is locked for IO we know that the item flush state (and hence the XFS_LI_FAILED flag) is exclusively owned by the holder of the buffer lock.
> >
> > Right..
> >
> > > (Note: this is how xfs_ifree_cluster() works - it grabs the buffer lock then walks the items on the buffer and changes the callback functions because those items are flush locked and hence holding the buffer lock gives exclusive access to the flush state of those items....)
> > >
> > > > As far as I can tell, the only reason for holding the AIL lock across sets/clears is to manage consistency between the log item bitop state and the temporary buffer pointer that is attached to the log item. The bit itself is used to manage consistency of the attached buffer, but is not enough to guarantee the buffer is still attached by the time xfsaild attempts to access it.
> > >
> > > Correct. The guarantee that the buffer is still attached to the log item is what the AIL lock provides us with.
> > >
> > > > However since failure state is always set or cleared under buffer lock (either via I/O completion or xfsaild), this particular case can be handled at item push time by verifying failure state once under buffer lock.
> > >
> > > In theory, yes, but there's a problem before you get that buffer lock. That is: what serialises access to lip->li_buf?
> >
> > That's partly what I was referring to above by the push time changes. This patch was an attempt to replace reliance on ail_lock with a push time sequence that would serialize access to a failed buffer (essentially relying on the failed bit). Whether it's correct or not is another matter... ;)
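(For reference, the push time sequence the patch boils down to is roughly the following - a simplified sketch rather than the literal diff, and the helper name is made up for illustration:)

static bool
xfsaild_trylock_failed_buf(struct xfs_log_item *lip)
{
        struct xfs_buf  *bp;

        if (!test_bit(XFS_LI_FAILED, &lip->li_flags))
                return false;

        /* the access in question: no reference held across this dereference */
        bp = lip->li_buf;
        if (!bp || !xfs_buf_trylock(bp))
                return false;

        /*
         * The failed bit and ->li_buf are only ever set/cleared under the
         * buffer lock, so re-verify the failed state now that we hold it.
         */
        if (!test_bit(XFS_LI_FAILED, &lip->li_flags)) {
                xfs_buf_unlock(bp);
                return false;
        }

        /* ... clear failed state and queue bp for delwri submission ... */
        xfs_buf_unlock(bp);
        return true;
}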
> > > The xfsaild does not hold a reference to the buffer and, without the AIL lock to provide synchronisation, the log item reference to the buffer can be dropped asynchronously by buffer IO completion. Hence the buffer could be freed between the xfsaild reading lip->li_buf and calling xfs_buf_trylock(bp). i.e. this introduces a use-after-free race condition on the initial buffer access.
> >
> > Hmm.. the log item holds a temporary reference to the buffer when the failed state is set. On submission, xfsaild queues the failed buffer (new ref for the delwri queue), clears the failed state and drops the failure reference of every failed item that is attached. xfsaild is also the only task that knows how to process LI_FAILED items, so I don't think we'd ever race with a failed buffer becoming "unfailed" from xfsaild (which is the scenario where a buffer could disappear from I/O completion).
>
> Sure we can. Every inode on the buffer has XFS_LI_FAILED set on it, so the first inode in the AIL triggers the resubmit and starts the IO. The AIL runs again, coming across another inode on the buffer further into the AIL. That inode has been modified in memory while the flush was in progress, so it no longer gets removed from the AIL by the IO completion. Instead, xfs_clear_li_failed() is called from xfs_iflush_done() and the buffer is removed from the log item and the reference dropped.
>

When xfsaild encounters the first failed item, the buffer resubmit path (xfsaild_resubmit_item(), as of this series) calls xfs_clear_li_failed() on every inode attached to the buffer. AFAICT, that means xfsaild will see any associated inode as FLUSHING (or PINNED, if another in-core modification has come through) once the buffer is requeued.

> > In some sense, clearing LI_FAILED of an item is essentially retaking ownership of the item flush lock and the hold on the underlying buffer, but the existing code is certainly not written that way.
>
> Only IO completion clears the LI_FAILED state of the item, not the xfsaild. IO completion already owns the flush lock - the xfsaild only holds it long enough to flush an inode to the buffer and then give the flush lock to the buffer.
>
> Also, we clear LI_FAILED when removing the item from the AIL under the AIL lock, so the only case here that we are concerned about is the above "inode was relogged while being flushed" situation. This situation is rare, failed buffers are rare, and so requiring the AIL lock to be held is a performance limiting factor here...
>

Yep, though because of the above I think I/O completion clearing the failed state might require a more subtle dance between xfsaild and another submission context, such as if reclaim happens to flush an inode that wasn't already attached to a previously failed buffer (for example). Eh, it doesn't matter that much in any event because..

> > The issue I was wrestling with wrt this patch was primarily maintaining consistency between the bit and the assignment of the pointer on a failing I/O. E.g., the buffer pointer is set by the task that sets the bit, but xfsaild checking the bit doesn't guarantee the pointer had been set yet. I did add a post-buffer lock LI_FAILED check as well, but that probably could have been an assert.
>
> So you've been focussing on races that may occur when setting the li_buf, but I'm looking at races involving the clearing of the li_buf pointer. :)
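(To make the clear side concrete: the existing helpers are roughly of the following shape - paraphrased from xfs_trans_priv.h with the asserts and lockdep annotations dropped. Once the ail_lock is no longer held around these, nothing orders the NULLing of ->li_buf and the xfs_buf_rele() against a push time read of ->li_buf:)

static inline void
xfs_set_li_failed(struct xfs_log_item *lip, struct xfs_buf *bp)
{
        /* take a failure reference and remember the buffer */
        if (!test_and_set_bit(XFS_LI_FAILED, &lip->li_flags)) {
                xfs_buf_hold(bp);
                lip->li_buf = bp;
        }
}

static inline void
xfs_clear_li_failed(struct xfs_log_item *lip)
{
        struct xfs_buf  *bp = lip->li_buf;

        /* drop the failure reference; the buffer may go away after this */
        if (test_and_clear_bit(XFS_LI_FAILED, &lip->li_flags)) {
                lip->li_buf = NULL;
                xfs_buf_rele(bp);
        }
}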
> > > IOWs, the xfsaild cannot access lip->li_buf safely unless the set/clear operations are protected by the same lock the xfsaild holds. The xfsaild holds neither the buffer lock, a buffer reference, nor an item flush lock, hence it's the AIL lock or nothing...
> >
> > Yeah, that was my impression from reading the code. I realize from this discussion that this doesn't actually simplify the logic. That was the primary goal, so I think I need to revisit the approach here. Even if this is correct (which I'm still not totally sure of), it's fragile in the sense that it partly relies on the single-threadedness of xfsaild. I initially thought about adding a log item spinlock for this kind of thing, but it didn't (and still doesn't) seem appropriate to burden every log item in the system with a spinlock for the rare case of buffer I/O errors.
>
> I feel this is all largely moot, because my patches result in this whole problem of setting/clearing the li_buf for failed buffers going away altogether. Hence I would suggest that this patch gets dropped for now, because getting rid of the AIL lock is much less troublesome once LI_FAILED is no longer dependent on the inode/dquot log item taking temporary references to the underlying buffer....
>

... I'm good with that. ;) This was intended to be a low level cleanup to eliminate a fairly minor ail_lock abuse. Working around the ->li_buf thing is the primary challenge, so if you have a broader rework that addresses that as a side effect then that presumably makes things easier. I'll revisit this if necessary once your work is further along...

Brian

> > I mentioned in an earlier patch that it would be nice to consider removing ->li_buf entirely but hadn't thought it through.
>
> I'm going the opposite way entirely, such that LI_FAILED doesn't have any special state associated with it at all...
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx
>