On Tue, May 23, 2017 at 12:22:04PM -0400, Brian Foster wrote:
> On Tue, May 23, 2017 at 09:23:18PM +1000, Dave Chinner wrote:
> > On Mon, May 22, 2017 at 08:51:13AM -0400, Brian Foster wrote:
> > > On Mon, May 22, 2017 at 09:19:06AM +1000, Dave Chinner wrote:
> > > > On Sat, May 20, 2017 at 07:46:56AM -0400, Brian Foster wrote:
> > > > > On Sat, May 20, 2017 at 09:39:00AM +1000, Dave Chinner wrote:
> > > > > > Adding new flags to the same field that can be asynchronously
> > > > > > updated by RMW operations outside the ailp->xa_lock will cause
> > > > > > problems in future. There *may not* be a problem right now, but
> > > > > > it is poor programming practice to have different coherency
> > > > > > processes for the same variable that is updated via RMW
> > > > > > operations. In these situations, the only safe way to update
> > > > > > the variable is to use an atomic operation.
> > > > >
> > > > > So is there a reason why we couldn't acquire ->xa_lock to fail
> > > > > the log items, as we would have done anyway if the metadata
> > > > > writeback had succeeded and we were removing the log items from
> > > > > the AIL?
> > > >
> > > > Yes. The ailp->xa_lock that protects AIL state is a highly
> > > > contended lock. It should not be used for things that aren't AIL
> > > > related because that will have performance and scalability
> > > > implications.
> > >
> > > The purpose of this flag is to control AIL retry processing, so how
> > > is this not AIL related?
> >
> > It's IO state, not AIL state. IO submission occurs from more places
> > than an AIL push (e.g. inode reclaim, inode clustering, etc.) and
> > there's no way we should be exposing the internal AIL state lock in
> > places like that.
>
> That's reasonable, though it suggests different code from what this
> patchset currently does. This would be far more useful if we could
> focus on correctness of the code first and foremost. If the code we
> commit actually warrants this kind of change, then this becomes a
> much easier (i.e., pointless :P) discussion from my end.

Hmmm - this is not the first time recently you've said something
similar in a review discussion where you didn't like that issues have
been pointed out. Last time it was "handwavey", now it's "pointless".
We haven't worked out a solution that solves all the known problems,
so these sorts of comments really don't help close on one....

> As it is, this patch only ever clears the flag on AIL resubmission.
> ISTM you are indirectly suggesting that should occur in other
> contexts as well (e.g., consider a reclaim and flush of another inode
> backed by a buffer that is already failed, and has other inodes
> sitting in the AIL flush locked). If that is the case, please
> elaborate, because otherwise it is kind of impossible to evaluate and
> discuss without specifics.

All I've done is take the deadlock scenario you pointed out and
applied it to the other places where we block on the flush lock. Go
look at inode reclaim - it can do blocking reclaim and so will block
on the flush lock. How is this code supposed to handle hitting a
failed, flush-locked inode? If it just blocks, then it's a likely
deadlock vector, as you've already pointed out. Hence we'll need the
same failed-write checks to trigger buffer resubmission before we
retry the inode reclaim process.
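Since that failed flag would then be set at IO completion and checked
or cleared from several different submission contexts (AIL push, inode
reclaim, clustering), here is a minimal sketch of the atomic-bitops
approach referred to at the top of the thread. The structure and names
are hypothetical stand-ins, not the real xfs_log_item definitions:

#include <linux/bitops.h>
#include <linux/types.h>

/* hypothetical bit number for the failed state (a bit index, not a mask) */
#define LI_FAILED_BIT		0

struct log_item_sketch {
	unsigned long	li_flags;
};

/* IO completion marks the item failed; no ailp->xa_lock needed */
static void li_mark_failed(struct log_item_sketch *lip)
{
	set_bit(LI_FAILED_BIT, &lip->li_flags);
}

/* resubmission clears it, again without taking the AIL lock */
static bool li_test_and_clear_failed(struct log_item_sketch *lip)
{
	return test_and_clear_bit(LI_FAILED_BIT, &lip->li_flags);
}

Mixing these bitops with plain "li_flags |= ..." updates done under
xa_lock elsewhere would reintroduce the RMW race, so every updater of
the field would have to use the atomic ops.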
[....]

> As it is, this patch would only add an acquisition of ->xa_lock in
> completion context on I/O failure. Exactly what workload would you
> expect to be affected by changes to I/O error handling? How would it
> be any different than what we already do on normal I/O completion?

If you want to check and clear the failed flag in the log item in a
non-racy way in IO submission (because we have to do this outside the
flush lock) then we'd have to hold the AIL lock. And that means the
AIL lock is then in the hot path of inode flushing in all situations,
not just in IO completion.

We've known for a long time that the AIL lock is the second most
contended lock in the filesystem (behind the CIL context lock), and
that it is most heavily contended on inode IO completion because
completion runs on the CPU that the interrupt is delivered to, and so
the AIL lock gets contended from multiple CPUs at once on multi-disk
filesystems (see xfs_iflush_done() for more info). As such, we do not
want to increase its use anywhere in the inode IO
submission/completion path at all if we can possibly avoid it.

So with all the problems that leaving failed inodes flush locked
presents, I'm starting to think that the inode failed callback should
remove the inode from the buffer and drop the flush lock rather than
leave it held and attached to the buffer. We can check the LI_FAILED
flag in xfs_iflush_int() and simply skip the inode flush to the buffer
to effectively "resubmit" the inode unchanged. If we then clear the
LI_FAILED flag in xfs_iflush_done() (i.e. only when a successful write
occurs), we don't need to jump through hoops to clear the flag at IO
submission time.

We'd probably also need a failed flag on the buffer (maybe with a hold
ref, too) so that it isn't reclaimed by memory pressure while we have
inodes in the failed state pending writeback.

This leaves the rest of the submission code completely unchanged -
including the write clustering. It also gets rid of all the flush lock
deadlock scenarios because we always unlock the flush lock on IO
completion, regardless of whether the buffer IO succeeded or not. And
we don't have to change any infrastructure, either - it all just works
as it is because the LI_FAILED flag is handled at the lowest levels
possible.

This, to me, seems like a far cleaner solution than anything we've
discussed so far....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
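For illustration, a minimal sketch of the LI_FAILED handling described
above, assuming simplified, hypothetical types and helper names rather
than the real xfs_iflush_int()/xfs_iflush_done() code: the flush path
skips re-formatting a failed inode into the buffer so it is
resubmitted unchanged, and only a successful buffer IO completion
clears the flag.

#include <linux/bitops.h>
#include <linux/types.h>

#define LI_FAILED_BIT		0	/* hypothetical bit number */

struct inode_item_sketch {
	unsigned long	li_flags;
};

/*
 * Flush path (stand-in for xfs_iflush_int()): if the copy last
 * written to the buffer never reached disk, leave the buffer contents
 * alone so the retry resubmits the inode unchanged.
 */
static void iflush_int_sketch(struct inode_item_sketch *iip, void *dip)
{
	if (test_bit(LI_FAILED_BIT, &iip->li_flags))
		return;
	/* ... format the in-core inode into the buffer copy (dip) ... */
}

/*
 * Buffer IO completion (stand-in for xfs_iflush_done()): only a
 * successful write clears the failed state; a failed write leaves the
 * flag set for the next retry.
 */
static void iflush_done_sketch(struct inode_item_sketch *iip, bool write_ok)
{
	if (write_ok)
		clear_bit(LI_FAILED_BIT, &iip->li_flags);
	/* ... AIL removal and flush lock release happen here as normal ... */
}

With this shape, nothing has to clear the flag at submission time, so
no additional ailp->xa_lock traffic is added to the inode flush path.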