On Sat, Aug 16, 2014 at 09:58:04AM +1000, Dave Chinner wrote:
> On Fri, Aug 15, 2014 at 12:13:20PM -0400, Brian Foster wrote:
> > On Fri, Aug 15, 2014 at 04:39:06PM +1000, Dave Chinner wrote:
> > > +	if (XFS_FORCED_SHUTDOWN(bp->b_target->bt_mount)) {
> > > +		xfs_buf_ioerror(bp, -EIO);
> > > +		bp->b_flags &= ~XBF_DONE;
> > > +		xfs_buf_stale(bp);
> > > +		xfs_buf_ioend(bp);
> > > +		return;
> > > +	}
> > > 
> > >  	if (bp->b_flags & XBF_WRITE)
> > >  		xfs_buf_wait_unpin(bp);
> > > @@ -1311,25 +1307,10 @@ xfs_buf_iorequest(
> > >  	bp->b_io_error = 0;
> > > 
> > 
> > I know this is from the previous patch, but I wonder if it's cleaner to
> > reset b_io_error when the error is transferred to b_error. That seems
> > slightly more future proof at least.
> 
> I much prefer zeroing just before the variable is needed to be zero,
> simply to indicate the context in which we care that the value is
> correct. Outside of actively submitted IO, the value of b_io_error
> is irrelevant, so its value really doesn't matter. The other
> advantage of leaving it untouched is for debug purposes - the caller
> might clear b_error, but we still know what the state of the last IO
> that was completed in the buffer was...
> 

Sounds good, I don't really have a strong preference.

> > > 
> > >  	/*
> > > -	 * Take references to the buffer. For XBF_ASYNC buffers, holding a
> > > -	 * reference for as long as submission takes is all that is necessary
> > > -	 * here. The IO inherits the lock and hold count from the submitter,
> > > -	 * and these are release during IO completion processing. Taking a hold
> > > -	 * over submission ensures that the buffer is not freed until we have
> > > -	 * completed all processing, regardless of when IO errors occur or are
> > > -	 * reported.
> > > -	 *
> > > -	 * However, for synchronous IO, the IO does not inherit the submitters
> > > -	 * reference count, nor the buffer lock. Hence we need to take an extra
> > > -	 * reference to the buffer for the for the IO context so that we can
> > > -	 * guarantee the buffer is not freed until all IO completion processing
> > > -	 * is done. Otherwise the caller can drop their reference while the IO
> > > -	 * is still in progress and hence trigger a use-after-free situation.
> > > +	 * Take an extra reference so that we don't have to guess when it's no
> > > +	 * longer safe to reference the buffer that was passed to us.
> > >  	 */
> > 
> > Assuming my understanding is correct:
> > 
> > /*
> >  * The caller's reference is released by buffer I/O completion. Technically
> >  * this should not occur until we release the last b_io_remaining reference.
> 
> That's not quite right. The caller's reference is released some time
> *after* b_io_remaining goes to zero. That's the reason we need to
> hold a reference - after we drop our b_io_remaining count, we have
> to have some other method of ensuring the buffer doesn't go away
> until we have finished with the buffer.
> 

Right, that's what I meant by the fact that we have to release the last
b_io_remaining ref one way or another first...

> I'll rewrite the comment.
> 

Ok, the rest of the comment I wrote isn't really clear when I re-read it
back either.
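FWIW, writing it down for myself, the sync submission lifecycle we're
converging on comes out looking roughly like this (just a sketch of the
shape, not the actual patch, though it mirrors the xfs_buf_submit_wait()
hunk quoted below; the _sketch name is made up and the error paths are
elided):

	int
	xfs_buf_submit_wait_sketch(
		struct xfs_buf	*bp)
	{
		int		error;

		/*
		 * Sync IO does not inherit the submitter's reference or
		 * lock, so take an IO-private hold that pins the buffer
		 * until all completion processing has finished.
		 */
		xfs_buf_hold(bp);

		/* one IO count for submission, dropped once bios are queued */
		atomic_set(&bp->b_io_remaining, 1);
		_xfs_buf_ioapply(bp);

		/* if all bios already completed, run completion inline */
		if (atomic_dec_and_test(&bp->b_io_remaining))
			xfs_buf_ioend(bp);

		/* wait for completion, including deferred ioend work */
		wait_for_completion(&bp->b_iowait);

		/* grab the error before dropping our IO-private reference */
		error = bp->b_error;
		xfs_buf_rele(bp);
		return error;
	}

i.e. the caller's reference never has to cover the window between the
last b_io_remaining put and the end of completion processing.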
> > > +xfs_buf_submit_wait(
> > > +	struct xfs_buf	*bp)
> > >  {
> > > -	trace_xfs_buf_iowait(bp, _RET_IP_);
> > > +	int		error;
> > > +
> > > +	trace_xfs_buf_iorequest(bp, _RET_IP_);
> > > 
> > > -	if (!bp->b_error)
> > > -		wait_for_completion(&bp->b_iowait);
> > > +	ASSERT(!(bp->b_flags & (_XBF_DELWRI_Q | XBF_ASYNC)));
> > > +
> > > +	if (XFS_FORCED_SHUTDOWN(bp->b_target->bt_mount)) {
> > > +		xfs_buf_ioerror(bp, -EIO);
> > > +		xfs_buf_stale(bp);
> > > +		bp->b_flags &= ~XBF_DONE;
> > > +		return -EIO;
> > > +	}
> > > 
> > > +	if (bp->b_flags & XBF_WRITE)
> > > +		xfs_buf_wait_unpin(bp);
> > > +
> > > +	/* clear the internal error state to avoid spurious errors */
> > > +	bp->b_io_error = 0;
> > > +
> > > +	/*
> > > +	 * For synchronous IO, the IO does not inherit the submitters reference
> > > +	 * count, nor the buffer lock. Hence we cannot release the reference we
> > > +	 * are about to take until we've waited for all IO completion to occur,
> > > +	 * including any xfs_buf_ioend_async() work that may be pending.
> > > +	 */
> > > +	xfs_buf_hold(bp);
> > > +
> > 
> > Harmless, but why is this necessary? The caller should have the
> > reference, the I/O completion won't release it and we wait on b_iowait
> > before we return. Isn't the caller's reference sufficient?
> 
> Consistency - I'd prefer that all IO has the same reference counting
> behaviour. i.e. that the IO holds its own reference to ensure,
> regardless of anything else that happens, that the buffer has a
> valid reference count the entire time the IO subsystem is processing
> the buffer.
> 

Sounds good.

> > > @@ -1838,7 +1853,10 @@ xfs_buf_delwri_submit(
> > >  		bp = list_first_entry(&io_list, struct xfs_buf, b_list);
> > > 
> > >  		list_del_init(&bp->b_list);
> > > -		error2 = xfs_buf_iowait(bp);
> > > +
> > > +		/* locking the buffer will wait for async IO completion. */
> > > +		xfs_buf_lock(bp);
> > > +		error2 = bp->b_error;
> > 
> > Interesting delwri cleanup overall. The lock here works for
> > synchronization (blocking), but it doesn't look safe for error
> > processing. Once the buffer lock is dropped, who says b_error is from
> > our write (and not a subsequent one, for example) and thus this
> > caller's responsibility to handle the error? I suspect this is a
> > reason the lock is typically held across a sync I/O, so the error is
> > valid after the I/O.
> 
> It's fine because there is only a limited set of blocking callers,
> all of which are special cases. The only callers that use this
> blocking xfs_buf_delwri_submit() interface are:
> 
> 	1. log recovery: running single threaded, so isn't going
> 	to be racing with anything else that can modify the error
> 
> 	2. quotacheck: also running single threaded
> 
> 	3. quota shrinker, which has nowhere to report an error to,
> 	so the error status simply doesn't matter.
> 

Sure, to be honest I wasn't really expecting there to be a scenario
where this is currently possible. I figured the log recovery context
was obviously not a concern because that occurs at mount time. I hadn't
gone to look at the other contexts where we use delwri. quotacheck
makes sense for the same reason. We also use delwri for AIL pushing,
but not the synchronous version.

My comment was more from the perspective that this code will be around
for a long time, and this little characteristic of the mechanism is not
at all explicit or obvious. So it won't ever be a problem until
somebody uses sync delwri in a context where racing is possible.
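To make the window I'm worried about concrete, the blocking side of the
new delwri code reduces to something like this (paraphrasing the hunk
above, with error/error2 as in the patch):

	while (!list_empty(&io_list)) {
		bp = list_first_entry(&io_list, struct xfs_buf, b_list);

		list_del_init(&bp->b_list);

		/*
		 * The async IO owns the buffer lock and drops it on
		 * completion, so taking the lock here blocks until our
		 * write is done. But between that unlock and this lock,
		 * a third party could relock the buffer, resubmit it and
		 * overwrite b_error.
		 */
		xfs_buf_lock(bp);
		error2 = bp->b_error;	/* ours only if nobody raced in */
		xfs_buf_relse(bp);
		if (!error)
			error = error2;
	}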
Then it won't ever reproduce until some user drives said mechanism and
hits an error in just the right manner to lead to some undefined error
recovery behavior. Maybe that doesn't happen for 20 years, but whenever
it does, it's guaranteed to not be fun to debug. ;) So my point is not
to suggest there's a race somewhere. It's just that the mechanism is
racy and I'd prefer we eliminate the possibility of having to debug the
flaw that this leaves behind. :)

Another way to look at it is that we can't ever use this delwri
mechanism from anywhere but such special, isolated contexts going
forward. If we ever find that we want to, that punts the problem to a
prereq for whatever work that happens to be.

It's not clear to me if there's a way to deal with that without
fundamentally changing this mechanism. Anything explicit probably just
adds ugliness that's dedicated specifically to the fact that this is
racy (e.g., a delwri-specific lock/counter), so I don't think we want
to go there. It seems like the more general problem is that we have two
I/O models (sync and async) that are fundamentally incompatible, but
here we try to build a batch sync I/O mechanism on top of the async
submission mechanism. So the definition of the async model is no longer
sufficient for our use, since we clearly have a use case to wait on an
async I/O. Perhaps we should think more about making the async/sync
mechanisms more generic/compatible..?

Thinking out loud, making the lock context dynamically transferable to
the I/O might be an interesting experiment to start to decouple things
(much like we do for joining items to transactions, for example). FWIW,
I'm Ok with deferring that or adding it to my own todo list given that
there are currently no racing contexts, so long as the "async is a
subset of sync" I/O model is generally acceptable.

> > Also, it looks like blocking on async I/O as such opens up the
> > possibility of blocking indefinitely on failing writes if the right
> > combination of delwri and b_iodone handler is used (see
> > xfs_buf_iodone_callbacks()). I'm not sure if that's a real problem
> > today, though.
> 
> I don't think it is - we don't permanently rewrite metadata writes
> from IO completion anymore. We retry once, but then require higher
> layers to restart the IO again (e.g. the xfsaild). Hence we can't
> get stuck forever on async write errors.
> 

Ok, we have the XBF_WRITE_FAIL thing now. Forgot about that.
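For the archives, the retry-once behaviour looks roughly like this
(paraphrasing xfs_buf_iodone_callbacks() from memory, so treat the
details as approximate):

	/* async write failure: retry once, then defer to higher layers */
	if (XFS_BUF_ISASYNC(bp)) {
		xfs_buf_ioerror(bp, 0);		/* clear for the retry */
		if (!(bp->b_flags & (XBF_STALE | XBF_WRITE_FAIL))) {
			bp->b_flags |= XBF_WRITE | XBF_ASYNC |
				       XBF_DONE | XBF_WRITE_FAIL;
			xfs_buf_iorequest(bp);	/* the single retry */
		} else {
			/* second failure: release, let the xfsaild retry */
			xfs_buf_relse(bp);
		}
		return;
	}

So a blocked delwri waiter spans at most one internal resubmission
before the error is surfaced.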
Brian

> > > diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> > > index 4ba19bf..1e14452 100644
> > > --- a/fs/xfs/xfs_log_recover.c
> > > +++ b/fs/xfs/xfs_log_recover.c
> > > @@ -193,12 +193,8 @@ xlog_bread_noalign(
> > >  	bp->b_io_length = nbblks;
> > >  	bp->b_error = 0;
> > > 
> > > -	if (XFS_FORCED_SHUTDOWN(log->l_mp))
> > > -		return -EIO;
> > > -
> > > -	xfs_buf_iorequest(bp);
> > > -	error = xfs_buf_iowait(bp);
> > > -	if (error)
> > > +	error = xfs_buf_submit_wait(bp);
> > > +	if (error && !XFS_FORCED_SHUTDOWN(log->l_mp))
> > >  		xfs_buf_ioerror_alert(bp, __func__);
> > >  	return error;
> > >  }
> > > @@ -4427,16 +4423,12 @@ xlog_do_recover(
> > >  	XFS_BUF_UNASYNC(bp);
> > >  	bp->b_ops = &xfs_sb_buf_ops;
> > > 
> > > -	if (XFS_FORCED_SHUTDOWN(log->l_mp)) {
> > > -		xfs_buf_relse(bp);
> > > -		return -EIO;
> > > -	}
> > > -
> > > -	xfs_buf_iorequest(bp);
> > > -	error = xfs_buf_iowait(bp);
> > > +	error = xfs_buf_submit_wait(bp);
> > >  	if (error) {
> > > -		xfs_buf_ioerror_alert(bp, __func__);
> > > -		ASSERT(0);
> > > +		if (XFS_FORCED_SHUTDOWN(log->l_mp)) {
> > 
> > Should this be !XFS_FORCED_SHUTDOWN()?
> 
> Right, good catch, especially after reading all that code ;)
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx