On Tue, Aug 12, 2014 at 07:52:07AM +1000, Dave Chinner wrote:
> On Mon, Aug 11, 2014 at 09:20:57AM -0400, Brian Foster wrote:
> > On Sun, Aug 10, 2014 at 03:20:50PM +0300, Alex Lyakas wrote:
> > > On Wed, Aug 6, 2014 at 6:20 PM, Brian Foster <bfoster@xxxxxxxxxx> wrote:
> > > > On Wed, Aug 06, 2014 at 03:52:03PM +0300, Alex Lyakas wrote:
> .....
> > > >> But I believe, my analysis shows that during the mount sequence XFS does not
> > > >> wait properly for all the bios to complete, before failing the mount
> > > >> sequence back to the caller.
> > > >>
> > > >
> > > > As an experiment, what about the following? Compile tested only and not
> > > > safe for general use.
> ...
> > > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > > > index cd7b8ca..fbcf524 100644
> > > > --- a/fs/xfs/xfs_buf.c
> > > > +++ b/fs/xfs/xfs_buf.c
> > > > @@ -1409,19 +1409,27 @@ xfs_buf_iorequest(
> > > >   * case nothing will ever complete. It returns the I/O error code, if any, or
> > > >   * 0 if there was no error.
> > > >   */
> > > > -int
> > > > -xfs_buf_iowait(
> > > > -	xfs_buf_t		*bp)
> > > > +static int
> > > > +__xfs_buf_iowait(
> > > > +	struct xfs_buf		*bp,
> > > > +	bool			skip_error)
> > > >  {
> > > >  	trace_xfs_buf_iowait(bp, _RET_IP_);
> > > > 
> > > > -	if (!bp->b_error)
> > > > +	if (skip_error || !bp->b_error)
> > > >  		wait_for_completion(&bp->b_iowait);
> > > > 
> > > >  	trace_xfs_buf_iowait_done(bp, _RET_IP_);
> > > >  	return bp->b_error;
> > > >  }
> > > > 
> > > > +int
> > > > +xfs_buf_iowait(
> > > > +	struct xfs_buf		*bp)
> > > > +{
> > > > +	return __xfs_buf_iowait(bp, false);
> > > > +}
> > > > +
> > > >  xfs_caddr_t
> > > >  xfs_buf_offset(
> > > >  	xfs_buf_t		*bp,
> > > > @@ -1866,7 +1874,7 @@ xfs_buf_delwri_submit(
> > > >  		bp = list_first_entry(&io_list, struct xfs_buf, b_list);
> > > > 
> > > >  		list_del_init(&bp->b_list);
> > > > -		error2 = xfs_buf_iowait(bp);
> > > > +		error2 = __xfs_buf_iowait(bp, true);
> > > >  		xfs_buf_relse(bp);
> > > >  		if (!error)
> > > >  			error = error2;
> 
> Not waiting here on buffer error should not matter. Any buffer that
> is under IO and requires completion should be referenced, and that
> means it should be caught and waited on by xfs_wait_buftarg() in the
> mount failure path after log recovery fails.
> 

I think that assumes the I/O is successful. Looking through
xlog_recover_buffer_pass2() as an example, we read the buffer, which
should return with b_hold == 1. The delwri queue increments the hold and
we xfs_buf_relse() in the return path (i.e., the buffer is now held by
the delwri queue awaiting submission).

Sometime later we delwri_submit()... xfs_buf_iorequest() does an
xfs_buf_hold() and xfs_buf_rele() within that single function. The
delwri_submit() releases its hold after xfs_buf_iowait(), at which point
I guess bp should go back onto the LRU (b_hold back to 1 in
xfs_buf_rele()). Indeed, the caller has lost scope of the buffer at this
point.

So unless I miss something or got the lifecycle wrong here, which is
easily possible ;), this all hinges on xfs_buf_iowait(). That's where
the last hold forcing the buffer to stay around goes away.
xfs_buftarg_wait_rele() will dispose of the buffer if b_hold == 1. If
xfs_buf_iowait() is racy in the event of I/O errors via the bio
callback, I think this path is susceptible just the same.
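To make the window I'm describing concrete, here's a rough userspace
model of the ordering (not XFS code -- the names and the pthread
plumbing are made up for illustration). It only shows that a waiter
that skips the wait because it already sees an error can return while
the deferred completion is still pending, which is the same shape as
b_error being set from the bio callback before the workqueue iodone
processing signals b_iowait:

/*
 * NOT XFS code -- a minimal userspace model of the window described
 * above, with made-up names.  The "bio callback" records an error and
 * defers completion to a worker; the waiter sees the error, skips the
 * wait (like the pre-patch xfs_buf_iowait()), and so can return while
 * the deferred completion is still outstanding.
 *
 * Build with: gcc -pthread -o iowait-race iowait-race.c
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

struct fake_buf {
	int			b_error;	/* set by the "bio callback" */
	int			b_done;		/* set by the deferred "iodone" work */
	pthread_mutex_t		lock;
	pthread_cond_t		iowait;		/* stands in for b_iowait */
};

/* "bio callback" plus "workqueue": publish the error first, signal
 * completion only later. */
static void *bio_end_io(void *arg)
{
	struct fake_buf *bp = arg;

	pthread_mutex_lock(&bp->lock);
	bp->b_error = 5;			/* EIO, now visible to the waiter */
	pthread_mutex_unlock(&bp->lock);

	usleep(10000);				/* workqueue scheduling delay */

	pthread_mutex_lock(&bp->lock);
	bp->b_done = 1;				/* deferred "iodone" finally runs */
	pthread_cond_signal(&bp->iowait);
	pthread_mutex_unlock(&bp->lock);
	return NULL;
}

/* Modelled on the pre-patch xfs_buf_iowait(): don't wait if an error
 * has already been recorded. */
static int fake_buf_iowait(struct fake_buf *bp)
{
	int error;

	pthread_mutex_lock(&bp->lock);
	if (!bp->b_error)
		while (!bp->b_done)
			pthread_cond_wait(&bp->iowait, &bp->lock);
	error = bp->b_error;
	pthread_mutex_unlock(&bp->lock);
	return error;
}

int main(void)
{
	struct fake_buf buf = { 0 };
	pthread_t t;
	int error;

	pthread_mutex_init(&buf.lock, NULL);
	pthread_cond_init(&buf.iowait, NULL);

	pthread_create(&t, NULL, bio_end_io, &buf);
	usleep(1000);				/* let the error land before we "wait" */

	error = fake_buf_iowait(&buf);

	pthread_mutex_lock(&buf.lock);
	if (error && !buf.b_done)
		printf("iowait returned error %d with completion still pending\n",
		       error);
	pthread_mutex_unlock(&buf.lock);

	/*
	 * In the kernel, this is roughly the point where the caller drops
	 * the last hold and xfs_buftarg_wait_rele() can dispose of the
	 * buffer while the iodone side is still running.
	 */
	pthread_join(t, NULL);
	return 0;
}

Obviously this doesn't prove anything about the real code paths; it's
just the shape of the race the experiment patch was meant to close.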
> > > > ---
> > > I think that this patch fixes the problem. I tried reproducing it like
> > > 30 times, and it doesn't happen with this patch. Dropping this patch
> > > reproduces the problem within 1 or 2 tries. Thanks!
> > > What are next steps? How to make it "safe for general use"?
> > > 
> > 
> > Ok, thanks for testing. I think that implicates the caller bypassing the
> > expected blocking with the right sequence of log recovery I/Os and
> > device failure. TBH, I'd still like to see the specifics, if possible.
> > Could you come up with a generic reproducer for this? I think a metadump
> > of the fs with the dirty log plus whatever device failure simulation
> > hack you're using would suffice.
> 
> The real issue is we don't know exactly what code is being tested
> (it's 3.8 + random bug fix backports + custom code). Even if we have
> a reproducer, there's no guarantee it will reproduce on a current
> kernel. IOWs, we are stumbling around in the dark bashing our heads
> against everything in the room, and that just wastes everyone's
> time.
> 
> We need a reproducer that triggers on a current, unmodified
> kernel release. You can use dm-faulty to error out all writes just
> like you are doing with your custom code. See
> xfstests::tests/generic/321 and common/dmflakey for how to do this.
> Ideally the reproducer is in a form that xfstests can use....
> 
> If you can't reproduce it on an upstream kernel, then git bisect is
> your friend. It will find the commit that fixed the problem you are
> seeing....
> 

Ugh, yeah. The fact that this was customized as such apparently went
over my head. I agree completely. This needs to be genericized to a
pristine, preferably current kernel. The experiment patch could be
papering over something completely different.

> > The ideal fix is not yet clear to me.
> 
> We are not even that far along - the root cause of the bug is not at
> all clear to me. :/
> 

Yeah.. the above was just the theory that motivated the experiment in
the previously posted patch. It of course remains a theory until we can
see the race in action. I was referring to the potential fix for the
raciness of xfs_buf_iowait() with regard to bio errors and the wq
iodone handling, while still asking for a reproducer to confirm the
actual problem.

FWIW, I'm not too high on changes in the buf management code, even a
smallish behavior change, without a real trace of some sort that
documents the problem and justifies the change.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx