Re: [PATCH] generic: test race between block map change and writeback

Eryu Guan <eguan@xxxxxxxxxx> · Wed, 11 Oct 2017 18:33:43 +0800

On Tue, Oct 10, 2017 at 06:56:22AM -0400, Brian Foster wrote:
> On Tue, Oct 10, 2017 at 04:24:59PM +1100, Dave Chinner wrote:
> > On Tue, Oct 10, 2017 at 12:36:49PM +0800, Eryu Guan wrote:
> > > On Mon, Oct 09, 2017 at 12:12:55PM -0400, Brian Foster wrote:
> > > > On Thu, Aug 31, 2017 at 12:02:37PM +0800, Eryu Guan wrote:
> > > > > Run delalloc writes & append writes & non-data-integrity syncs
> > > > > concurrently to test the race between block map change vs writeback.
> > > > > 
> > > > > This is to cover an XFS bug where data could be written to the wrong
> > > > > block and delayed-allocation blocks leaked, because the block map
> > > > > was changed by the removal of speculatively allocated eofblocks
> > > > > while writeback was in progress.
> > > > > 
> > > > > This test partially mimics what the lustre-racer[1] test does; that
> > > > > test is how this bug was first found.
> > > > > 
> > > > > [1] https://git.hpdd.intel.com/?p=fs/lustre-release.git;a=tree;f=lustre/tests/racer;hb=HEAD
> > > > > 
> > > > > Signed-off-by: Eryu Guan <eguan@xxxxxxxxxx>
> > > > > ---
> > > > > 
> > > > > This may not reproduce the bug on all hosts, but it does reproduce the XFS
> > > > > corruption issue reliably on my different test hosts.
> > > > > 
> > > > 
> > > > Was this problem fixed already or are we still waiting on a fix?
> > > 
> > > It's still an unfixed problem. Dave provided a test patch (which did fix
> > > the bug for me)
> > 
> > The test patch I provided broke the COW writeback path, primarily
> > because it's a separate mapping path and the change I made doesn't
> > work at all well with it....
> > 
> > > then Christoph suggested a fix based on seqlock, and
> > > things stalled there.
> > 
> > I had a look at doing that and got stalled on the fact that, again,
> > the COW writeback is completely separate to the existing block
> > mapping during writeback path and so applying a seqlock algorithm is
> > pretty difficult.
> > 
> > Basically, to fix the problem, we first need to merge the COW and
> > delalloc paths in the writepage code and then we'll have a sane base
> > on which to apply a proper fix...
> > 
> > (we need to do this to get rid of the bufferhead dependency, anyway)
> > 
> > > (I'm happy to pick up the work, but I'm not that
> > > familiar with all the allocation paths that could change the extent map,
> > > so I may need some guidance and time to play with it.)
> > 
> > There's some black magic in amongst it all. I'll spend some time on
> > it again over the next week and see what I come up with...
> > 
> 
> Hmm, is this[1] the test patch/thread associated with this test case? If
> so, I'm still wondering why we can't just trim the mapping to eof like
> the previous code had effectively done for so long..? Eryu, does the
> appended diff address this test case?

Yes, the appended patch fixed my test failure; it survived 20+
iterations for me.

Thanks,
Eryu