Re: Silent data corruption in blkdev_direct_IO()

On Thu, 2018-07-12 at 10:42 -0600, Jens Axboe wrote:
> On 7/12/18 10:20 AM, Jens Axboe wrote:
> > On 7/12/18 10:14 AM, Hannes Reinecke wrote:
> > > On 07/12/2018 05:08 PM, Jens Axboe wrote:
> > > > On 7/12/18 8:36 AM, Hannes Reinecke wrote:
> > > > > Hi Jens, Christoph,
> > > > > 
> > > > > we're currently hunting down a silent data corruption occurring
> > > > > due to commit 72ecad22d9f1 ("block: support a full bio worth of
> > > > > IO for simplified bdev direct-io").
> > > > > 
> > > > > While the whole thing is still hazy on the details, the one
> > > > > thing we've found is that reverting that patch fixes the data
> > > > > corruption.
> > > > > 
> > > > > And looking closer, I've found this:
> > > > > 
> > > > > static ssize_t
> > > > > blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
> > > > > {
> > > > > 	int nr_pages;
> > > > > 
> > > > > 	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
> > > > > 	if (!nr_pages)
> > > > > 		return 0;
> > > > > 	if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
> > > > > 		return __blkdev_direct_IO_simple(iocb, iter, nr_pages);
> > > > > 
> > > > > 	return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));
> > > > > }
> > > > > 
> > > > > When checking the call path
> > > > > __blkdev_direct_IO()->bio_alloc_bioset()->bvec_alloc()
> > > > > I found that bvec_alloc() will fail if nr_pages > BIO_MAX_PAGES.
> > > > > 
> > > > > So why is there the check for 'nr_pages <= BIO_MAX_PAGES'?
> > > > > It's not as if we could handle it in __blkdev_direct_IO() ...
> > > > 
> > > > The logic could be cleaned up like below; the sync part is really
> > > > all we care about. What is the test case for this? async or sync?
> > > > 
> > > > I also don't remember why it's BIO_MAX_PAGES + 1...
> > > > 
> > > > diff --git a/fs/block_dev.c b/fs/block_dev.c
> > > > index 0dd87aaeb39a..14ef3d71b55f 100644
> > > > --- a/fs/block_dev.c
> > > > +++ b/fs/block_dev.c
> > > > @@ -424,13 +424,13 @@ blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
> > > >   {
> > > >   	int nr_pages;
> > > >   
> > > > -	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
> > > > +	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES);
> > > >   	if (!nr_pages)
> > > >   		return 0;
> > > > -	if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
> > > > +	if (is_sync_kiocb(iocb))
> > > >   		return __blkdev_direct_IO_simple(iocb, iter, nr_pages);
> > > >   
> > > > -	return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));
> > > > +	return __blkdev_direct_IO(iocb, iter, nr_pages);
> > > >   }
> > > >   
> > > >   static __init int blkdev_init(void)
> > > > 
> > > 
> > > Hmm. We'll give it a go, but somehow I feel this won't solve our
> > > problem.
> > 
> > It probably won't; the only joker here is the BIO_MAX_PAGES + 1. But
> > it does simplify that part...
> 
> OK, now I remember. The +1 is just to check whether there are actually
> more pages. __blkdev_direct_IO_simple() only does one bio, so it has to
> fit within that one bio. __blkdev_direct_IO() will loop just fine and
> will finish any size, BIO_MAX_PAGES at a time.

Right. Hannes, I think we (or at least I) have been confused by looking
at outdated code. The key point I missed is that __blkdev_direct_IO()
is called with min(nr_pages, BIO_MAX_PAGES) and then advances beyond
that later in its loop.
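
For reference, this is roughly what that loop does (heavily simplified
paraphrase, not the literal code; error handling, plugging and
completion bookkeeping are left out):

	for (;;) {
		/* consume up to nr_pages pages from the iterator */
		ret = bio_iov_iter_get_pages(bio, iter);
		if (unlikely(ret))
			break;

		/* how much is still left in the iterator? */
		nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES);
		if (!nr_pages) {
			submit_bio(bio);	/* last bio, we're done */
			break;
		}

		/* more to come: submit this bio and allocate the next,
		 * again capped at BIO_MAX_PAGES */
		submit_bio(bio);
		bio = bio_alloc(GFP_KERNEL, nr_pages);
	}

So the BIO_MAX_PAGES cap only limits the size of each individual bio,
not the total amount of data transferred.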

> Hence the patch I sent is wrong; the code actually looks fine. Which
> means we're back to trying to figure out what is going on here. It'd
> be great to have a test case...

Unfortunately we have no reproducer just yet. Only the customer can
reproduce it. The scenario is a database running on a KVM virtual
machine on top of a virtio-scsi volume backed by a multipath map, with
cache='none' in qemu.

My latest thinking is that if, for some reason I don't yet understand,
blkdev_direct_IO() resulted in a short write, __generic_file_write_iter()
would fall back to buffered writing, which might be a possible
explanation for the data corruption we observe. That's just speculation
at this stage.
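
For illustration, the relevant piece of __generic_file_write_iter()
looks roughly like this (simplified paraphrase from my reading of
mm/filemap.c, not the literal code):

	if (iocb->ki_flags & IOCB_DIRECT) {
		written = generic_file_direct_write(iocb, from);
		if (written < 0 || !iov_iter_count(from))
			goto out;	/* error, or everything written */

		/*
		 * The direct write came up short: the remainder of the
		 * iterator is written through the page cache instead.
		 */
		status = generic_perform_write(file, from, iocb->ki_pos);
		...
	}

So a short return from blkdev_direct_IO() would not be visible to the
caller as such; the remaining data would quietly go through the page
cache instead.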

Regards
Martin

-- 
Dr. Martin Wilck <mwilck@xxxxxxxx>, Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)



