On Tue, 10 Apr 2018, Jeff Moyer wrote:
> Sage Weil <sweil@xxxxxxxxxx> writes:
>
> > Hi everyone,
> >
> > We're tracking down a hard-to-reproduce failure in Ceph BlueStore where
> > rocksdb is reading biggish chunks (e.g., 600k) and we're getting zeros in
> > the resulting buffer, leading to a CRC failure and crash.  The data has
> > been read several times before without problems, and after the crash the
> > correct data is on disk as well--it is a transient problem with the read
> > result.
> >
> > Our main questions are whether this is a known issue, and whether anyone
> > with a better understanding of the O_DIRECT block device MM interactions
> > here has any theories as to what might be going wrong.
> >
> > Details below:
> >
> > - Kernel version is 4.10.0-42-generic (ubuntu 16.04)
> >
> > - pread on a block-aligned extent, reading into a page-aligned buffer
> >
> > - pread always returns the full number of bytes--it's not a short read.
> >
> > - O_DIRECT
>
> Is there any other I/O going on to these files?  Are there concurrent
> readers and writers?  Is there a mix of buffered and direct I/O?  Does
> the code that performs the I/O fork()?

Several (3-5) other threads are doing concurrent O_DIRECT reads from the
same fd at different offsets.  No writers to this region of the device
(but there are O_DIRECT writes going on elsewhere).  I've already
verified they aren't touching these ranges (and if they did, we would
expect to see the on-disk state change, but after the failure the
device's data is all correct).

All IO is O_DIRECT--nothing buffered (at least not in this process).

No fork(), just multiple pthreads.

sage

> -Jeff
>
> > - Usually the zeroed bytes are at the end of the buffer (e.g., the last
> >   1/3), but not always--the most recent time we reproduced it there were
> >   3 distinct zeroed regions in the buffer.
> >
> > - The zeroed regions are always 4k aligned.
> > - We always have several other threads (3-5) also doing similar reads
> >   at different offsets of the same file/device.  AFAICS they are
> >   non-overlapping extents.
> >
> > The one other curious thing is that we tried doing a memset on the
> > buffer with a non-zero value before the read, to see whether pread was
> > skipping the pages or filling them with zeros...and weren't able to
> > reproduce the failure.  It's a bit hard to trigger at baseline (it takes
> > anywhere from hours to days), so we may not have waited long enough.
> > We're kicking off another run with memset to try again.
> >
> > Any theories or suggestions?
> >
> > Thanks!
> > sage
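For anyone wanting to try the memset-poison trick described above, here is a
minimal standalone sketch (not the actual BlueStore code; the file name,
function name, and sentinel values are made up for illustration).  The idea is
to fill the destination buffer with a sentinel byte that is neither 0x00 nor
plausible data before issuing the O_DIRECT pread, then scan each 4k page
afterwards: a zeroed page means the kernel actively wrote zeros, while a page
still holding the sentinel means the read skipped it entirely.

```c
/* Poison-then-read sketch: distinguish "kernel zero-filled the page"
 * from "kernel never touched the page" after an O_DIRECT pread.
 * Hypothetical example code, not taken from BlueStore. */
#define _GNU_SOURCE        /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define READ_LEN  (600 * 1024)   /* ~600k, like the failing reads */
#define PAGE_SZ   4096

/* Returns the number of suspicious 4k pages seen in one poisoned read
 * cycle against a freshly written scratch file, or -1 on setup error. */
int run_check(const char *path)
{
    unsigned char *buf;
    int bad = 0;

    /* O_DIRECT requires an aligned buffer; page alignment is safe. */
    if (posix_memalign((void **)&buf, PAGE_SZ, READ_LEN) != 0)
        return -1;

    /* Fill a scratch file with 0xAB so correct data is never zero. */
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) { free(buf); return -1; }
    memset(buf, 0xAB, READ_LEN);
    if (pwrite(fd, buf, READ_LEN, 0) != (ssize_t)READ_LEN) {
        close(fd); free(buf); return -1;
    }
    fsync(fd);
    close(fd);

    /* Reopen with O_DIRECT; fall back to a buffered open on
     * filesystems (e.g. tmpfs) that reject the flag. */
    fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0)
        fd = open(path, O_RDONLY);
    if (fd < 0) { free(buf); return -1; }

    /* Poison with a sentinel distinct from 0x00 and the file data. */
    memset(buf, 0x5C, READ_LEN);

    if (pread(fd, buf, READ_LEN, 0) != (ssize_t)READ_LEN) {
        close(fd); free(buf); return -1;   /* report says never short */
    }

    /* Scan the first byte of each 4k page for the two bad cases. */
    for (off_t off = 0; off < READ_LEN; off += PAGE_SZ) {
        if (buf[off] == 0x00) {
            printf("page at %lld: zero-filled by the read\n",
                   (long long)off);
            bad++;
        } else if (buf[off] == 0x5C) {
            printf("page at %lld: skipped (still holds poison)\n",
                   (long long)off);
            bad++;
        }
    }

    close(fd);
    free(buf);
    unlink(path);
    return bad;
}
```

On a healthy kernel/device this returns 0; in the failure described above it
would report zero-filled pages, and the poison byte tells you whether the
zeros were written or the pages were merely left untouched.  To catch the
transient bug itself you would of course need to run this in a loop, from
several threads, for hours to days.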