On Tue, 10 Apr 2018, Jeff Moyer wrote:
> Sage Weil <sweil@xxxxxxxxxx> writes:
>
> > Hi everyone,
> >
> > We're tracking down a hard-to-reproduce failure in Ceph BlueStore where
> > rocksdb is reading biggish chunks (e.g., 600k) and we're getting zeros in
> > the resulting buffer, leading to a CRC failure and crash.  The data has
> > been read several times before without problems, and after the crash the
> > correct data is on disk as well--it is a transient problem with the read
> > result.
> >
> > Our main questions are whether this is a known issue, and whether anyone
> > with a better understanding of the O_DIRECT block device MM interactions
> > here has any theories as to what might be going wrong.
> >
> > Details below:
> >
> > - Kernel version is 4.10.0-42-generic (ubuntu 16.04)
> >
> > - pread on a block-aligned extent, reading into a page-aligned buffer
> >
> > - pread always returns the full number of bytes--it's not a short read.
> >
> > - O_DIRECT
>
> Is there any other I/O going on to these files?  Are there concurrent
> readers and writers?  Is there a mix of buffered and direct I/O?  Does
> the code that performs the I/O fork()?

Several (3-5) other threads are doing concurrent O_DIRECT reads from the
same fd at different offsets.  No writers to this region of the device
(but there are O_DIRECT writes going on elsewhere).  I've already
verified they aren't touching these ranges (and if they did, we would
expect to see the on-disk state change, but after the failure the
device's data is all correct).

All IO is O_DIRECT--nothing buffered (at least not in this process).

No fork(), just multiple pthreads.

sage

> -Jeff
>
> > - Usually the zeroed bytes are at the end of the buffer (e.g., the last
> >   1/3), but not always--the most recent time we reproduced it there were
> >   3 distinct zeroed regions in the buffer.
> >
> > - The zeroed regions are always 4k aligned.
> > - We always have several other threads (3-5) also doing similar reads
> >   at different offsets of the same file/device.  AFAICS they are
> >   non-overlapping extents.
> >
> > The one other curious thing is that we tried doing a memset on the
> > buffer with a non-zero value before the read, to see whether pread was
> > skipping the pages or filling them with zeros...and weren't able to
> > reproduce the failure.  It's a bit hard to trigger at baseline (it takes
> > anywhere from hours to days), so we may not have waited long enough.
> > We're kicking off another run with memset to try again.
> >
> > Any theories or suggestions?
> >
> > Thanks!
> > sage
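For anyone wanting to try the memset-poison trick described above, here is a
minimal standalone sketch (not the actual BlueStore code; the file name,
function name, and sentinel values are made up for illustration).  The idea is
to fill the destination buffer with a sentinel byte that is neither 0x00 nor
plausible data before issuing the O_DIRECT pread, then scan each 4k page
afterwards: a zeroed page means the kernel actively wrote zeros, while a page
still holding the sentinel means the read skipped it entirely.

```c
/* Poison-then-read sketch: distinguish "kernel zero-filled the page"
 * from "kernel never touched the page" after an O_DIRECT pread.
 * Hypothetical example code, not taken from BlueStore. */
#define _GNU_SOURCE        /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define READ_LEN  (600 * 1024)   /* ~600k, like the failing reads */
#define PAGE_SZ   4096

/* Returns the number of suspicious 4k pages seen in one poisoned read
 * cycle against a freshly written scratch file, or -1 on setup error. */
int run_check(const char *path)
{
    unsigned char *buf;
    int bad = 0;

    /* O_DIRECT requires an aligned buffer; page alignment is safe. */
    if (posix_memalign((void **)&buf, PAGE_SZ, READ_LEN) != 0)
        return -1;

    /* Fill a scratch file with 0xAB so correct data is never zero. */
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) { free(buf); return -1; }
    memset(buf, 0xAB, READ_LEN);
    if (pwrite(fd, buf, READ_LEN, 0) != (ssize_t)READ_LEN) {
        close(fd); free(buf); return -1;
    }
    fsync(fd);
    close(fd);

    /* Reopen with O_DIRECT; fall back to a buffered open on
     * filesystems (e.g. tmpfs) that reject the flag. */
    fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0)
        fd = open(path, O_RDONLY);
    if (fd < 0) { free(buf); return -1; }

    /* Poison with a sentinel distinct from 0x00 and the file data. */
    memset(buf, 0x5C, READ_LEN);

    if (pread(fd, buf, READ_LEN, 0) != (ssize_t)READ_LEN) {
        close(fd); free(buf); return -1;   /* report says never short */
    }

    /* Scan the first byte of each 4k page for the two bad cases. */
    for (off_t off = 0; off < READ_LEN; off += PAGE_SZ) {
        if (buf[off] == 0x00) {
            printf("page at %lld: zero-filled by the read\n",
                   (long long)off);
            bad++;
        } else if (buf[off] == 0x5C) {
            printf("page at %lld: skipped (still holds poison)\n",
                   (long long)off);
            bad++;
        }
    }

    close(fd);
    free(buf);
    unlink(path);
    return bad;
}
```

On a healthy kernel/device this returns 0; in the failure described above it
would report zero-filled pages, and the poison byte tells you whether the
zeros were written or the pages were merely left untouched.  To catch the
transient bug itself you would of course need to run this in a loop, from
several threads, for hours to days.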