Sage Weil <sweil@xxxxxxxxxx> writes:

> Hi everyone,
>
> We're tracking down a hard-to-reproduce failure in Ceph BlueStore where
> rocksdb is reading biggish chunks (e.g., 600k) and we're getting zeros in
> the resulting buffer, leading to a CRC failure and crash. The data has
> been read several times before without problems, and after the crash the
> correct data is on disk as well--it is a transient problem with the read
> result.
>
> Our main questions are whether this is a known issue, or whether anyone
> with a better understanding of the O_DIRECT block device MM interactions
> here has any theories as to what might be going wrong...
>
> Details below:
>
> - Kernel version is 4.10.0-42-generic (ubuntu 16.04)
>
> - pread on block-aligned extent, reading into page-aligned buffer
>
> - pread always returns the full number of bytes--it's not a short read.
>
> - O_DIRECT

Is there any other I/O going on to these files? Are there concurrent
readers and writers? Is there a mix of buffered and direct I/O? Does
the code that performs the I/O fork()?

-Jeff

> - Usually the zeroed bytes are at the end of the buffer (e.g., last 1/3),
>   but not always--the most recent time we reproduced it there were 3
>   distinct zeroed regions in the buffer.
>
> - The zeroed regions are always 4k aligned.
>
> - We always have several other threads (3-5) also doing similar reads at
>   different offsets of the same file/device. AFAICS they are
>   non-overlapping extents.
>
> The one other curious thing is that we tried doing a memset on the buffer
> with a non-zero value before the read to see whether pread was skipping
> the pages or filling them with zeros...and weren't able to reproduce the
> failure. It's a bit hard to trigger at baseline (it takes anywhere from
> hours to days) so we may not have waited long enough. We're kicking off
> another run with memset to try again.
>
> Any theories or suggestions?
>
> Thanks!
> sage
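
For anyone who wants to try the memset experiment described above, here is a minimal sketch in Python. It is not the BlueStore or rocksdb code. The names poisoned_direct_read and classify_pages, the 0xA5 fill byte, and the 4096-byte page size are all assumptions made for illustration. The idea is to fill a page-aligned buffer with a non-zero pattern, do one O_DIRECT pread into it, and then check each 4k page: a page that is all zeros was written as zeros by the read, while a page that still holds the fill pattern was never written at all.

```python
import mmap
import os

PAGE = 4096     # assumed page size and alignment unit
POISON = 0xA5   # arbitrary non-zero fill byte (an assumption, any value works)


def classify_pages(buf, page=PAGE, poison=POISON):
    """Return (zeroed, untouched): lists of page-aligned byte offsets.

    zeroed    -> pages that are entirely 0x00 after the read
    untouched -> pages that still hold the poison pattern (never written)
    A trailing partial page, if any, is ignored.
    """
    zero_page = bytes(page)
    poison_page = bytes([poison]) * page
    zeroed, untouched = [], []
    for off in range(0, len(buf) - page + 1, page):
        chunk = bytes(buf[off:off + page])
        if chunk == zero_page:
            zeroed.append(off)
        elif chunk == poison_page:
            untouched.append(off)
    return zeroed, untouched


def poisoned_direct_read(path, offset, length, page=PAGE, poison=POISON):
    """Poison a page-aligned buffer, then pread into it with O_DIRECT.

    Linux only. Returns (buffer, bytes_read).
    """
    if offset % page or length % page:
        raise ValueError("offset and length must be multiples of the page size")
    buf = mmap.mmap(-1, length)           # anonymous mapping is page-aligned
    buf[:] = bytes([poison]) * length     # the memset step
    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    try:
        got = os.preadv(fd, [buf], offset)
    finally:
        os.close(fd)
    return buf, got
```

A caller would run something like `buf, got = poisoned_direct_read(dev, off, 600 * 1024)` followed by `classify_pages(buf)`. Note that a page whose real on-disk contents are all zeros, or happen to match the fill byte, would be misreported, so any hit should be checked against a second read of the same extent.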