zeroed pages in pread result

Sage Weil <sweil@xxxxxxxxxx> · Tue, 10 Apr 2018 18:54:58 +0000 (UTC)

Hi everyone,

We're tracking down a hard-to-reproduce failure in Ceph BlueStore where 
rocksdb reads biggish chunks (e.g., 600k) and we get zeros in the 
resulting buffer, leading to a CRC failure and a crash.  The same data has 
been read several times before without problems, and after the crash the 
correct data is on disk as well, so it looks like a transient problem with 
the read result.

Our main questions are whether this is a known issue, and whether anyone 
with a better understanding of the O_DIRECT block device MM interactions 
here has any theories as to what might be going wrong...

Details below:

- Kernel version is 4.10.0-42-generic (Ubuntu 16.04)

- pread on block-aligned extent, reading into page-aligned buffer

- pread always returns the full number of bytes--it's not a short read.

- O_DIRECT

- Usually the zeroed bytes are at the end of the buffer (e.g., last 1/3), 
but not always--the most recent time we reproduced it there were 3 
distinct zeroed regions in the buffer.

- The zeroed regions are always 4k aligned.

- We always have several other threads (3-5) also doing similar reads at 
different offsets of the same file/device.  AFAICS they are 
non-overlapping extents.
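
For anyone who wants to look for the same pattern, here is a minimal sketch 
of how the zeroed regions in a returned buffer could be located.  It is not 
the BlueStore code: the names PAGE_SZ, page_is_zero and report_zero_pages 
are made up for illustration, and the 4096-byte granularity is assumed from 
the alignment noted above.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define PAGE_SZ 4096u   /* assumed granularity: the zeroed regions are 4k aligned */

/* Return 1 if the PAGE_SZ bytes starting at p are all zero. */
static int page_is_zero(const unsigned char *p)
{
    /* first byte is zero and every byte equals its successor => all zero */
    return p[0] == 0 && memcmp(p, p + 1, PAGE_SZ - 1) == 0;
}

/*
 * Print every run of all-zero pages in buf and return the number of
 * all-zero pages.  len is assumed to be a multiple of PAGE_SZ.
 */
static size_t report_zero_pages(const unsigned char *buf, size_t len, FILE *out)
{
    size_t npages = len / PAGE_SZ;
    size_t zero = 0, run_start = 0;
    int in_run = 0;

    /* iterate one past the end so a run touching the last page is flushed */
    for (size_t i = 0; i <= npages; i++) {
        int z = i < npages && page_is_zero(buf + i * PAGE_SZ);

        if (z && !in_run) {
            run_start = i;
            in_run = 1;
        }
        if (!z && in_run) {
            fprintf(out, "zero run: pages %zu..%zu (offset 0x%zx, %zu bytes)\n",
                    run_start, i - 1, run_start * PAGE_SZ,
                    (i - run_start) * PAGE_SZ);
            in_run = 0;
        }
        zero += (size_t)z;
    }
    return zero;
}
```

Note that an all-zero page is only suspicious when the on-disk data at that 
offset is known to be non-zero, so this is meant to be run on a buffer that 
has already failed its CRC check.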

The one other curious thing is that we tried doing a memset on the buffer 
with a non-zero value before the read, to see whether pread was skipping 
the pages or filling them with zeros...and we weren't able to reproduce 
the failure.  It's a bit hard to trigger at baseline (it takes anywhere 
from hours to days) so we may not have waited long enough.  We're kicking 
off another run with memset to try again.
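
The memset experiment can be sketched roughly as follows.  This is an 
illustration rather than the actual test code: the POISON value, the 
read_check struct and the poisoned_pread helper are hypothetical names, and 
the caller is assumed to have opened fd itself (e.g. with 
O_RDONLY | O_DIRECT) and to pass a block-aligned offset and length.

```c
#define _XOPEN_SOURCE 700   /* for posix_memalign() and pread() */
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define PAGE_SZ 4096u   /* assumed page size */
#define POISON  0xA5    /* arbitrary non-zero fill value */

struct read_check {
    ssize_t nread;      /* pread() return value, or -1 if allocation failed */
    size_t  untouched;  /* pages still all POISON: the read never wrote them */
    size_t  zeroed;     /* pages that are all zero: something wrote zeros */
};

/* Return 1 if the PAGE_SZ bytes starting at p all equal v. */
static int page_filled_with(const unsigned char *p, unsigned char v)
{
    return p[0] == v && memcmp(p, p + 1, PAGE_SZ - 1) == 0;
}

/*
 * Read len bytes at offset off into a freshly poisoned, page-aligned
 * buffer, then classify each page of the buffer.
 */
static struct read_check poisoned_pread(int fd, off_t off, size_t len)
{
    struct read_check rc = { -1, 0, 0 };
    void *mem = NULL;

    if (posix_memalign(&mem, PAGE_SZ, len) != 0)
        return rc;

    unsigned char *buf = mem;
    memset(buf, POISON, len);            /* poison before the read */
    rc.nread = pread(fd, buf, len, off);

    for (size_t o = 0; o + PAGE_SZ <= len; o += PAGE_SZ) {
        rc.untouched += (size_t)page_filled_with(buf + o, POISON);
        rc.zeroed    += (size_t)page_filled_with(buf + o, 0);
    }
    free(buf);
    return rc;
}
```

If a failing read comes back with untouched > 0, the pages were skipped; if 
it comes back with zeroed > 0 for an extent that is non-zero on disk, they 
were overwritten with zeros.  As noted above, the extra memset also touches 
every page before the read, which may itself change the timing enough to 
hide the problem.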

Any theories or suggestions?

Thanks!
sage