Re: zeroed pages in pread result

Jeff Moyer <jmoyer@xxxxxxxxxx> · Tue, 10 Apr 2018 18:04:20 -0400

Jeff Moyer <jmoyer@xxxxxxxxxx> writes:

> Sage Weil <sweil@xxxxxxxxxx> writes:
>
>> Hi everyone,
>>
>> We're tracking down a hard to reproduce failure in Ceph BlueStore where 
>> rocksdb is reading biggish chunks (e.g., 600k) and we're getting zeros in 
>> the resulting buffer, leading to a CRC failure and crash.  The data has 
>> been read several times before without problems, and after the crash the 
>> correct data is on disk as well--it is a transient problem with the read 
>> result.
>>
>> Our main questions are if this is a known issue, or if anyone with a 
>> better understanding of the O_DIRECT block device MM interactions here has 
>> any theories as to what might be going wrong...
>>
>> Details below:
>>
>> - Kernel version is 4.10.0-42-generic (Ubuntu 16.04)
>>
>> - pread on block-aligned extent, reading into page-aligned buffer
>>
>> - pread always returns the full number of bytes--it's not a short read.
>>
>> - O_DIRECT
>
> Is there any other I/O going on to these files?  Are there concurrent
> readers and writers?  Is there a mix of buffered and direct I/O?  Does
> the code that performs the I/O fork()?
>
> -Jeff
>
>>
>> - Usually the zeroed bytes are at the end of the buffer (e.g., last 1/3), 
>> but not always--the most recent time we reproduced it there were 3 
>> distinct zeroed regions in the buffer.
>>
>> - The zeroed regions are always 4k aligned.
>>
>> - We always have several other threads (3-5) also doing similar reads at 
>> different offsets of the same file/device.  AFAICS they are 
>> non-overlapping extents.
>>
>> The one other curious thing is that we tried doing a memset on the buffer 
>> with a non-zero value before the read to see whether pread was skipping 
>> the pages or filling them with zeros...and weren't able to reproduce the 
>> failure.  It's a bit hard to trigger at baseline (it takes anywhere from 
>> hours to days) so we may not have waited long enough.  We're kicking off 
>> another run with memset to try again.

In fact, this may jibe with the fork() theory...maybe.

See open(2):
       O_DIRECT I/Os should never be run concurrently with the fork(2)
       system call, if the memory buffer is a private mapping (i.e., any
       mapping created with the mmap(2) MAP_PRIVATE flag; this
       includes memory allocated on the heap and statically allocated
       buffers).  Any such I/Os, whether submitted via an asynchronous
       I/O interface or from another thread in the process, should be
       completed before fork(2) is called.  Failure to do so can result
       in data corruption and undefined behavior in parent and child
       processes.  This restriction does not apply when the memory
       buffer for the O_DIRECT I/Os was created using shmat(2) or
       mmap(2) with the MAP_SHARED flag.  Nor does this restriction
       apply when the memory buffer has been advised as MADV_DONTFORK
       with madvise(2), ensuring that it will not be available to the
       child after fork(2).

and from madvise(2):
       MADV_DONTFORK (Since Linux 2.6.16) Do not make the pages in this
       range available to the child after a fork(2).  This is useful to
       prevent copy-on-write semantics from changing the physical
       location of a page(s) if the parent writes to it after a fork(2).
       (Such page relocations cause problems for hardware that DMAs into
       the page(s).)

Try calling madvise with MADV_DONTFORK on your buffers and see if the
problem goes away.
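
Something like the following untested sketch is what I have in mind.
The helper names (dio_buffer_alloc, dio_buffer_free) are made up for
illustration, and I'm assuming the read buffers can come from an
anonymous mmap instead of the heap. That keeps the madvise range
page-aligned and avoids handing DONTFORK pages back to the allocator
when the buffer is released:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from BlueStore or rocksdb): allocate a
 * page-aligned, private anonymous buffer for O_DIRECT reads and mark
 * it MADV_DONTFORK, so the range is simply not mapped in any child
 * created by fork(2) and is never subject to copy-on-write.
 * Returns NULL on failure with errno set.
 */
static void *dio_buffer_alloc(size_t len)
{
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return NULL;

    /* mmap returns page-aligned memory, as madvise requires. */
    if (madvise(buf, len, MADV_DONTFORK) != 0) {
        int saved = errno;
        munmap(buf, len);
        errno = saved;
        return NULL;
    }
    return buf;
}

/* Release a buffer obtained from dio_buffer_alloc(). */
static void dio_buffer_free(void *buf, size_t len)
{
    if (buf != NULL)
        munmap(buf, len);
}
```

You would then pass the returned pointer straight to pread() on the
O_DIRECT fd.  Note that a child process must not touch these
addresses at all, since the mapping will not exist there.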

Cheers,
Jeff