Re: zeroed pages in pread result

Sage Weil <sweil@xxxxxxxxxx> · Tue, 10 Apr 2018 20:59:59 +0000 (UTC)

On Tue, 10 Apr 2018, Boaz Harrosh wrote:
> On 10/04/18 21:54, Sage Weil wrote:
> > Hi everyone,
> > 
> > We're tracking down a hard to reproduce failure in Ceph BlueStore where 
> > rocksdb is reading biggish chunks (e.g., 600k) and we're getting zeros in 
> > the resulting buffer, leading to a CRC failure and crash.  The data has 
> > been read several times before without problems, and after the crash the 
> > correct data is on disk as well--it is a transient problem with the read 
> > result.
> > 
> > Our main questions are if this is a known issue, or if anyone with a 
> > better understanding of the O_DIRECT block device MM interactions here has 
> > any theories as to what might be going wrong...
> > 
> > Details below:
> > 
> > - Kernel version is 4.10.0-42-generic (ubuntu 16.04)
> > 
> > - pread on block-aligned extent, reading into page-aligned buffer
> > 
> > - pread always returns the full number of bytes--it's not a short read.
> > 
> > - O_DIRECT
> > 
> > - Usually the zeroed bytes are at the end of the buffer (e.g., last 1/3), 
> > but not always--the most recent time we reproduced it there were 3 
> > distinct zeroed regions in the buffer.
> > 
> > - The zeroed regions are always 4k aligned.
> > 
> > - We always have several other threads (3-5) also doing similar reads at 
> > different offsets of the same file/device.  AFAICS they are 
> > non-overlapping extents.
> > 
> > The one other curious thing is that we tried doing a memset on the buffer 
> > with a non-zero value before the read to see whether pread was skipping 
> > the pages or filling them with zeros...and weren't able to reproduce the 
> > failure.  It's a bit hard to trigger at baseline (it takes anywhere from 
> > hours to days) so we may not have waited long enough.  We're kicking off 
> > another run with memset to try again.
> > 
> > Any theories or suggestions?
> > 
> 
> Hi Sage
> 
> What is the hardware here? which CPU / platform? and what is the block-device
> and driver?
> (Is it always directly from a block-device, or also filesystem over the same device)

The block device in this case is a partition (sdc4) on a raw SCSI device,

[    6.794313] scsi 12:1:0:0: Direct-Access     LSI      NWD-BLP4-1600    0002 PQ: 0 ANSI: 6
[    6.795068] mpt2sas_cm1: WarpDrive : Direct IO is Enabled for the drive with handle(0x0024)
[    6.795069] scsi 12:1:0:0: Set queue's max_sector to: 8192
[    6.826336] sd 12:1:0:0: Attached scsi generic sg3 type 0
[    6.826532] sd 12:1:0:0: [sdc] 3124999680 512-byte logical blocks: (1.60 TB/1.46 TiB)
[    6.826566] sd 12:1:0:0: [sdc] Write Protect is off
[    6.826568] sd 12:1:0:0: [sdc] Mode Sense: 03 00 00 08
[    6.826583] scsi 5:0:0:0: Direct-Access     ATA      ST3250823AS      3.03 PQ: 0 ANSI: 5
[    6.826597] sd 12:1:0:0: [sdc] No Caching mode page found
[    6.826600] sd 12:1:0:0: [sdc] Assuming drive cache: write through
[    6.827906]  sdc: sdc1 sdc2 sdc3 sdc4
[    6.828332] sd 12:1:0:0: [sdc] Attached SCSI disk

This particular box is a 'AMD Ryzen 7 1700 Eight-Core Processor' (HT 
enabled).

> And did you try 4.15 / 4.16 compiled Kernel as well?

Not yet.

> Did you rule out a hardware failure, reproduced on another machine?

A handful of other users have reported the same failure in the wild.  
This is the first time we've been able to reproduce it.  Still 
checking to see what other machines and kernel versions are affected.  
I'm assuming for now that it is the same root cause (the symptom is just a 
failed crc check).

We're checking on the swapped pages theory, and working to reproduce on 
another box.
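
For anyone who wants to try the same experiment, here is a rough sketch of 
the memset-before-read check described above.  It is not the BlueStore 
code, just an illustration: the names (classify_pages, poisoned_pread) and 
the 0xA5 fill byte are made up for this example.  The idea is to poison 
the buffer, issue the pread, and then look at each 4k page to see whether 
it came back all zeros (something wrote zeros) or still holds the poison 
(the page was skipped).

```python
import mmap
import os

PAGE = 4096     # the zeroed regions we see are 4k aligned
POISON = 0xA5   # assumption: any non-zero fill byte would do


def classify_pages(buf, page=PAGE, poison=POISON):
    """Return (zeroed, untouched) lists of page indexes within buf.

    zeroed    -> page is all zeros after the read
    untouched -> page still holds the poison pattern, i.e. nothing wrote it
    """
    zero_page = bytes(page)
    poison_page = bytes([poison]) * page
    zeroed, untouched = [], []
    view = memoryview(buf)
    for off in range(0, len(view) - page + 1, page):
        chunk = bytes(view[off:off + page])
        if chunk == zero_page:
            zeroed.append(off // page)
        elif chunk == poison_page:
            untouched.append(off // page)
    return zeroed, untouched


def poisoned_pread(fd, length, offset):
    """Fill a page-aligned buffer with POISON, then pread into it.

    An anonymous mmap is used because it is page aligned, which O_DIRECT
    needs.  length and offset are assumed to be block aligned.
    """
    buf = mmap.mmap(-1, length)
    buf[:] = bytes([POISON]) * length   # the "memset" before the read
    got = os.preadv(fd, [buf], offset)  # Linux only
    return buf, got


# Hypothetical usage against the device from this thread (needs root):
#   fd = os.open("/dev/sdc4", os.O_RDONLY | os.O_DIRECT)
#   buf, got = poisoned_pread(fd, 600 * 1024, 0)
#   print(got, classify_pages(buf))
```

Note that a page that is legitimately all zeros on disk will also show up 
in the zeroed list, so this only means something when the extent is known 
to hold non-zero data (as it is in our case, since the CRC passes on a 
re-read).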

Thanks!
sage