RE: Adding compression/checksum support for bluestore.

> -----Original Message-----
> From: Chris Dunlop [mailto:chris@xxxxxxxxxxxx]
> Sent: Tuesday, April 05, 2016 11:39 PM
> To: Sage Weil <sage@xxxxxxxxxxxx>
> Cc: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>; Igor Fedotov
> <ifedotov@xxxxxxxxxxxx>; ceph-devel <ceph-devel@xxxxxxxxxxxxxxx>
> Subject: Re: Adding compression/checksum support for bluestore.
> 
> On Wed, Apr 06, 2016 at 01:10:30AM +1000, Chris Dunlop wrote:
> > On Tue, Apr 05, 2016 at 08:35:43AM -0400, Sage Weil wrote:
> >> D = block size, U = hw UBER, C = checksum.  Let's add N = number of
> >> bits you actually want to read.  In that case, we have to read (N /
> >> D) blocks of D bits, and we get
> >>
> >> P(reading N bits and getting some bad data and not knowing it)
> >> 	= (D * U) / (2^C) * (N / D)
> >> 	= U * N / 2^C
> >>
> >> and the D term (block size) disappears.  IIUC this is what Chris was
> >> originally getting at.  The block size affects the probability I get
> >> an error on one block, but if I am a user reading something, I don't
> >> care about block size--I care about how much data I want to
> >> read.  I think in that case it doesn't really matter (modulo rounding
> >> error, minimum read size, how precisely we can locate the error, etc.).
> >>
> >> Is that right?
> >
> > Yep, that's pretty much what I was thinking. My probability algebra is
> > more than a little rusty but that looks like the formula I was
> > starting to put together.
> 
> OK, here's my (1st!, more below) working:
> 
> P(bad bit)
>   = probability bit is bad (BER)
>   = U
> 
> P(bad check)
>   = probability checksum fails (i.e. a false match)
>   = 2^-C
> 
> P(bit fail)
>   = probability bit is bad, and subsequent checksum fail
>   = P(bad bit) * P(bad check)
> 
> P(bit ok)
>   = probability bit is good, or flagged by checksum
>   = 1 - P(bit fail)
>   = 1 - (P(bad bit) * P(bad check))
> 
> P(block ok)
>   = probability D-bit block is all-bits-good, or flagged by checksum
>   = P(bit ok) ^ D
>   = (1 - (P(bad bit) * P(bad check))) ^ D
>   = (1 - (U * 2^-C)) ^ D

Well, that's correct if you have a C-bit checksum for each bit. It's not correct if the checksum is per block.
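
A tiny numeric aside (plain Python, standard-library decimal; the variable names are mine): for a single D-bit block, the "checksum per bit" reading of the formula above and the "one checksum per block" model land very close numerically, even though they describe different things. That's why the end-to-end numbers further down end up nearly identical.

from decimal import Decimal, getcontext
getcontext().prec = 50

U = Decimal("1e-15")       # per-bit UBER
D = 4 * 1024               # block size in bits
t = Decimal(2) ** -32      # 2^-C for a 32-bit checksum

per_bit   = 1 - (1 - U * t) ** D      # the formula above applied to one block
per_block = (1 - (1 - U) ** D) * t    # one checksum over the whole block

print(per_bit)     # ~9.54e-22
print(per_block)   # ~9.54e-22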

> 
> N  = number of bits in data in which we're interested
> 
> P(data ok)
>   = probability data is ok (good, or flagged)
>   = P(block ok) ^ (number_of_blocks)
>   = P(block ok) ^ (N / D)
>   = ((1 - (P(bad bit) * P(bad check))) ^ D) ^ (N / D)
>   = (1 - (P(bad bit) * P(bad check))) ^ N
>   = (1 - (U * 2^-C)) ^ N
> 
> P(bad data)
>   = probability of getting unflagged bad data
>   = 1 - P(data ok)
>   = 1 - ((1 - (U * 2^-C)) ^ N)
> 
> Like Sage's working, the size of the individual data blocks disappears. But
> unfortunately it doesn't match Sage's formula. :-(
> 
> OK, here's another attempt, looking at it a little differently: my 1st working
> applies the P(bad check) at the bit level, the same as for P(bad bit). But the
> checksum is actually done at the block level. So:
> 
> P(bad bit)
>   = probability bit is bad (BER)
>   = U
> 
> P(good bit)
>   = probability bit is good
>   = 1 - U
> 
> P(good block)
>   = probability D-bit block is all-bits-good
>   = (1 - U) ^ D
> 
> P(bad block)
>   = probability block has at least one bad bit
>   = 1 - P(good block)
>   = 1 - ((1 - U) ^ D)
> 
> P(bad check)
>   = probability checksum fails (i.e. a false match)
>   = 2^-C
> 
> P(good check)
>   = probability checksum succeeds (flags bad data)
>   = 1 - P(bad check)
>   = 1 - (2^-C)
> 
> P(block ok)
>   = probability block is all-good, or bad and flagged by checksum
>   = P(good block) + (P(bad block) * P(good check))

I don't think you can add these probabilities together. That only works for uncorrelated events, which this isn't. 
 
>   = ((1 - U) ^ D) + ((1 - ((1 - U) ^ D)) * (1 - (2^-C)))
>   = 2^-C * ((1-U)^D-1)+1
> 
> N  = number of bits in data in which we're interested
> 
> P(data ok)
>   = probability data is ok (good, or flagged)
>   = P(block ok) ^ (number_of_blocks)
>   = P(block ok) ^ (N / D)
>   = (((1 - U) ^ D) + ((1 - ((1 - U) ^ D)) * (1 - (2^-C)))) ^ (N / D)
>   = (2^-C * (((1-U)^D) - 1) + 1) ^ (N / D)
> 
> P(bad data)
>   = probability of getting unflagged bad data
>   = 1 - P(data ok)
>   = 1 - ((2^-C * ((1-U)^D-1)+1) ^ (N / D))
> 
> ...and this time the D term doesn't disappear. I.e. it supports Allen's
> contention that the block size matters.
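
Before plugging in numbers: the collapse of P(block ok) from the two-term form to 2^-C * ((1-U)^D - 1) + 1 above is a pure algebra step, and it checks out. A tiny symbolic check (assuming sympy is available; the symbol names are mine):

import sympy as sp

U, D, C = sp.symbols('U D C')

# P(good block) + P(bad block) * P(good check), as written above
expanded = (1 - U)**D + (1 - (1 - U)**D) * (1 - 2**-C)
# the collapsed form used for P(block ok)
compact = 2**-C * ((1 - U)**D - 1) + 1

print(sp.expand(expanded - compact))   # prints 0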
> 
> Let's plug some numbers into Wolfram Alpha to see if we get plausible
> results:
> 
> U = 10^-15
> C = 32
> D = 4 * 1024
> N = 5PB = 5 * 8 * 10^15

Not that it matters much, but shouldn't D be 4 * 8 * 1024 ?
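
It matters even less than one might expect. Here's a quick check (standard-library decimal; the names are mine) that Chris #2 barely moves whether D is taken as 4 * 1024 or 4 * 8 * 1024 bits:

from decimal import Decimal, getcontext
getcontext().prec = 60

U = Decimal("1e-15")
N = 5 * 8 * Decimal(10) ** 15      # 5 PB in bits
t = Decimal(2) ** -32              # 2^-C for C = 32

def chris2(D):
    return 1 - (t * ((1 - U) ** D - 1) + 1) ** (N / D)

print(chris2(4 * 1024))            # 4096-bit (512B) blocks, as used in the mail
print(chris2(4 * 8 * 1024))        # 32768-bit (true 4KB) blocks

The two results differ only around 1e-19, so the bits-vs-bytes reading of D doesn't change the conclusion.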
> 
> ---------
> Chris #1
> ---------
> P(bad data)
>   = 1 - ((1 - (U * 2^-C)) ^ N)
>   = 1 - ((1 - ((10^-15) * (2^-32))) ^ (5 * 8 * (10^15)))
>   = 9.3132257027866983914620845681582388414865913202250969 × 10^-9

If I understood your terminology, doesn't this apply a 32-bit checksum to each bit? If so, this result seems waaaaaay too high.

> 
> ---------
> Chris #2
> ---------
> P(bad data)
>   = 1 - ((2^-C * (((1-U)^D) - 1) + 1) ^ (N / D))
>   = 1 - (((2^-32) * ((1-(10^-15))^(4 * 1024)-1)+1) ^ ((5 * 8 * (10^15)) / (4 * 1024)))
>   = 9.3132257027676295619288907910277855094237197529999931 × 10^-9
>   (differs from Chris #1 at 10^-20)
> 
> If this formula is correct, it's still relatively immune to the data block size, e.g.
> at D = (1024 * 1024):
>   = 1 - (((2^-32) * ((1-(10^-15))^(1024 * 1024)-1)+1) ^ ((5 * 8 * (10^15)) / (1024 * 1024)))
>   = 9.3132256979038905963931781911807383342120258917664411 × 10^-9
>   (differs from D=(4 * 1024) at 10^-17)
> 
> ---------
> Sage
> ---------
> > P(reading N bits and getting some bad data and not knowing it)
> > 	= U * N / 2^C
> 
> P(bad data)
>   = (10^-15 * 5 * 8 * 10^15) / (2^32)
>   = 9.31322574615478515625 × 10^-9
>   (differs from Chris #2 at 10^-17)
> 
> ---------
> Allen
> ---------
> > So the overall probability of a failure for this read is:
> > (1 - (1-U)^D) / (2^C).
> 
> P(bad data)
>   = (1 - (1-U)^D) / (2^C)
>   = (1 - (1-(10^-15))^(4 * 1024)) / (2^32)
>   = 9.536743164042973518371608678388595553788002932094000 × 10^-22
>   (seems implausibly low?)
> 

I think my formula handles the odds of an individual (512B with 32-bit checksum) read silently failing. I'm not exactly sure, but I think you're computing the odds of hitting at least one silent failure somewhere in reading all 5PB. Apples/oranges... That's why the numbers are so far different.
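
For anyone who wants to reproduce the numbers side by side, here's a quick sketch (plain Python using the standard-library decimal module for enough precision; the variable names are mine, not from any Ceph code) that evaluates all four expressions with the parameters quoted above:

from decimal import Decimal, getcontext

getcontext().prec = 60             # enough digits to see differences near 1e-20

U = Decimal("1e-15")               # hardware UBER, per bit
C = 32                             # checksum bits
D = 4 * 1024                       # block size in bits, as used above (512B)
N = 5 * 8 * Decimal(10) ** 15      # 5 PB expressed in bits
t = Decimal(2) ** -C               # 2^-C

chris1 = 1 - (1 - U * t) ** N                           # Chris #1: checksum per bit
chris2 = 1 - (t * ((1 - U) ** D - 1) + 1) ** (N / D)    # Chris #2: checksum per D-bit block
sage   = U * N * t                                      # Sage: linear approximation U*N/2^C
allen  = (1 - (1 - U) ** D) * t                         # Allen: one D-bit read failing silently

for name, val in (("Chris #1", chris1), ("Chris #2", chris2),
                  ("Sage", sage), ("Allen", allen)):
    print(name, val)

It prints ~9.31e-9 for Chris #1, Chris #2 and Sage (the first two agreeing with each other at ~1e-20 and with Sage at ~1e-17), and ~9.5e-22 for Allen, i.e. exactly the per-read vs. whole-5PB gap described above.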

> 
> Now what?
> 
> Anyone want to check the assumptions and calcs, and/or come up with your
> own?
> 
> On the bright side, if the 10^-9 formulas (Chris #1, Chris #2, Sage) are
> anywhere near correct, they indicate that with a block size of 4K and a
> 32-bit checksum, you'd need to read 5 ZB (5 * 8 * 10^21 bits) to get to a
> 1% chance of seeing unflagged bad data, e.g.:
> 
> Chris #2 @ U=10^-15, C=32, D=(4 * 1024), N=(5 * 8 * 10^21)
>   = 1 - (((2^-32) * ((1-(10^-15))^(4 * 1024)-1)+1) ^ ((5 * 8 * 10^21) / (4 * 1024)))
>   = 0.009269991978615439689381062400448881380202158231615685460
>   = 0.92%
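
As a last sanity check of that figure (same standard-library decimal approach; names are mine): Chris #2 at N = 5 * 8 * 10^21 bits reproduces the ~0.92% above, and inverting Sage's linear approximation puts the exact-1% point a bit further out:

from decimal import Decimal, getcontext
getcontext().prec = 60

U = Decimal("1e-15")
C = 32
D = 4 * 1024
N = 5 * 8 * Decimal(10) ** 21      # 5 ZB expressed in bits
t = Decimal(2) ** -C

print(1 - (t * ((1 - U) ** D - 1) + 1) ** (N / D))   # ~0.00927, the figure above

# N at which Sage's approximation U * N / 2^C hits exactly 1%
print(Decimal("0.01") * Decimal(2) ** C / U)         # ~4.29e22 bits, roughly 5.4 ZB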
> 
> 
> Chris