Re: Adding compression/checksum support for bluestore.

On Wed, Apr 06, 2016 at 03:47:41PM +0000, Allen Samuels wrote:
>> From: Chris Dunlop [mailto:chris@xxxxxxxxxxxx]
>> On Wed, Apr 06, 2016 at 01:10:30AM +1000, Chris Dunlop wrote:
>> OK, here's my (1st!, more below) working:

<snip>

> Well, that's correct if you have a "C" bit checksum for each bit. It's not ok if the checksum is per-block.

Yeah, I realised that in the end.

>> OK, here's another attempt, looking at it a little differently: my 1st working
>> applies the P(bad check) at the bit level, the same as for P(bad bit). But the
>> checksum is actually done at the block level. So:
>> 
>> P(bad bit)
>>   = probability bit is bad (BER)
>>   = U
>> 
>> P(good bit)
>>   = probability bit is good
>>   = 1 - U
>> 
>> P(good block)
>>   = probability D-bit block is all-bits-good
>>   = (1 - U) ^ D
>> 
>> P(bad block)
>>   = probability block has at least one bad bit
>>   = 1 - P(good block)
>>   = 1 - ((1 - U) ^ D)
>> 
>> P(bad check)
>>   = probability the checksum fails to catch a bad block (i.e. a false match)
>>   = 2^-C
>> 
>> P(good check)
>>   = probability the checksum catches a bad block (flags the bad data)
>>   = 1 - P(bad check)
>>   = 1 - (2^-C)
>> 
>> P(block ok)
>>   = probability block is all-good, or bad and flagged by checksum
>>   = P(good block) + (P(bad block) * P(good check))
> 
> I don't think you can add these probabilities together. That only works for uncorrelated events, which this isn't. 

P(good block) and P(bad block) are mutually exclusive, so adding is exactly
what you want to do. The two sum to 1: every block is either good (no bad
bits) or bad (at least one bad bit). Multiplying P(bad block) by P(good
check) then gives the probability that a bad block is flagged by the
checksum.
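
To put concrete numbers on that, a quick Python sketch (U, C and the
corrected D are the example values used further down the thread):

  U = 1e-15          # bit error rate
  C = 32             # checksum bits
  D = 4 * 8 * 1024   # bits per 4KB block (the corrected D, per below)

  p_good_block = (1.0 - U) ** D        # all D bits good
  p_bad_block  = 1.0 - p_good_block    # at least one bad bit
  assert p_good_block + p_bad_block == 1.0  # mutually exclusive: they partition

  # The dangerous case is a bad block the checksum *misses*:
  p_silent = p_bad_block * 2.0 ** -C
  print(p_bad_block)   # ~3.28e-11
  print(p_silent)      # ~7.63e-21 per block read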

>>   = ((1 - U) ^ D) + ((1 - ((1 - U) ^ D)) * (1 - (2^-C)))
>>   = 2^-C * ((1-U)^D-1)+1
>> 
>> N  = number of bits in data in which we're interested
>> 
>> P(data ok)
>>   = probability data is ok (good, or flagged)
>>   = P(block ok) ^ (number_of_blocks)
>>   = P(block ok) ^ (N / D)
>>   = (((1 - U) ^ D) + ((1 - ((1 - U) ^ D)) * (1 - (2^-C)))) ^ (N / D)
>>   = (2^-C * (((1-U)^D) - 1) + 1) ^ (N / D)
>> 
>> P(bad data)
>>   = probability of getting unflagged bad data
>>   = 1 - P(data ok)
>>   = 1 - ((2^-C * ((1-U)^D-1)+1) ^ (N / D))
>> 
>> ...and this time the D term doesn't disappear. I.e. it supports Allen's
>> contention that the block size matters.
>> 
>> Let's plug some numbers into Wolfram Alpha to see if we get plausible
>> results:
>> 
>> U = 10^-15
>> C = 32
>> D = 4 * 1024
>> N = 5PB = 5 * 8 * 10^15
> 
> Not that it matters much, but shouldn't D be 4 * 8 * 1024 ?

Oops! Yup.

>> ---------
>> Chris #1
>> ---------
>> P(bad data)
>>   = 1 - ((1 - (U * 2^-C)) ^ N)
>>   = 1 - ((1 - ((10^-15) * (2^-32))) ^ (5 * 8 * (10^15)))
>>   = 9.3132257027866983914620845681582388414865913202250969 × 10^-9
> 
> If I understood your terminology, doesn't this apply a 32-bit checksum to each bit? If so, this result seems waaaaaay too high.

Yes, that's my first attempt which was incorrect for exactly that reason.

>> ---------
>> Chris #2
>> ---------
>> P(bad data)
>>   = 1 - ((2^-C * (((1-U)^D) - 1) + 1) ^ (N / D))
>>   = 1 - (((2^-32) * ((1-(10^-15))^(4 * 1024)-1)+1) ^ ((5 * 8 * (10^15)) / (4 * 1024)))
>>   = 9.3132257027676295619288907910277855094237197529999931 × 10^-9
>>   (differs from Chris #1 at 10^-20)
>> 
>> If this formula is correct, it's still relatively immune to the data block size, e.g.
>> at D = (1024 * 1024):
>>   = 1 - (((2^-32) * ((1-(10^-15))^(1024 * 1024)-1)+1) ^ ((5 * 8 * (10^15)) / (1024 * 1024)))
>>   = 9.3132256979038905963931781911807383342120258917664411 × 10^-9
>>   (differs from D=(4 * 1024) at 10^-17)

<snip>

>> ---------
>> Allen
>> ---------
>>> So the overall probability of a failure for this read is:
>>> (1 - (1-U)^D) / (2^C).
>> 
>> P(bad data)
>>   = (1 - (1-U)^D) / (2^C)
>>   = (1 - (1-(10^-15))^(4 * 1024)) / (2^32)
>>   = 9.536743164042973518371608678388595553788002932094000 × 10^-22
>>   (seems implausibly low?)
> 
> I think my formula handles the odds of an individual (512B with 32bit
> checksum) read silently failing. I'm not exactly sure, but I think you're
> computing the odds of reading all 5PB once without a silent failure.
> Apples/oranges... That's why the numbers are so far different.

Doh! Sorry, yes, it should have been obvious to me that your formula didn't
include the N term.

Let's say your formula gives P(block not ok), i.e. the probability a single
read silently fails. To expand this to P(bad data), i.e. the odds of a
failure somewhere when reading N bits:

P(block ok)
  = probability block is all-good, or bad and flagged by checksum
  = 1 - P(block not ok)
  = 1 - ((1 - (1-U)^D) / (2^C))

P(good data)
  = probability all data is ok (good, or flagged)
  = P(block ok) ^ (number_of_blocks)
  = P(block ok) ^ (N / D)
  = (1 - ((1 - (1-U)^D) / (2^C))) ^ (N / D)

P(bad data)
  = 1 - P(good data)
  = 1 - ((1 - ((1 - (1-U)^D) / (2^C))) ^ (N / D))

Compare that to my #2 from above:

P(bad data)
  = 1 - ((2^-C * (((1-U)^D) - 1) + 1) ^ (N / D))

With a little bit of manipulation (lazy way: use Wolfram Alpha's "simplify"
function on each formula), they can both be expressed as:

P(bad data)
  = 1 - (2^-C * (1-U)^D - 2^-C + 1) ^ (N / D)

I.e. we've independently arrived at the same place. Yay!

It just took me a lot longer... :-)
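
For anyone who'd rather check that algebra locally than via Wolfram Alpha,
here's a minimal sympy sketch (assumes sympy is installed):

  from sympy import symbols, simplify

  U, D, C = symbols('U D C', positive=True)

  # Chris #2: P(block ok) = (1-U)^D + (1 - (1-U)^D) * (1 - 2^-C)
  chris = (1 - U)**D + (1 - (1 - U)**D) * (1 - 2**-C)

  # Allen, expanded: P(block ok) = 1 - (1 - (1-U)^D) / 2^C
  allen = 1 - (1 - (1 - U)**D) / 2**C

  print(simplify(chris - allen))   # prints 0: identical formulas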

>> On the bright side, if the 10^-9 formulas (Chris #1, Chris #2, Sage) are
>> anywhere near correct, they indicate with a block size of 4K and 32-bit
>> checksum, you'd need to read 5 * 10^21 bits, or 0.5 ZB, to get to a 1% chance
>> of seeing unflagged bad data, e.g.:
>> 
>> Chris #2 @ U=10^-15, C=32, D=(4 * 1024), N=(5 * 8 * 10^21)
>>   = 1 - (((2^-32) * ((1-(10^-15))^(4 * 1024)-1)+1) ^ ((5 * 8 * 10^21) / (4 * 1024)))
>>   = 0.009269991978615439689381062400448881380202158231615685460
>>   = 0.93%

Just for completeness, using the corrected D = (4 * 8 * 1024):

P(bad data) @ U=10^-15, C=32, D=(4 * 8 * 1024), N=(5 * 8 * 10^21)
  = 1 - (2^-C * (1-U)^D - 2^-C + 1) ^ (N / D)
  = 1 - (2^-32 * (1-(10^-15))^(4 * 8 * 1024) - 2^-32 + 1) ^ ((5 * 8 * 10^21) / (4 * 8 * 1024))
  = 0.009269991978483162962573463579660791470065102520727107106
  = 0.93%

I.e. the same as before, to within 10^-12. Not surprising, as the earlier
D = (1024 * 1024) example already showed the result is relatively
insensitive to the block size.
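
And in case anyone wants to reproduce these numbers without Wolfram Alpha,
a small Python sketch (stdlib only; log1p/expm1 keep the ~1e-21 per-block
probability from being lost to floating-point rounding):

  from math import expm1, log1p

  # P(bad data) = 1 - (2^-C * (1-U)^D - 2^-C + 1) ^ (N / D)
  def p_bad_data(U, C, D, N):
      p_bad_block = -expm1(D * log1p(-U))        # 1 - (1-U)^D
      p_silent    = p_bad_block * 2.0 ** -C      # bad block, checksum misses it
      return -expm1((N / D) * log1p(-p_silent))  # 1 - (1 - p_silent)^(N/D)

  print(p_bad_data(1e-15, 32, 4 * 8 * 1024, 5 * 8 * 10**15))  # ~9.3132e-9 (5PB)
  print(p_bad_data(1e-15, 32, 4 * 8 * 1024, 5 * 8 * 10**21))  # ~0.00927 (0.5ZB)

  # block-size insensitivity: vary D over three orders of magnitude
  for D in (4 * 8 * 1024, 64 * 8 * 1024, 4 * 8 * 1024 * 1024):
      print(D, p_bad_data(1e-15, 32, D, 5 * 8 * 10**15))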

Chris