Re: Adding compression/checksum support for bluestore.

On 30.03.2016 23:41, Vikas Sinha-SSI wrote:

-----Original Message-----
From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-
owner@xxxxxxxxxxxxxxx] On Behalf Of Allen Samuels
Sent: Wednesday, March 30, 2016 12:47 PM
To: Sage Weil; Igor Fedotov
Cc: ceph-devel
Subject: Adding compression/checksum support for bluestore.

[snip]

Time to talk about checksums.

First, let's divide the world into checksums for data and checksums for
metadata -- and defer the discussion of metadata checksums (important,
but one thing at a time...)

I believe it's a requirement that when checksums are enabled, 100% of
data reads must be validated against their corresponding checksum. This
leads you to conclude that you must store a checksum for each
independently readable piece of data.

When compression is disabled, it's relatively straightforward -- there's a
checksum for each 4K readable block of data. Presumably this is a simple
vector stored in the pextent structure with one entry for each 4K block of
data.

Things get more complicated when compression is enabled. At a minimum,
you'll need a checksum for each blob of compressed data (I'm using blob
here as the unit of data put into the compressor, but what I really mean is
the minimum amount of *decompressable* data). As I've pointed out before,
many of the compression algorithms do their own checksum validation. For
algorithms that don't do their own checksum we'll want one checksum to
protect the block -- however, there's no reason that we can't implement this
as one checksum for each 4K physical block; the runtime cost is nearly
equivalent and it will considerably simplify the code.

Thus I think we really end up with a single, simple design. The pextent
structure contains a vector of checksums. Either that vector is empty
(checksums disabled) OR there is a checksum for each 4K block of data (note
this is NOT min_allocation size, it's minimum_read_size [if that's even a
parameter, or does the code assume 4K readable blocks? or worse, 512-byte
readable blocks? -- if so, we'll need to cripple this]).
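
To make that concrete, here's a minimal sketch of the shape I have in mind
(illustrative names only, not the actual bluestore_pextent_t; assumes a 4K
checksum granularity):

#include <cstdint>
#include <vector>

struct pextent_sketch {
  uint64_t offset = 0;             // physical offset on disk
  uint32_t length = 0;             // physical length, a multiple of 4K
  std::vector<uint32_t> csum;      // empty => checksums disabled,
                                   // otherwise length/4096 entries

  bool csum_enabled() const { return !csum.empty(); }

  // checksum covering the 4K block containing byte 'off' of this extent;
  // only meaningful when csum_enabled()
  uint32_t csum_for(uint32_t off) const { return csum[off / 4096]; }
};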

When compressing with a compression algorithm that does its own
checksumming, we can automatically suppress checksum generation. There
should also be an administrative switch for this.
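
In other words, something like this (the policy values and the
self_checksums property are made up for illustration, not existing
config knobs):

enum class csum_policy { auto_suppress, always, never };

struct compressor_info {
  bool self_checksums;   // the algorithm embeds/validates its own checksum
};

// Should bluestore generate its own per-block checksums for data
// compressed by 'c'?
inline bool need_store_checksum(const compressor_info& c, csum_policy p) {
  switch (p) {
  case csum_policy::always: return true;    // admin switch: force on
  case csum_policy::never:  return false;   // admin switch: force off
  default:                  return !c.self_checksums;  // auto-suppress when the
                                                       // compressor checks itself
  }
}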

This allows the checksumming to be pretty much independent of compression
-- which is nice :)

This got me thinking: we have another issue to discuss and resolve.

The scenario is when compression is enabled. Assume that we've taken a big
blob of data and compressed it into a smaller blob. We then call the allocator
for that blob. What do we do if the allocator can't find a CONTIGUOUS block
of storage of that size??? In the non-compressed case, it's relatively easy to
simply break it up into smaller chunks -- but that doesn't work well with
compression.

This isn't that unlikely a case; worse, it could happen with shockingly high
amounts of free space (>>75%) under some pathological access patterns.

There are really only two choices. You either fracture the logical data and
recompress, OR you modify the pextent data structure to handle this case.
The latter isn't terribly difficult to do: you just make the size/address values
into a vector of pairs. The former scheme could be quite expensive CPU-wise,
as you may end up fracturing and recompressing multiple times (granted, in a
pathological case). The latter scheme adds space to each onode for a rare case.
The space is recoverable with an optimized serializer/deserializer (in essence
you could burn a flag to indicate when a vector of physical chunks/sizes is
needed instead of the usual scalar pair).
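
Roughly like this (illustrative only -- the real thing would use the normal
Ceph encode/decode machinery rather than these hand-rolled helpers):

#include <cstdint>
#include <vector>

struct phys_chunk { uint64_t offset; uint32_t length; };

struct compressed_blob_sketch {
  std::vector<phys_chunk> chunks;   // almost always size() == 1

  void encode(std::vector<uint8_t>& out) const {
    uint8_t multi = chunks.size() > 1 ? 1 : 0;
    out.push_back(multi);                      // the single burned flag
    if (multi)
      put_u32(out, (uint32_t)chunks.size());   // chunk count only when needed
    for (const auto& c : chunks) {             // common case: one scalar pair
      put_u64(out, c.offset);
      put_u32(out, c.length);
    }
  }

private:
  static void put_u32(std::vector<uint8_t>& out, uint32_t v) {
    for (int i = 0; i < 4; ++i) out.push_back(uint8_t(v >> (8 * i)));
  }
  static void put_u64(std::vector<uint8_t>& out, uint64_t v) {
    for (int i = 0; i < 8; ++i) out.push_back(uint8_t(v >> (8 * i)));
  }
};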

IMO, we should pursue the latter scenario as it avoids the variable-latency
problem. I see the code/testing complexity of either choice as about the
same.

If I understand correctly, then there would still be a cost associated with writing discontiguously
to disk. In cases such as this, where the resources for compression are not easily available, I wonder
if it is reasonable to simply not do compression for that write. The cost of not compressing would be
a missed space optimization, but the cost of compressing in any and all cases could add significant latency.
That seems to be a reasonable and simple solution.
There is still the technical question of distinguishing the "no space" and "no contiguous space" allocation failure cases. This needs to be addressed by the allocator...
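
For instance (purely illustrative, not an existing interface), the allocator
could report the two cases separately so the write path can decide whether
falling back to an uncompressed write even makes sense:

#include <cstdint>

enum class alloc_result {
  ok,
  enospc,     // genuinely out of free space
  enocontig   // enough space overall, but no contiguous run of the wanted size
};

// Caller-side sketch: only fall back to an uncompressed (fragmentable)
// write when fragmentation, not exhaustion, caused the failure.
inline bool retry_uncompressed(alloc_result r) {
  return r == alloc_result::enocontig;
}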



