> -----Original Message-----
> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Allen Samuels
> Sent: Wednesday, March 30, 2016 12:47 PM
> To: Sage Weil; Igor Fedotov
> Cc: ceph-devel
> Subject: Adding compression/checksum support for bluestore.
>
> [snip]
>
> Time to talk about checksums.
>
> First let's divide the world into checksums for data and checksums for
> metadata -- and defer the discussion about checksums for metadata
> (important, but one at a time...).
>
> I believe it's a requirement that when checksums are enabled, 100% of data
> reads must be validated against their corresponding checksum. This leads
> you to conclude that you must store a checksum for each independently
> readable piece of data.
>
> When compression is disabled, it's relatively straightforward -- there's a
> checksum for each 4K readable block of data. Presumably this is a simple
> vector stored in the pextent structure with one entry for each 4K block of
> data.
>
> Things get more complicated when compression is enabled. At a minimum,
> you'll need a checksum for each blob of compressed data (I'm using "blob"
> here as the unit of data put into the compressor, but what I really mean is
> the minimum amount of *decompressable* data). As I've pointed out before,
> many of the compression algorithms do their own checksum validation. For
> algorithms that don't do their own checksum, we'll want one checksum to
> protect the block -- however, there's no reason we can't implement this as
> one checksum for each 4K physical block; the runtime cost is nearly
> equivalent and it will considerably simplify the code.
>
> Thus I think we really end up with a single, simple design. The pextent
> structure contains a vector of checksums. Either that vector is empty
> (checksums disabled) OR there is a checksum for each 4K block of data (note
> this is NOT min_allocation size, it's minimum_read_size [if that's even a
> parameter -- or does the code assume 4K readable blocks? Or worse, 512-byte
> readable blocks? If so, we'll need to cripple this]).
>
> When compressing with a compression algorithm that does checksumming, we
> can automatically suppress checksum generation. There should also be an
> administrative switch for this.
>
> This allows the checksumming to be pretty much independent of compression
> -- which is nice :)
>
> This got me thinking: we have another issue to discuss and resolve.
>
> The scenario is when compression is enabled. Assume that we've taken a big
> blob of data and compressed it into a smaller blob. We then call the
> allocator for that blob. What do we do if the allocator can't find a
> CONTIGUOUS block of storage of that size? In the non-compressed case it's
> relatively easy to simply break it up into smaller chunks -- but that
> doesn't work well with compression.
>
> This isn't that unlikely a case; worse, it could happen with shockingly
> high amounts of freespace (>>75%) under some pathological access patterns.
>
> There are really only two choices. You either fracture the logical data and
> recompress, OR you modify the pextent data structure to handle this case.
> The latter isn't terribly difficult to do: you just make the size/address
> values into a vector of pairs. The former scheme could be quite expensive
> CPU-wise, as you may end up fracturing and recompressing multiple times
> (granted, in a pathological case). The latter adds space to each onode for
> a rare case. That space is recoverable with an optimized
> serializer/deserializer (in essence you could burn a flag to indicate when
> a vector of physical chunks/sizes is needed instead of the usual scalar
> pair).
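To make the two proposals concrete -- the per-4K checksum vector and the
"vector of pairs" pextent -- here is a rough sketch of what such a structure
could look like. The name, field layout, and the crc32c choice are purely
illustrative; this is not the actual bluestore_pextent_t.

    // Illustrative sketch only -- not bluestore code.
    #include <cstddef>
    #include <cstdint>
    #include <utility>
    #include <vector>

    struct pextent_sketch_t {
      // Common case: one contiguous (disk offset, length) chunk.  Rare case:
      // several chunks holding a single compressed blob.  A flag in the
      // encoded form could select which layout was serialized, so the common
      // case pays nothing extra on disk.
      std::vector<std::pair<uint64_t, uint32_t>> chunks;

      // Empty when checksums are disabled; otherwise one entry per 4K block
      // of stored data (minimum read size, NOT min_alloc_size).
      std::vector<uint32_t> csum;   // e.g. crc32c per 4K block

      static constexpr uint32_t CSUM_BLOCK = 4096;

      bool has_csum() const { return !csum.empty(); }

      // Index of the checksum entry covering a byte offset in this extent.
      size_t csum_index(uint64_t extent_off) const {
        return size_t(extent_off / CSUM_BLOCK);
      }
    };

On read, every 4K block handed back -- compressed or not -- would be
checksummed and compared against the matching csum entry before the data is
returned, which satisfies the "100% of reads validated" requirement above.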
> IMO, we should pursue the latter scenario, as it avoids the variable-latency
> problem. I see the code/testing complexity of either choice as about the
> same.

If I understand correctly, there would still be a cost associated with writing
discontiguously to disk. In cases such as this, where contiguous space for the
compressed blob isn't readily available, I wonder if it is reasonable to simply
not do compression for that write. The cost of not compressing would be a
missed space optimization, but the cost of compressing in any and all cases
could be significant to latency.
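To make that trade-off concrete, here is a rough sketch of the decision being
discussed (the names and the placement enum are hypothetical, not the actual
bluestore write path):

    // Hypothetical sketch of the placement decision -- not bluestore code.
    #include <cstdint>

    enum class placement_t {
      compressed_contiguous,   // common case: one contiguous compressed extent
      compressed_fragmented,   // compressed blob spread over several chunks
      uncompressed             // fall back to raw data for this write
    };

    // Decide how to store one blob, given whether the allocator could offer a
    // single contiguous extent of compressed_len bytes ('contiguous_ok').
    placement_t choose_placement(bool contiguous_ok,
                                 bool allow_fragmented_compressed,
                                 uint32_t compressed_len,
                                 uint32_t raw_len) {
      if (contiguous_ok && compressed_len < raw_len)
        return placement_t::compressed_contiguous;
      if (allow_fragmented_compressed && compressed_len < raw_len)
        // Keep the compression win; the pextent records a vector of
        // (offset, length) chunks instead of a single pair.
        return placement_t::compressed_fragmented;
      // Skip compression for this write: a missed space optimization, but no
      // recompression loop and no extra onode space.
      return placement_t::uncompressed;
    }

Whether the fall-back to uncompressed is automatic or gated by a config option
is a policy question; the sketch only shows where that decision would sit in
the write path.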