> -----Original Message-----
> From: Igor Fedotov [mailto:ifedotov@xxxxxxxxxxxx]
> Sent: Thursday, March 31, 2016 9:58 AM
> To: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>; Sage Weil <sage@xxxxxxxxxxxx>
> Cc: ceph-devel <ceph-devel@xxxxxxxxxxxxxxx>
> Subject: Re: Adding compression/checksum support for bluestore.
>
> On 30.03.2016 22:46, Allen Samuels wrote:
> > [snip]
> >
> > Time to talk about checksums.
> >
> > First let's divide the world into checksums for data and checksums for
> > metadata -- and defer the discussion about checksums for metadata
> > (important, but one at a time...)
> >
> > I believe it's a requirement that when checksums are enabled, 100% of
> > data reads must be validated against their corresponding checksum. This
> > leads you to conclude that you must store a checksum for each
> > independently readable piece of data.
> >
> > When compression is disabled, it's relatively straightforward -- there's
> > a checksum for each 4K readable block of data. Presumably this is a
> > simple vector stored in the pextent structure with one entry for each
> > 4K block of data.
> >
> > Things get more complicated when compression is enabled. At a minimum,
> > you'll need a checksum for each blob of compressed data (I'm using "blob"
> > here as the unit of data put into the compressor, but what I really mean
> > is the minimum amount of *decompressable* data). As I've pointed out
> > before, many of the compression algorithms do their own checksum
> > validation. For algorithms that don't do their own checksum, we'll want
> > one checksum to protect the block -- however, there's no reason that we
> > can't implement this as one checksum for each 4K physical blob; the
> > runtime cost is nearly equivalent and it will considerably simplify the
> > code.
> >
> > Thus I think we really end up with a single, simple design. The pextent
> > structure contains a vector of checksums. Either that vector is empty
> > (checksums disabled) OR there is a checksum for each 4K block of data
> > (note this is NOT min_allocation size, it's minimum_read_size [if that's
> > even a parameter, or does the code assume 4K readable blocks? Or worse,
> > 512-byte readable blocks?? -- if so, we'll need to cripple this]).
> >
> > When compressing with a compression algorithm that does checksumming,
> > we can automatically suppress checksum generation. There should also be
> > an administrative switch for this.
> >
> > This allows the checksumming to be pretty much independent of
> > compression -- which is nice :)
>
> Mostly agree.
>
> But I think we should consider the compression algorithm as a black box
> and rely on a standalone checksum verification only.
> And I suppose that the main purpose of the checksum validation at the
> bluestore level is to protect from HW failures. Thus we need to check
> *physical* data. That is, data before decompressing.

Not sure I agree. Provided that it's "safe" (see later), there's no real
difference between checking the checksum on compressed data or on the
decompressed data. By "safe" I mean that if corrupt data is decompressed, I
don't corrupt the environment (fault, array index out of bounds, ...).
However, when I think through the implementation of the code, I find it
natural to do checksum generation/checking on the physical data (i.e., after
compression and before decompression). So as long as we're doing the
checksum, we won't actually care whether the algorithm is "safe" or not....
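To make that concrete, here's a rough sketch of the per-4K checksum vector
with generation/verification over the physical (post-compression) bytes.
The names (pextent_sketch, placeholder_csum) are purely illustrative, not
the actual bluestore types, and the checksum function is a stand-in for
whatever algorithm we actually pick (crc32c, xxhash, ...):

// Rough sketch only -- illustrative names, not the actual BlueStore types.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for whatever checksum function is chosen (crc32c, xxhash, ...).
static uint32_t placeholder_csum(const uint8_t* p, size_t len) {
  uint32_t h = 2166136261u;                 // FNV-1a, purely for illustration
  for (size_t i = 0; i < len; ++i) {
    h ^= p[i];
    h *= 16777619u;
  }
  return h;
}

struct pextent_sketch {
  static constexpr size_t csum_block = 4096;  // minimum readable unit

  uint64_t offset = 0;                     // physical offset of the extent
  uint32_t length = 0;                     // physical length in bytes
  std::vector<uint32_t> csum;              // one entry per 4K block; empty == disabled

  // Checksums are generated over the *physical* bytes, i.e. after compression.
  void generate_csums(const std::vector<uint8_t>& physical) {
    csum.clear();
    for (size_t off = 0; off < physical.size(); off += csum_block) {
      size_t len = std::min(csum_block, physical.size() - off);
      csum.push_back(placeholder_csum(physical.data() + off, len));
    }
  }

  // Verify on every read; an empty vector means checksums are disabled.
  bool verify_csums(const std::vector<uint8_t>& physical) const {
    if (csum.empty())
      return true;
    for (size_t i = 0; i < csum.size(); ++i) {
      size_t off = i * csum_block;
      size_t len = std::min(csum_block, physical.size() - off);
      if (placeholder_csum(physical.data() + off, len) != csum[i])
        return false;                      // surface as a checksum/EIO failure
    }
    return true;
  }
};

Note that whether the data in "physical" is compressed or not doesn't matter
to this code at all, which is the independence property argued for above.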
> Another thing to consider is the ability to use the introduced checksums
> for scrubbing. One can probably extend the objectstore interface to be
> able to validate stored data without sending the data back.
> I don't see such an option at the moment. Please correct me if I missed that.

No need for that option. Just read the data, check the return code (good or
checksum failure), then discard the data. This is essentially exactly the
same code path as a "please validate the checksum" specialized opcode.

> > This got me thinking, we have another issue to discuss and resolve.
> >
> > The scenario is when compression is enabled. Assume that we've taken a
> > big blob of data and compressed it into a smaller blob. We then call the
> > allocator for that blob. What do we do if the allocator can't find a
> > CONTIGUOUS block of storage of that size??? In the non-compressed case,
> > it's relatively easy to simply break it up into smaller chunks -- but
> > that doesn't work well with compression.
> >
> > This isn't that unlikely a case; worse, it could happen with shockingly
> > high amounts of freespace (>>75%) with some pathological access patterns.
> >
> > There are really only two choices. You either fracture the logical data
> > and recompress, OR you modify the pextent data structure to handle this
> > case. The latter isn't terribly difficult to do; you just make the
> > size/address values into a vector of pairs. The former scheme could be
> > quite expensive CPU-wise, as you may end up fracturing and recompressing
> > multiple times (granted, in a pathological case). The latter case adds
> > space to each onode for a rare case. The space is recoverable with an
> > optimized serializer/deserializer (in essence you could burn a flag to
> > indicate when a vector of physical chunks/sizes is needed instead of the
> > usual scalar pair).
> >
> > IMO, we should pursue the latter scenario as it avoids the variable
> > latency problem. I see the code/testing complexity of either choice as
> > about the same.
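For what it's worth, a rough sketch of the "burn a flag" encoding for the
vector-of-chunks variant might look like the following. Again, the names are
illustrative, not the real bluestore_pextent_t or its encoder:

// Rough sketch only -- illustrative names, not the real bluestore encoding.
#include <cstddef>
#include <cstdint>
#include <vector>

struct physical_chunk {
  uint64_t offset;   // physical address of this fragment
  uint32_t length;   // fragment length in bytes
};

struct pextent_sketch {
  // Usually exactly one contiguous chunk; more only when the allocator
  // couldn't find a contiguous region for a compressed blob.
  std::vector<physical_chunk> chunks;

  // One flag bit says whether a single scalar pair follows (the common
  // case, no extra space) or a counted vector of pairs.
  void encode(std::vector<uint8_t>& out) const {
    uint8_t flags = chunks.size() > 1 ? 0x01 : 0x00;
    append(out, &flags, sizeof(flags));
    if (flags & 0x01) {
      uint32_t n = static_cast<uint32_t>(chunks.size());
      append(out, &n, sizeof(n));
    }
    for (const auto& c : chunks) {
      append(out, &c.offset, sizeof(c.offset));
      append(out, &c.length, sizeof(c.length));
    }
  }

private:
  static void append(std::vector<uint8_t>& out, const void* p, size_t n) {
    const uint8_t* b = static_cast<const uint8_t*>(p);
    out.insert(out.end(), b, b + n);
  }
};

The common single-chunk case costs one flag byte beyond the usual scalar
pair, so the onode only grows in the rare fragmented case.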