One thing to also factor in is that if you increase the span of a checksum,
you degrade the quality of the checksum. So if you go with 128K chunks of
data you'll likely want to increase the checksum itself to something beyond
a CRC-32. Maybe somebody out there has a good way of describing this
quantitatively.

Allen Samuels
Software Architect, Fellow, Systems and Software Solutions

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@xxxxxxxxxxx

> -----Original Message-----
> From: Sage Weil [mailto:sage@xxxxxxxxxxxx]
> Sent: Wednesday, March 30, 2016 3:16 PM
> To: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>
> Cc: Igor Fedotov <ifedotov@xxxxxxxxxxxx>; ceph-devel <ceph-devel@xxxxxxxxxxxxxxx>
> Subject: Re: Adding compression/checksum support for bluestore.
>
> On Wed, 30 Mar 2016, Allen Samuels wrote:
> > [snip]
> >
> > Time to talk about checksums.
> >
> > First let's divide the world into checksums for data and checksums for
> > metadata -- and defer the discussion about checksums for metadata
> > (important, but one at a time...)
> >
> > I believe it's a requirement that when checksums are enabled, 100%
> > of data reads must be validated against their corresponding checksum.
> > This leads you to conclude that you must store a checksum for each
> > independently readable piece of data.
>
> +1
>
> > When compression is disabled, it's relatively straightforward --
> > there's a checksum for each 4K readable block of data. Presumably this
> > is a simple vector stored in the pextent structure with one entry for
> > each 4K block of data.
>
> Maybe. If the object is known to be sequential write and sequential read,
> or even sequential write and random read but on a HDD-like medium, then we
> can checksum on something like 128K (since it doesn't cost any more to read
> 128k than 4k). I think the checksum block size should be a per-object
> property.
> *Maybe* a pextent property, given that compression is also entering the
> picture.
>
> > Things get more complicated when compression is enabled. At a minimum,
> > you'll need a checksum for each blob of compressed data (I'm using
> > blob here as the unit of data put into the compressor, but what I
> > really mean is the minimum amount of *decompressible* data). As I've
> > pointed out before, many of the compression algorithms do their own
> > checksum validation. For algorithms that don't do their own checksum
> > we'll want one checksum to protect the block -- however, there's no
> > reason that we can't implement this as one checksum for each 4K
> > physical blob, the runtime cost is nearly equivalent and it will
> > considerably simplify the code.
>
> I'm just worried about the size of metadata if we have 4k checksums but
> have to read big extents anyway; cheaper to store a 4 byte checksum for
> each compressed blob.
>
> > Thus I think we really end up with a single, simple design. The
> > pextent structure contains a vector of checksums. Either that vector
> > is empty (checksum disabled) OR there is a checksum for each 4K block
> > of data (note this is NOT min_allocation size, it's minimum_read_size
> > [if that's even a parameter or does the code assume 4K readable
> > blocks? [or worse, 512 readable blocks?? -- if so, we'll need to
> > cripple this]]).
> >
> > When compressing with a compression algorithm that does checksumming
> > we can automatically suppress checksum generation. There should also
> > be an administrative switch for this.
> >
> > This allows the checksumming to be pretty much independent of
> > compression -- which is nice :)
> >
> > This got me thinking, we have another issue to discuss and resolve.
> >
> > The scenario is when compression is enabled. Assume that we've taken a
> > big blob of data and compressed it into a smaller blob. We then call
> > the allocator for that blob.
> > What do we do if the allocator can't find a CONTIGUOUS block of
> > storage of that size??? In the non-compressed case, it's relatively
> > easy to simply break it up into smaller chunks -- but that doesn't
> > work well with compression.
> >
> > This isn't that unlikely a case; worse, it could happen with
> > shockingly high amounts of free space (>>75%) with some pathological
> > access patterns.
> >
> > There are really only two choices. You either fracture the logical
> > data and recompress OR you modify the pextent data structure to handle
> > this case. The latter isn't terribly difficult to do, you just make
> > the size/address values into a vector of pairs. The former scheme
> > could be quite expensive CPU wise as you may end up fracturing and
> > recompressing multiple times (granted, in a pathological case). The
> > latter case adds space to each onode for a rare case. The space is
> > recoverable with an optimized serializer/deserializer (in essence you
> > could burn a flag to indicate when a vector of physical chunks/sizes
> > is needed instead of the usual scalar pair).
> >
> > IMO, we should pursue the latter scenario as it avoids the variable
> > latency problem. I see the code/testing complexity of either choice as
> > about the same.
>
> Hrm, I hadn't thought about this one. :(
>
> What about a third option: we ask the allocator for the uncompressed size,
> and *then* compress. If it gives us something small, we will know then to
> compress a smaller piece. It just means that we'll be returning space back
> to the allocator in the general case after we compress, which will burn a
> bit of CPU, and may screw things up when lots of threads are allocating in
> parallel and we hope to lay them out sequentially.
>
> Or, maybe we flip into this sort of pessimistic allocation mode only when
> the amount of space above a certain size threshold is low.
> With the current binned allocator design this is trivial; it probably is
> pretty easy with your bitmap-based approach as well with some minimal
> accounting.
>
> I really don't like the idea of making pextents able to store fractions of
> a compressed blob; it'll complicate the structures and code paths
> significantly, and they'll be complex enough as it is. :(
>
> sage
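
On the quantitative question raised at the top of this message, here is one rough way to put numbers on it. This is a minimal sketch under stated assumptions, not a measurement: bit errors are assumed independent with a made-up rate `ber`, and a corrupted block is assumed to slip past a k-bit checksum with probability of about 2^-k. All function names are hypothetical. Under that model the chance that a single block read returns bad data undetected grows roughly linearly with the checksum span, so moving from 4K to 128K costs about log2(32) = 5 bits of protection per block. The model says nothing about the length-dependent guarantees of any particular CRC polynomial.

```python
import math


def p_block_corrupt(block_bytes, ber):
    """Probability that a block holds at least one flipped bit.

    Assumes independent bit errors at rate `ber` (an illustrative
    assumption). Computes 1 - (1 - ber)**n stably for tiny `ber`.
    """
    n_bits = block_bytes * 8
    return -math.expm1(n_bits * math.log1p(-ber))


def p_undetected(block_bytes, ber, checksum_bits):
    """Probability that one block read is corrupt AND passes its checksum.

    Assumes a corrupted block matches a k-bit checksum with
    probability 2**-k (idealized, random-looking corruption).
    """
    return p_block_corrupt(block_bytes, ber) * 2.0 ** -checksum_bits


def extra_bits_needed(old_block_bytes, new_block_bytes):
    """Checksum bits to add so the per-block miss rate stays about the same."""
    return math.log2(new_block_bytes / old_block_bytes)


def checksum_metadata_bytes(extent_bytes, block_bytes, checksum_bytes):
    """Metadata cost of one checksum per block over an extent."""
    n_blocks = -(-extent_bytes // block_bytes)  # ceiling division
    return n_blocks * checksum_bytes


BER = 1e-15  # hypothetical bit error rate, for illustration only
small = p_undetected(4 * 1024, BER, 32)
large = p_undetected(128 * 1024, BER, 32)
ratio = large / small
bits = extra_bits_needed(4 * 1024, 128 * 1024)
```

The last helper shows the other side of the trade-off: for a 4M extent, 4-byte checksums cost 4096 bytes of metadata at 4K granularity versus 128 bytes at 128K.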
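
The "third option" above (ask the allocator for the uncompressed size, compress whatever fits, hand the unused tail back) can be sketched roughly as follows. This is a toy illustration only, not BlueStore code: `ToyAllocator` and `write_compressed` are hypothetical names, the allocator is a plain list of free (offset, length) extents with no coalescing, alignment to the allocation unit is ignored, and zlib stands in for whichever compressor is configured.

```python
import zlib


class ToyAllocator:
    """Hypothetical allocator over a list of free (offset, length) extents."""

    def __init__(self, free_extents):
        self.free = list(free_extents)

    def allocate(self, want):
        """Return one contiguous (offset, length) with length <= want.

        Takes the largest free extent, so it may return less than asked.
        """
        if not self.free:
            raise MemoryError("no free space")
        i = max(range(len(self.free)), key=lambda j: self.free[j][1])
        off, length = self.free.pop(i)
        got = min(want, length)
        if length > got:
            self.free.append((off + got, length - got))
        return off, got

    def release(self, off, length):
        """Give unused space back (no coalescing in this toy)."""
        if length > 0:
            self.free.append((off, length))


def write_compressed(data, alloc):
    """Allocate for the uncompressed size first, then compress what fits.

    Returns a list of (logical_off, logical_len, phys_off, payload,
    compressed) tuples; every payload fits in one contiguous extent.
    """
    blobs = []
    pos = 0
    while pos < len(data):
        # Ask for the worst case: the remaining data, uncompressed.
        phys_off, got = alloc.allocate(len(data) - pos)
        chunk = data[pos:pos + got]
        payload = zlib.compress(chunk)
        compressed = len(payload) < len(chunk)
        if not compressed:
            payload = chunk  # incompressible: store raw, still fits
        # Return the tail we did not need to the allocator.
        alloc.release(phys_off + len(payload), got - len(payload))
        blobs.append((pos, len(chunk), phys_off, payload, compressed))
        pos += len(chunk)
    return blobs
```

With fragmented free space the data simply ends up as several smaller compressed blobs, each contiguous, at the cost of the extra release call per blob mentioned above.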