> -----Original Message-----
> From: Sage Weil [mailto:sage@xxxxxxxxxxxx]
> Sent: Wednesday, March 30, 2016 3:30 PM
> To: Gregory Farnum <gfarnum@xxxxxxxxxx>
> Cc: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>; Igor Fedotov
> <ifedotov@xxxxxxxxxxxx>; ceph-devel <ceph-devel@xxxxxxxxxxxxxxx>
> Subject: Re: Adding compression/checksum support for bluestore.
>
> On Wed, 30 Mar 2016, Gregory Farnum wrote:
> > On Wed, Mar 30, 2016 at 3:15 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> > > On Wed, 30 Mar 2016, Allen Samuels wrote:
> > >> [snip]
> > >>
> > >> Time to talk about checksums.
> > >>
> > >> First let's divide the world into checksums for data and checksums
> > >> for metadata -- and defer the discussion about checksums for
> > >> metadata (important, but one at a time...)
> > >>
> > >> I believe it's a requirement that when checksums are enabled, 100%
> > >> of data reads must be validated against their corresponding
> > >> checksum. This leads you to conclude that you must store a checksum
> > >> for each independently readable piece of data.
> > >
> > > +1
> > >
> > >> When compression is disabled, it's relatively straightforward --
> > >> there's a checksum for each 4K readable block of data. Presumably
> > >> this is a simple vector stored in the pextent structure with one
> > >> entry for each 4K block of data.
> > >
> > > Maybe. If the object is known to be sequential write and sequential
> > > read, or even sequential write and random read but on an HDD-like
> > > medium, then we can checksum on something like 128K (since it
> > > doesn't cost any more to read 128K than 4K). I think the checksum
> > > block size should be a per-object property. *Maybe* a pextent
> > > property, given that compression is also entering the picture.
> > >
> > >> Things get more complicated when compression is enabled. At a
> > >> minimum, you'll need a checksum for each blob of compressed data
> > >> (I'm using blob here as the unit of data put into the compressor,
> > >> but what I really mean is the minimum amount of *decompressible*
> > >> data). As I've pointed out before, many of the compression
> > >> algorithms do their own checksum validation. For algorithms that
> > >> don't do their own checksum we'll want one checksum to protect the
> > >> block -- however, there's no reason that we can't implement this as
> > >> one checksum for each 4K physical blob; the runtime cost is nearly
> > >> equivalent and it will considerably simplify the code.
> > >
> > > I'm just worried about the size of the metadata if we have 4K
> > > checksums but have to read big extents anyway; cheaper to store a
> > > 4-byte checksum for each compressed blob.
> >
> > I haven't followed the BlueStore discussion closely, but doesn't it
> > still allow for data overwrites if they're small and the object block
> > is large? I presume the goal of 4K instead of per-chunk checksums is
> > to prevent turning those overwrites into a read-modify-write. (Is
> > that sufficient?)
>
> You can overwrite small things (< min_alloc_size, and thus not written
> to a newly allocated location), although it has to happen as a logged
> WAL event to make it atomic. It's true that small checksums make that
> r/m/w small, but the difference in cost on an HDD between 4K and 64K or
> 128K is almost unnoticeable. On flash the calculus is probably
> different -- and the cost of a larger onode is probably less of an
> issue.
>
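To make the "checksum vector in the pextent" idea above concrete, here is a
rough sketch. The names and fields are hypothetical, not the actual
BlueStore structures: one checksum per csum block, stored per extent, with
the block size kept as a tunable property.

// Rough sketch only (hypothetical names, not actual BlueStore code):
// a physical extent carrying one checksum per fixed-size csum block, so
// every independently readable unit can be validated on read.
#include <cstddef>
#include <cstdint>
#include <vector>

struct pextent_csum_sketch {
  uint64_t offset = 0;               // physical offset on the device
  uint64_t length = 0;               // extent length in bytes
  uint32_t csum_block_size = 4096;   // per-object (or per-extent) property
  std::vector<uint32_t> csums;       // one 32-bit checksum per csum block

  // number of checksum entries needed to cover the extent
  size_t csum_count() const {
    return (length + csum_block_size - 1) / csum_block_size;
  }

  // which checksum entry covers a given byte offset within the extent
  size_t csum_index(uint64_t extent_off) const {
    return extent_off / csum_block_size;
  }
};

Keeping csum_block_size on the extent (rather than hard-coding 4K) is what
would let an object or extent choose 4K, 16K, or 128K granularity as
discussed above.
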
> 4K checksums for a 4MB object: 1024 * 32 bits -> 4K of metadata. That
> takes an onode from ~300 bytes currently to say 4400 bytes... about an
> order of magnitude. That will significantly reduce the amount of
> metadata we are able to cache in RAM.

One option is 16-bit checksums. Not as crazy as it sounds. Let me talk to
my HW guys about this.

> My gut tells me 4K or 16K checksums for SSD (4400 or 1300 byte onodes),
> 128K for HDD (~430 bytes, only 128 bytes of csums, less than double)...
>
> sage
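For reference, a small back-of-the-envelope calculator for the onode-size
arithmetic quoted above. It assumes a ~300-byte base onode and a 4MB
object (both taken from the numbers in the mail); it is illustrative only,
not Ceph code.

// Back-of-the-envelope calculator for checksum metadata overhead.
// Assumes a ~300 byte base onode and a 4MB object; not Ceph code.
#include <cstdint>
#include <initializer_list>
#include <iostream>

int main() {
  const uint64_t object_size = 4ull << 20;   // 4MB object
  const uint64_t base_onode  = 300;          // ~current onode size (bytes)

  for (uint64_t bs : {4096ull, 16384ull, 131072ull}) {   // csum block sizes
    for (uint64_t width : {2ull, 4ull}) {                // 16- and 32-bit csums
      uint64_t csum_bytes = (object_size / bs) * width;
      std::cout << "csum block " << bs << ", " << width * 8 << "-bit csum: "
                << csum_bytes << " csum bytes -> onode ~"
                << (base_onode + csum_bytes) << " bytes\n";
    }
  }
  // e.g. 4K blocks / 32-bit:   4096 csum bytes -> ~4400 byte onode
  //      16K blocks / 32-bit:  1024 csum bytes -> ~1300 byte onode
  //      128K blocks / 32-bit:  128 csum bytes -> ~430 byte onode
  return 0;
}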