Re: Adding compression/checksum support for bluestore.

On Wed, 30 Mar 2016, Gregory Farnum wrote:
> On Wed, Mar 30, 2016 at 3:15 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> > On Wed, 30 Mar 2016, Allen Samuels wrote:
> >> [snip]
> >>
> >> Time to talk about checksums.
> >>
> >> First let's divide the world into checksums for data and checksums for
> >> metadata -- and defer the discussion about checksums for metadata
> >> (important, but one at a time...)
> >>
> >> I believe it's a requirement that when checksums are enabled that 100%
> >> of data reads must be validated against their corresponding checksum.
> >> This leads you to conclude that you must store a checksum for each
> >> independently readable piece of data.
> >
> > +1
> >
> >> When compression is disabled, it's relatively straightforward -- there's
> >> a checksum for each 4K readable block of data. Presumably this is a
> >> simple vector stored in the pextent structure with one entry for each 4K
> >> block of data.
> >
> > Maybe.  If the object is known to be sequential write and sequential read,
> > or even sequential write and random read but on a HDD-like medium, then we
> > can checksum on something like 128K (since it doesn't cost any more to
> > read 128k than 4k).  I think the checksum block size should be a
> > per-object property.  *Maybe* a pextent property, given that compression
> > is also entering the picture.
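
(To make that concrete, roughly something like the sketch below on the
pextent -- field names are purely illustrative, not the actual bluestore
types:)

  #include <cstddef>
  #include <cstdint>
  #include <vector>

  // Illustrative sketch: a physical extent carrying its own checksum
  // vector, with the checksum block size stored alongside it so it can
  // vary per object (or per extent).
  struct pextent_sketch {
    uint64_t offset;              // physical offset on the device
    uint64_t length;              // physical length in bytes
    uint32_t csum_block_size;     // e.g. 4096 for flash, 131072 for HDD
    std::vector<uint32_t> csum;   // one entry per csum_block_size bytes

    // which csum entry covers a given byte offset within this extent
    size_t csum_index(uint64_t off_in_extent) const {
      return off_in_extent / csum_block_size;
    }
  };
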
> >
> >> Things get more complicated when compression is enabled. At a minimum,
> >> you'll need a checksum for each blob of compressed data (I'm using blob
> >> here as the unit of data put into the compressor, but what I really mean is
> >> the minimum amount of *decompressable* data). As I've pointed out
> >> before, many of the compression algorithms do their own checksum
> >> validation. For algorithms that don't do their own checksum we'll want
> >> one checksum to protect the block -- however, there's no reason that we
> >> can't implement this as one checksum for each 4K physical blob, the
> >> runtime cost is nearly equivalent and it will considerably simplify the
> >> code.
> >
> > I'm just worried about the size of metadata if we have 4k checksums but
> > have to read big extents anyway; cheaper to store a 4 byte checksum for
> > each compressed blob.
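
(And on the read path, the check is just: verify the stored csums over the
compressed bytes before decompressing.  A minimal sketch, with zlib's crc32
standing in for whatever crc32c/xxhash we actually pick, and made-up names:)

  #include <algorithm>
  #include <cstddef>
  #include <cstdint>
  #include <vector>
  #include <zlib.h>

  // Illustrative only: verify a compressed blob on read by checking one
  // stored checksum per csum block of the on-disk (compressed) bytes,
  // before handing the blob to the decompressor.
  bool verify_blob(const std::vector<uint8_t>& compressed,
                   const std::vector<uint32_t>& stored_csums,
                   uint32_t csum_block_size)
  {
    size_t n = 0;
    for (size_t off = 0; off < compressed.size();
         off += csum_block_size, ++n) {
      size_t len = std::min<size_t>(csum_block_size,
                                    compressed.size() - off);
      uint32_t c = crc32(0L, compressed.data() + off, len);
      if (n >= stored_csums.size() || c != stored_csums[n])
        return false;   // mismatch -> EIO / repair path
    }
    return n == stored_csums.size();
  }
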
> 
> I haven't followed the BlueStore discussion closely, but doesn't it
> still allow for data overwrites if they're small and the object block
> is large? I presume the goal of 4K instead of per-chunk checksums is
> to prevent turning those overwrites into a read-modify-write. (Is that
> sufficient?)

You can overwrite small things (< min_alloc_size, and thus not written to 
a newly allocated location), although it has to happen as a logged WAL 
event to make it atomic.  It's true that a small checksum block size keeps 
that r/m/w small, but the difference in cost on an HDD between reading 4K 
and reading 64K or 128K is almost unnoticeable.  On flash the calculus is 
probably different--and there the cost of a larger onode is probably less 
of an issue.
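
Back of the envelope, assuming ~10ms average seek+rotation and ~150MB/s
sequential transfer (numbers picked purely for illustration):

  #include <cstdio>
  #include <initializer_list>

  // Rough HDD read-cost model: the seek dominates, so reading a larger
  // checksum block barely changes per-IO latency.
  int main() {
    const double seek_ms = 10.0;          // assumed avg seek + rotation
    const double xfer_mb_per_s = 150.0;   // assumed sequential rate
    for (double kb : {4.0, 64.0, 128.0}) {
      double xfer_ms = (kb / 1024.0) / xfer_mb_per_s * 1000.0;
      std::printf("%6.0fK read: %.2f ms\n", kb, seek_ms + xfer_ms);
    }
    return 0;
  }

  // -> 4K ~10.03 ms, 64K ~10.42 ms, 128K ~10.83 ms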

4K checksums for a 4MB object means 1024 * 32 bits -> 4K of metadata.  That 
takes an onode from ~300 bytes currently to, say, 4400 bytes... about an 
order of magnitude.  That will significantly reduce the amount of metadata 
we are able to cache in RAM.

My gut tells me 4K or 16K checksums for SSD (4400 or 1300 byte onodes, 
respectively), and 128K for HDD (~430 byte onodes; only 128 bytes of csums, 
less than double the current size)...
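
Spelling out that arithmetic (32-bit checksums, a 4MB object, ~300 byte
base onode):

  #include <cstdio>
  #include <initializer_list>

  // Checksum metadata per onode for a 4MB object at various checksum
  // block sizes, 4 bytes per checksum, on top of a ~300 byte base onode.
  int main() {
    const unsigned object_size = 4u << 20;   // 4MB
    const unsigned base_onode  = 300;        // rough current onode size
    for (unsigned blk : {4u << 10, 16u << 10, 128u << 10}) {
      unsigned ncsum = object_size / blk;
      unsigned bytes = ncsum * 4;
      std::printf("%4uK csum blocks: %4u csums, %4u bytes -> ~%u byte onode\n",
                  blk >> 10, ncsum, bytes, base_onode + bytes);
    }
    return 0;
  }

  // -> 4K: 4096 csum bytes (~4400 byte onode), 16K: 1024 (~1300),
  //    128K: 128 (~430)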

sage