> -----Original Message-----
> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Sage Weil
> Sent: Thursday, June 16, 2016 1:58 PM
> To: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>
> Cc: Mark Nelson <mnelson@xxxxxxxxxx>; Evgeniy Firsov <Evgeniy.Firsov@xxxxxxxxxxx>; Jianjian Huo <jianjian.huo@xxxxxxxxxxx>; Somnath Roy <Somnath.Roy@xxxxxxxxxxx>; Igor Fedotov <ifedotov@xxxxxxxxxxxx>; Manavalan Krishnan <Manavalan.Krishnan@xxxxxxxxxxx>; Varada Kari <Varada.Kari@xxxxxxxxxxx>; Ramesh Chander <Ramesh.Chander@xxxxxxxxxxx>; ceph-devel@xxxxxxxxxxxxxxx
> Subject: RE: bluestore onode encoding efficiency
>
> On Thu, 16 Jun 2016, Allen Samuels wrote:
> > > -----Original Message-----
> > > From: Sage Weil [mailto:sage@xxxxxxxxxxxx]
> > > Sent: Thursday, June 16, 2016 9:47 AM
> > > To: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>
> > > Cc: Mark Nelson <mnelson@xxxxxxxxxx>; Evgeniy Firsov <Evgeniy.Firsov@xxxxxxxxxxx>; Jianjian Huo <jianjian.huo@xxxxxxxxxxx>; Somnath Roy <Somnath.Roy@xxxxxxxxxxx>; Igor Fedotov <ifedotov@xxxxxxxxxxxx>; Manavalan Krishnan <Manavalan.Krishnan@xxxxxxxxxxx>; Varada Kari <Varada.Kari@xxxxxxxxxxx>; Ramesh Chander <Ramesh.Chander@xxxxxxxxxxx>; ceph-devel@xxxxxxxxxxxxxxx
> > > Subject: Re: bluestore onode encoding efficiency
> > >
> > > Based on some of Allen's comments I've updated my branch with (so
> > > far) three different encoders:
> > >
> > > 1) varint - general purpose small integers (lops off high and low
> > >    zero bits)
> > >
> > >    first byte:
> > >      low 2 bits = how many low nibbles of zeros
> > >      5 bits = data
> > >      1 high bit = another byte follows
> > >    subsequent bytes:
> > >      7 bits = data
> > >      1 high bit = another byte follows
> > >
> > > 2) delta varint
> > >
> > >    first byte:
> > >      1 low bit = sign (0 = positive, 1 = negative)
> > >      next 2 bits = how many low nibbles of zeros
> > >      4 bits = data
> > >      1 high bit = another byte follows
> > >    subsequent bytes:
> > >      7 bits = data
> > >      1 high bit = another byte follows
> > >
> > > 3) raw lba:
> > >
> > >    first 3 bytes:
> > >      low 2 bits = how many low bits of zeros
> > >        00 = none
> > >        01 = 12 (4k alignment)
> > >        10 = 16 (64k alignment)
> > >        11 = 20 (256k alignment)
> > >      21 bits = data
> > >      1 high bit = another byte follows
> > >    subsequent bytes:
> > >      7 bits = data
> > >      1 high bit = another byte follows
> >
> > Let's do some math here :)
> >
> > Let's say I want to optimize for 4, 8, 16 and 32 TB devices going
> > forward. That's 2^42, 2^43, 2^44 and 2^45 bytes respectively.
> >
> > If we assume a 4K block size/alignment, then we need 30, 31, 32 and
> > 33 significant bits after downshifting for encoding.
> >
> > That means for 30 bits I'll have 21 + 7 + 2 encoded bits, which
> > requires 5 bytes for a 4TB device. However, 1/2 of the addresses only
> > need 29 bits (and 1/4 need 28, etc.), so the approximate blended size
> > is about 4.75 bytes for a 4TB device (1/4 of the addresses can save a
> > byte) and about 4.875 bytes for 8TB (1/8 of the addresses can save a
> > byte). We won't need another byte until we go beyond 128TB devices.
> >
> > If we switch to 64K alignment (reasonable for many HDD use-cases),
> > then the numbers change to 26, 27, 28 and 29 bits respectively.
> >
> > That's 21 + 5 for 4TB, which gets encoded in 4 bytes. 8 and 16TB also
> > take 4 bytes; you need 32TB before you need 5 bytes.
> >
> > If we change the encoding above so that the first chunk is 4 bytes
> > (easier to deal with :)) and leave everything else alone, then we
> > have 29 bits of mantissa. Now for a 4K align / 4TB device you only
> > need 4.5 bytes, which is a savings of .25 bytes, and that could
> > easily be significant when you have up to 1K phys addrs / oNode.
>
> Yeah, I did my math wrong... I thought I was getting 3 bytes for ~1TB
> devices at 64K alignment. Not that those drives are common even now
> anyway, though. Moving to 4 bytes will be a faster encode/decode too.
>
> > I could argue for a skew on the format encoding, i.e., 0x for 4K, 110
> > for 16K, 111 for byte align, etc., and gain another bit, picking up a
> > further .5 bytes on a 4TB device.
>
> The other nice thing about this is we get more options for dropping
> low bits:
>
>   0*   = 12 (4k)
>   100* = byte align
>   101* = 16 (64k)
>   110* = 20 (256k)
>   111* = 24 (1024k)
>
> or perhaps go by 3's.
>
> BTW, here's the first set of size cleanups. It rips out unused fields,
> including the overlay stuff. We can re-add it later if we decide it's
> a strategy worth pursuing.
>
> https://github.com/ceph/ceph/pull/9756

Let's merge this ASAP, it's SOOOO much cleaner :)

> sage
>
> > The 16-bit alignment case isn't materially affected by this.
> >
> > In conclusion, I think -- at a minimum -- you should switch to a
> > first 4 bytes (rather than a first 3 bytes) for this use case. It
> > doesn't seem to have any negatives, and there are significant
> > positives.
> >
> > A weaker case would be to switch the two-bit encoding to something
> > that favors 4K (the likely most prevalent) alignment, picking up
> > another bit before a multi-byte encoding is needed.
> >
> > > 4) lba delta (distance between two lba's, e.g., when encoding a
> > >    list of extents)
> > >
> > >    first byte:
> > >      1 low bit = sign (0 = positive, 1 = negative)
> > >      2 bits = how many low bits of zeros
> > >        00 = none
> > >        01 = 12 (4k alignment)
> > >        10 = 16 (64k alignment)
> > >        11 = 20 (256k alignment)
> > >      4 bits = data
> > >      1 bit = another byte follows
> > >    subsequent bytes:
> > >      7 bits = data
> > >      1 bit = another byte follows
> > >
> > > Notably on this one we have 4 bits of data *and* when we roll over
> > > to the next value you'll get 4 trailing 0's and we ask for one more
> > > nibble of trailing 0's... still in one encoded byte.
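[Editor's note: a matching sketch of encoder #4 (lba delta) as just described: a sign bit, a 2-bit shift selector (0/12/16/20 low zero bits), 4 data bits, and a high continuation bit in the first byte, then 7-bits-per-byte continuation bytes. Exact bit placement is again an assumption, not the actual implementation.]

```python
# Sketch of the "lba delta" encoding from the thread (illustrative only).

def encode_lba_delta(delta: int) -> bytes:
    sign = 1 if delta < 0 else 0
    v = -delta if sign else delta
    # Pick the largest alignment the magnitude satisfies.
    for sel, shift in ((0b11, 20), (0b10, 16), (0b01, 12), (0b00, 0)):
        if v & ((1 << shift) - 1) == 0:
            break
    v >>= shift
    b = sign | (sel << 1) | ((v & 0xf) << 3)   # sign + selector + 4 data bits
    v >>= 4
    if v:
        b |= 0x80                              # "another byte follows"
    out = bytearray([b])
    while v:
        c = v & 0x7f
        v >>= 7
        if v:
            c |= 0x80
        out.append(c)
    return bytes(out)

def decode_lba_delta(buf: bytes):
    b = buf[0]
    sign = b & 1
    shift = (0, 12, 16, 20)[(b >> 1) & 0b11]
    v = (b >> 3) & 0xf
    pos, i, more = 4, 1, bool(b & 0x80)
    while more:
        c = buf[i]
        i += 1
        v |= (c & 0x7f) << pos
        pos += 7
        more = bool(c & 0x80)
    v <<= shift
    return (-v if sign else v), i   # (delta, bytes consumed)
```

This illustrates the one-byte point above: any delta that is a multiple of 4K, 64K, or 256K with a magnitude of at most 15 units of that alignment (e.g. +64K, or -4K) still encodes in a single byte.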
> > >
> > > I think this'll be a decent set of building blocks for encoding
> > > the existing structures efficiently (and still in a generic way)
> > > before getting specific with common patterns.
> > >
> > > https://github.com/ceph/ceph/pull/9728/files
> > >
> > > sage
> > >
> > > On Wed, 15 Jun 2016, Sage Weil wrote:
> > > >
> > > > If we have those, I'm not sure #1 will be worth it -- the zeroed
> > > > offset fields will encode with one byte.
> > > >
> > > > > (3) re-jiggering of blob/extents when possible. Much of the
> > > > > two-level blob/extent map exists to support compression. When
> > > > > you're not compressed you can collapse this into a single blob
> > > > > and avoid the encoding overhead for it.
> > > >
> > > > Hmm, good idea. As long as the csum parameters match we can do
> > > > this. The existing function
> > > >
> > > >   int bluestore_onode_t::compress_extent_map()
> > > >
> > > > currently just combines consecutive lextents that point to
> > > > contiguous regions in the same blob. We could extend this to
> > > > combine blobs that are combinable.
> > > >
> > > > > There are other potential optimizations too that are artifacts
> > > > > of the current code. For example, we support different checksum
> > > > > algorithms/values on a per-blob basis. Clearly moving this to a
> > > > > per-oNode basis is acceptable and would simplify and shrink the
> > > > > encoding even more.
> > > >
> > > > The latest csum branch
> > > >
> > > >   https://github.com/ceph/ceph/pull/9526
> > > >
> > > > varies csum_order on a per-blob basis (for example, larger csum
> > > > chunks for compressed blobs and small csum chunks for uncompressed
> > > > blobs with 4k overwrites). The alg is probably consistent across
> > > > the onode, but it will uglify the code a bit to pass it into the
> > > > blob_t csum methods. I'd prefer to hold off on this. With the
> > > > varint encoding above it'll only be one byte per blob at least.
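[Editor's note: a simplified model of the extent-map compaction described above, i.e. merging consecutive lextents that are contiguous both logically and within the same blob. The `Lextent` fields here are illustrative stand-ins, not the real BlueStore C++ structures.]

```python
# Toy model of compress_extent_map()-style merging; names are hypothetical.
from collections import namedtuple

Lextent = namedtuple('Lextent', 'logical_off blob blob_off length')

def compress_extent_map(extents):
    out = []
    for e in extents:
        if out:
            p = out[-1]
            if (p.blob == e.blob
                    and p.logical_off + p.length == e.logical_off
                    and p.blob_off + p.length == e.blob_off):
                # Contiguous in both spaces: extend the previous extent.
                out[-1] = p._replace(length=p.length + e.length)
                continue
        out.append(e)
    return out
```

Extending this to also collapse separate-but-combinable blobs would additionally require checking that the blobs' csum parameters (and compression state) match, per the discussion above.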
> > > >
> > > > sage
>

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html