On Tue, Feb 19, 2019 at 5:11 PM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>
> On Fri, Feb 15, 2019 at 2:07 PM Igor Fedotov <ifedotov@xxxxxxx> wrote:
> >
> > Hey Gregory,
> >
> > first of all, please note that BlueStore doesn't allocate chunks smaller
> > than min_alloc_size, which is 16K for SSD and 64K for HDD by default.
>
> Oh, I thought it was 4KB for SSD. Maybe I confused it with the
> prefer_deferred_size below.
>
> > Depending on the incoming block size there are two different write
> > procedures:
> >
> > 1) big writes (block size is aligned with min_alloc_size)
> >
> > 2) small writes (block size is less than min_alloc_size). May be
> > performed via deferred write or directly, see below.
> >
> > Large but unaligned blocks are split and pass through a combination of
> > 1) and 2).
> >
> > 1) always triggers allocation of a new blob and writing to a different
> > location.
> >
> > 2) If we're overwriting existing data, the block passes through the
> > deferred write procedure. This prepares a 4K-aligned block (by merging
> > incoming data, padding, and reading non-overlapped data), puts it into
> > the DB, and then overwrites the disk block at exactly the same location.
> > This provides a sort of WAL (not to be confused with the one in RocksDB).
> >
> > If the write goes to an unused extent and the block size is <=
> > bluestore_prefer_deferred_size (16K for HDD and 0 for SSD), then the
> > deferred procedure is applied as well.
> >
> > Otherwise the block is written directly to disk, along with allocation
> > if needed.
> >
> > Hence in your case (4K overwrites) the scenario looks like:
> >
> > AAAA
> >
> > ABAA + B temporarily put into RocksDB
> >
> > ACAA + C temporarily put into RocksDB
> >
> > etc.
> >
> > And the proposed optimization makes no sense.
> >
> > Indeed, one might observe higher DB load this way.
> >
> > Your original scenario:
> >
> > AAAA
> > A[A]AA...B
> > A[A]AA...[B]...C
> > A[A]AA...[B]...[C]...D
> >
> > is rather about big (16K/64K) writes.
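[Editor's note: the branching Igor describes above can be restated as a small sketch. This is hypothetical Python, not BlueStore's actual code: classify_write and the returned labels are invented, and the constants use the SSD defaults quoted in the thread.]

```python
# Rough sketch of the write-path branching described above. Not actual
# BlueStore code; the function name and return labels are made up.

MIN_ALLOC_SIZE = 16 * 1024        # SSD default (64K for HDD)
PREFER_DEFERRED_SIZE = 0          # SSD default (16K for HDD)

def classify_write(offset, length, overwrites_existing):
    """Return which of the described write paths an IO would take."""
    if offset % MIN_ALLOC_SIZE == 0 and length % MIN_ALLOC_SIZE == 0:
        # 1) big write: allocate a new blob, write to a new location
        return "big"
    if length >= MIN_ALLOC_SIZE:
        # large but unaligned: split into big and small parts
        return "split"
    if overwrites_existing:
        # 2) small overwrite: merge/pad to a 4K-aligned block, stage it
        #    in the DB, then overwrite in place (the deferred "WAL")
        return "small-deferred"
    if length <= PREFER_DEFERRED_SIZE:
        # small write to an unused extent, below the deferred threshold
        return "small-deferred"
    # otherwise: write directly to disk, allocating if needed
    return "small-direct"
```

Under this sketch, a 4K overwrite always takes the deferred path, which is why the ABAA/ACAA scenario above stays in place on disk.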
> > Not sure if any optimization is required here either. Maybe except the
> > case when we want data to be less fragmented (for HDD?). But I doubt
> > this is feasible.
>
> Well, there are two reasons to prefer defragmented data:
> 1) sequential read speeds are higher than random, even on SSD.
> 2) defragmented data means we have to store fewer extents in RocksDB.
>
> I definitely don't know how expensive that is, though!
>
> Or indeed if we actually get IO patterns that would trigger this. I
> was thinking specifically of RBD, so if it's not going to trigger
> patterns like this it's not worth worrying about. Jason, do we have
> any data on what IO sizes tend to look like?

I only have IO data on fio workloads, mongodb, and a commercial VM
backup tool -- all of which were (unfortunately) running against the
Luminous BlueStore stupid allocator instead of the current bitmap
allocator. When paired with a BBU controller w/ writeback and readahead
capabilities, I think it's fair to say that <4MiB sequential IOs are
impacted. Even though librbd sends an allocation hint with each IO, that
is disregarded under BlueStore [1], so you can really only expect
<min-alloc-size> chunks of locality.

For a real-world workload, the mongodb IO sizes were averaging 8-16KiB,
I believe (unknown if a sequential or random pattern, but even if
random, the data would still be virtually nearby seek-wise). The backup
software issued around 130KiB sequential IOs, but from BlueStore's PoV
that translated to lots of 32KiB-64KiB random IOs with long seeks [2].

> -Greg
>
> > Hope this helps.
> >
> > Thanks,
> >
> > Igor
> >
> > On 2/15/2019 3:42 AM, Gregory Farnum wrote:
> > > Hey Igor,
> > > I don't know much about the BlueStore allocator pattern, so I don't
> > > have a clear idea how difficult this is.
> > > But I *believe* we have a common pattern in RBD that might be worth
> > > optimizing for: the repeated-overwrites case.
> > > Often this would be some kind of journal header — either for the FS
> > > stored on top, a database, or whatever — that results in the same 4KB
> > > logical block getting overwritten repeatedly.
> > >
> > > For instance, librbd might write out
> > > AAAA
> > > to an object, then do updates to the second block resulting in a logical
> > > ABAA
> > > ACAA
> > > ADAA
> > > etc.
> > >
> > > I think, from my very limited understanding and what I heard when I
> > > asked this in standup, that right now the layout in BlueStore for this
> > > will tend to be something like
> > > AAAA
> > > A[A]AA...B
> > > A[A]AA...[B]...C
> > > A[A]AA...[B]...[C]...D
> > > where the brackets indicate a deallocated [hole]. I expect that to
> > > happen (certainly for the first overwrite) as long as the incoming IO
> > > is large enough to trigger an immediate write to disk and then an
> > > update to the metadata, rather than stuffing the data in the WAL and
> > > then doing a write-in-place.
> > >
> > > So I wonder: is there any optimization to try to place incoming data
> > > so that it closes up holes and allows merging the extents/blobs
> > > (sorry, I forget the BlueStore internal terms)? If not, is this a
> > > feasible optimization to try to apply at some point?
> > > That way we could get an on-disk layout pattern more like
> > > AAAA
> > > A[A]AA...B
> > > ACAA...[B]
> > > A[C]AA...D
> > >
> > > I don't know what the full value of something like this would actually
> > > be, but I was in some discussion recently where it came up that RBD
> > > causes much larger RocksDB usage than RGW does, thanks to the
> > > fragmented layouts it provokes. Cutting that down might be very good
> > > for our long-term performance?
> > > -Greg

[1] I think it used to control the CRC size but now it doesn't (?)
[2] https://raw.githubusercontent.com/dillaman/public/master/bluestore/high%20latency%20IO.png
(stupid allocator-induced seeks)

--
Jason
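[Editor's note: the extent-count effect Greg points at can be shown with a toy model. This is hypothetical Python, not BlueStore's real extent map: each overwrite placed at a fresh location splits the object's logical-to-physical mapping into more extents, while reusing the hole from the previous overwrite merges it back together.]

```python
# Toy model of an object's logical->physical extent map under repeated
# overwrites of one block. Hypothetical, not BlueStore code -- it just
# illustrates why filling the hole left by the previous overwrite keeps
# the extent map (and the RocksDB metadata that stores it) smaller.

def count_extents(layout):
    """Count runs of physically contiguous blocks in the mapping."""
    extents = 1
    for prev, cur in zip(layout, layout[1:]):
        if cur != prev + 1:
            extents += 1
    return extents

def overwrite_append(layout, block, next_free):
    """Place the new data at a fresh physical location (the current
    behavior, leaving a deallocated [hole] behind)."""
    layout = list(layout)
    layout[block] = next_free
    return layout, next_free + 1

def overwrite_reuse_hole(layout, block, hole):
    """Place the new data into a previously freed hole (the proposed
    optimization), re-merging the object into fewer extents."""
    layout = list(layout)
    layout[block] = hole
    return layout

layout = [0, 1, 2, 3]                     # AAAA, fully contiguous
frag, _ = overwrite_append(layout, 1, 4)  # A[A]AA...B
print(count_extents(frag))                # 3 extents
merged = overwrite_reuse_hole(frag, 1, 1) # ACAA...[B]
print(count_extents(merged))              # 1 extent again
```

In the model, the fresh-placement overwrite leaves the object in three extents, while placing the next overwrite into the old hole collapses it back to one, which is the RocksDB-metadata saving Greg describes.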