Hey Gregory,
first of all, please note that BlueStore doesn't allocate chunks smaller
than min_alloc_size, which is 16K for SSD and 64K for HDD by default.
Depending on the incoming block size there are two different write
procedures:
1) big writes (block size is aligned with min_alloc_size)
2) small writes (block size is less than min_alloc_size). May be
performed via a deferred write or directly, see below.
Large but unaligned blocks are split and pass through a combination of
1) and 2); a rough sketch of the split follows.
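Purely for illustration, here is a minimal Python version of that split;
the names and constants are mine, not the actual BlueStore code:

MIN_ALLOC_SIZE = 64 * 1024  # HDD default; 16K for SSD

def classify_write(offset, length, alloc=MIN_ALLOC_SIZE):
    """Split an incoming write into 'small' (unaligned head/tail)
    and 'big' (whole min_alloc_size chunks) segments."""
    segments = []
    end = offset + length
    cur = offset
    # unaligned head -> small-write path
    if cur % alloc:
        head_end = min((cur // alloc + 1) * alloc, end)
        segments.append(('small', cur, head_end - cur))
        cur = head_end
    # aligned middle -> big-write path
    big_len = (end - cur) // alloc * alloc
    if big_len:
        segments.append(('big', cur, big_len))
        cur += big_len
    # unaligned tail -> small-write path
    if cur < end:
        segments.append(('small', cur, end - cur))
    return segments

# A 150K write at offset 12K on HDD splits into:
# [('small', 12288, 53248), ('big', 65536, 65536), ('small', 131072, 34816)]
print(classify_write(12 * 1024, 150 * 1024))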
1) always triggers allocation of a new blob and writes to a different
location.
2) If we're overwriting existing data, the block passes through the
deferred write procedure, which prepares a 4K-aligned block (by merging
the incoming data with padding and/or non-overlapped data read back from
disk), puts it into the DB, and then overwrites the disk block at
exactly the same location. This provides a sort of WAL (not to be
confused with the one in RocksDB).
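In pseudo-Python, that preparation step looks roughly like this (the
read() callback and the function name are just my illustration, not
real code):

BLOCK = 4096  # disk block size, per the 4K alignment above

def prepare_deferred(offset, data, read):
    """Pad an overwrite out to 4K boundaries, filling the gaps with
    the non-overlapped bytes read back from the existing extent;
    read(ofs, length) returns the current on-disk content."""
    start = offset // BLOCK * BLOCK                   # round down
    end = -(-(offset + len(data)) // BLOCK) * BLOCK   # round up
    head = read(start, offset - start)                # non-overlapped head
    tail = read(offset + len(data), end - (offset + len(data)))  # tail
    return start, head + data + tail                  # staged in the DB

Once the staged copy is durable in the DB, the same bytes are rewritten
in place and the staged copy can be dropped; that ordering is what
provides the WAL-like behavior mentioned above.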
If the write goes to an unused extent and the block size is <=
bluestore_prefer_deferred_size (16K for HDD and 0 for SSD), then the
deferred procedure is applied as well.
Otherwise the block is written directly to disk, along with a new
allocation if needed.
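So, in rough Python form, the deferred-vs-direct decision described
above (bluestore_prefer_deferred_size is a real option; the helper
itself is just my simplification):

PREFER_DEFERRED = 16 * 1024  # bluestore_prefer_deferred_size (HDD); 0 on SSD

def use_deferred_write(overwriting_existing, length):
    """Overwrites of existing data always take the deferred path;
    writes to unused extents do too when small enough, otherwise
    they go directly to disk."""
    return overwriting_existing or length <= PREFER_DEFERRED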
Hence in your case (4K overwrites) the scenario looks like:
AAAA
ABAA + B temporarily put into RocksDB
ACAA + C temporarily put into RocksDB
etc.
And the proposed optimization makes no sense here.
Indeed, one might observe higher DB load this way.
Your original scenario:
AAAA
A[A]AA...B
A[A]AA...[B]...C
A[A]AA...[B]...[C]...D
is rather about big (16K/64K) writes. I'm not sure any optimization is
required here either, except maybe for the case where we want data to
be less fragmented (for HDD?). But I doubt that is feasible.
Hope this helps.
Thanks,
Igor
On 2/15/2019 3:42 AM, Gregory Farnum wrote:
Hey Igor,
I don't know much about the BlueStore allocator pattern, so I don't
have a clear idea how difficult this is.
But I *believe* we have a common pattern in RBD that might be worth
optimizing for: the repeated-overwrites case. Often this would be some
kind of journal header (for the FS stored on top, a database, or
whatever) that results in the same 4KB logical block getting
overwritten repeatedly.
For instance, librbd might write out
AAAA
to an object, then do updates to the second block resulting in a logical
ABAA
ACAA
ADAA
etc.
I think, from my very limited understanding and what I heard when I
asked this in standup, that right now the layout in BlueStore for this
will tend to be something like
AAAA
A[A]AA...B
A[A]AA...[B]...C
A[A]AA...[B]...[C]...D
where the brackets indicate a deallocated [hole]. I expect that to
happen (certainly for the first overwrite) as long as the incoming IO
is large enough to trigger an immediate write to disk and then an
update to the metadata, rather than stuffing the data in the WAL and
then doing a write-in-place.
So I wonder: is there any optimization to try and place incoming data
so that it closes up holes and allows merging the extents/blobs
(sorry, I forget the BlueStore internal terms)? If not, is this a
feasible optimization to try and apply at some point?
That way we could get an on disk layout pattern more like
AAAA
A[A]AA...B
ACAA...[B]
A[C]AA...D
I don't know what the full value of something like this would actually
be, but I was in some discussion recently where it came up that RBD
causes much larger RocksDB usage than RGW does, thanks to the
fragmented layouts it provokes. Cutting that down might be very good
for our long-term performance?
-Greg