BlueStore: minimizing blobs in overwrite cases?

Gregory Farnum <gfarnum@xxxxxxxxxx> · Thu, 14 Feb 2019 16:42:50 -0800

Hey Igor,
I don't know much about the BlueStore allocator pattern, so I don't
have a clear idea how difficult this is.
But I *believe* we have a common pattern in RBD that might be worth
optimizing for: the repeated-overwrites case. Often this would be some
kind of journal header (for the FS stored on top, a database, or
whatever) that results in the same 4KB logical block getting
overwritten repeatedly.

For instance, librbd might write out
AAAA
to an object, then update the second block repeatedly, giving logical
contents of
ABAA
ACAA
ADAA
etc.

I think, from my very limited understanding and what I heard when I
asked this in standup, that right now the layout in BlueStore for this
will tend to be something like
AAAA
A[A]AA...B
A[A]AA...[B]...C
A[A]AA...[B]...[C]...D
where the brackets indicate a deallocated [hole]. I expect that to
happen (certainly for the first overwrite) as long as the incoming IO
is large enough to trigger an immediate write to disk and then an
update to the metadata, rather than stuffing the data in the WAL and
then doing a write-in-place.
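
To make that pattern concrete, here is a toy Python model of it. This
is only a sketch of my understanding, not BlueStore code: the names
(count_extents, simulate_fresh_space), the 4-block object, and the
"fresh space starts at physical block 100" cursor are all made up for
illustration. Every overwrite of the target block is redirected to the
next never-used physical block, and we count how many physically
contiguous runs the logical-to-physical map needs.

```python
def count_extents(lmap):
    """Count physically contiguous runs in a logical->physical block map."""
    return 1 + sum(1 for a, b in zip(lmap, lmap[1:]) if b != a + 1)


def simulate_fresh_space(n_overwrites, obj_blocks=4, target=1, start=100):
    """Toy model: each overwrite of `target` goes to never-used space.

    `start` is a hypothetical cursor marking where fresh space begins;
    the old physical block simply becomes a hole and is never reused.
    """
    lmap = list(range(obj_blocks))      # AAAA: blocks 0..3, one extent
    cursor = start
    history = [count_extents(lmap)]
    for _ in range(n_overwrites):
        lmap[target] = cursor           # new data lands past the object
        cursor += 1
        history.append(count_extents(lmap))
    return lmap, history


if __name__ == "__main__":
    lmap, history = simulate_fresh_space(3)
    print(lmap)      # [0, 102, 2, 3]
    print(history)   # [1, 3, 3, 3]
```

In this model the object goes from one extent to three on the first
overwrite and never recovers, which is the fragmentation I am worried
about.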

So I wonder: is there any optimization to try and place incoming data
so that it closes up holes and allows merging the extents/blobs
(sorry, I forget the BlueStore internal terms)? If not, is this a
feasible optimization to try and apply at some point?
That way we could get an on-disk layout pattern more like
AAAA
A[A]AA...B
ACAA...[B]
A[C]AA...D
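
As a rough sketch of the placement policy I have in mind (again a toy
Python model, not BlueStore code; overwrite_fill_hole, home_slot, and
the free-set/fresh-cursor bookkeeping are hypothetical names and
simplifications of mine): when redirecting an overwrite, prefer a free
physical block that sits right next to the object's neighbouring
block, fall back to any other freed block, and only then take fresh
space.

```python
def count_extents(lmap):
    """Count physically contiguous runs in a logical->physical block map."""
    return 1 + sum(1 for a, b in zip(lmap, lmap[1:]) if b != a + 1)


def home_slot(lmap, lblock):
    """Physical block that would make `lblock` contiguous with a neighbour."""
    if lblock > 0:
        return lmap[lblock - 1] + 1
    if len(lmap) > 1:
        return lmap[1] - 1
    return None                         # single-block object: no neighbour


def overwrite_fill_hole(lmap, free, lblock, fresh):
    """Redirect one overwrite, preferring the hole beside the neighbours.

    The old block is released only after the new location is chosen, so
    the write is never done in place. Returns the updated fresh cursor.
    """
    old = lmap[lblock]
    home = home_slot(lmap, lblock)
    if home in free:                    # close the hole, extents can merge
        new = home
        free.remove(new)
    elif free:                          # reuse some other freed block
        new = min(free)
        free.remove(new)
    else:                               # nothing to reuse: take fresh space
        new = fresh
        fresh += 1
    lmap[lblock] = new
    free.add(old)
    return fresh


def simulate_fill_hole(n_overwrites, obj_blocks=4, target=1, start=100):
    lmap, free, fresh = list(range(obj_blocks)), set(), start
    history = [count_extents(lmap)]
    for _ in range(n_overwrites):
        fresh = overwrite_fill_hole(lmap, free, target, fresh)
        history.append(count_extents(lmap))
    return lmap, history


if __name__ == "__main__":
    lmap, history = simulate_fill_hole(3)
    print(lmap)      # [0, 100, 2, 3]
    print(history)   # [1, 3, 1, 3]
```

In this model the overwritten block bounces between its original slot
and one spare block, so every second overwrite collapses the object
back to a single extent instead of leaving it at three.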

I don't know what the full value of something like this would actually
be, but I was in some discussion recently where it came up that RBD
causes much larger RocksDB usage than RGW does, thanks to the
fragmented layouts it provokes. Cutting that down might be very good
for our long-term performance?
-Greg