Hey Gregory,
first of all, please note that BlueStore doesn't allocate chunks smaller
than min_alloc_size, which is 16K for SSD and 64K for HDD by default.
Depending on the incoming block size there are two different write
procedures:
1) big writes (block size is aligned with min_alloc_size)
2) small writes (block size is less than min_alloc_size). May be
performed via a deferred write or directly, see below.
Large but unaligned blocks are split and pass through a combination of
1) and 2); a rough sketch of the split follows.
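Purely for illustration, here is a minimal Python version of that split;
the names and constants are mine, not the actual BlueStore code:

MIN_ALLOC_SIZE = 64 * 1024  # HDD default; 16K for SSD

def classify_write(offset, length, alloc=MIN_ALLOC_SIZE):
    """Split an incoming write into 'small' (unaligned head/tail)
    and 'big' (whole min_alloc_size chunks) segments."""
    segments = []
    end = offset + length
    cur = offset
    # unaligned head -> small-write path
    if cur % alloc:
        head_end = min((cur // alloc + 1) * alloc, end)
        segments.append(('small', cur, head_end - cur))
        cur = head_end
    # aligned middle -> big-write path
    big_len = (end - cur) // alloc * alloc
    if big_len:
        segments.append(('big', cur, big_len))
        cur += big_len
    # unaligned tail -> small-write path
    if cur < end:
        segments.append(('small', cur, end - cur))
    return segments

# A 150K write at offset 12K on HDD splits into:
# [('small', 12288, 53248), ('big', 65536, 65536), ('small', 131072, 34816)]
print(classify_write(12 * 1024, 150 * 1024))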
1) always triggers allocation of a new blob and writes to a different
location.
2) If we're overwriting existing data, the block passes through the
deferred write procedure, which prepares a 4K-aligned block (by merging
the incoming data with padding and/or non-overlapped data read back from
disk), puts it into the DB, and then overwrites the disk block at
exactly the same location. This provides a sort of WAL (not to be
confused with the one in RocksDB).
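In pseudo-Python, that preparation step looks roughly like this (the
read() callback and the function name are just my illustration, not
real code):

BLOCK = 4096  # disk block size, per the 4K alignment above

def prepare_deferred(offset, data, read):
    """Pad an overwrite out to 4K boundaries, filling the gaps with
    the non-overlapped bytes read back from the existing extent;
    read(ofs, length) returns the current on-disk content."""
    start = offset // BLOCK * BLOCK                   # round down
    end = -(-(offset + len(data)) // BLOCK) * BLOCK   # round up
    head = read(start, offset - start)                # non-overlapped head
    tail = read(offset + len(data), end - (offset + len(data)))  # tail
    return start, head + data + tail                  # staged in the DB

Once the staged copy is durable in the DB, the same bytes are rewritten
in place and the staged copy can be dropped; that ordering is what
provides the WAL-like behavior mentioned above.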
If the write goes to an unused extent and the block size is <=
bluestore_prefer_deferred_size (16K for HDD and 0 for SSD), then the
deferred procedure is applied as well.
Otherwise the block is written directly to disk, along with a new
allocation if needed.
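So, in rough Python form, the deferred-vs-direct decision described
above (bluestore_prefer_deferred_size is a real option; the helper
itself is just my simplification):

PREFER_DEFERRED = 16 * 1024  # bluestore_prefer_deferred_size (HDD); 0 on SSD

def use_deferred_write(overwriting_existing, length):
    """Overwrites of existing data always take the deferred path;
    writes to unused extents do too when small enough, otherwise
    they go directly to disk."""
    return overwriting_existing or length <= PREFER_DEFERRED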
Hence in your case (4K overwrites) the scenario looks like:
AAAA
ABAA + B temporarily put into RocksDB
ACAA + C temporarily put into RocksDB
etc.
And the proposed optimization makes no sense here.
Indeed, one might observe higher DB load this way.
Your original scenario:
AAAA
A[A]AA...B
A[A]AA...[B]...C
A[A]AA...[B]...[C]...D
is rather about big (16K/64K) writes. I'm not sure any optimization is
required here either, except maybe for the case where we want data to
be less fragmented (for HDD?). But I doubt that is feasible.
Hope this helps.
Thanks,
Igor
On 2/15/2019 3:42 AM, Gregory Farnum wrote:
Hey Igor,
I don't know much about the BlueStore allocator pattern, so I don't
have a clear idea how difficult this is.
But I *believe* we have a common pattern in RBD that might be worth
optimizing for: the repeated-overwrites case. Often this would be some
kind of journal header (for the FS stored on top, a database, or
whatever) that results in the same 4KB logical block getting
overwritten repeatedly.
For instance, librbd might write out
AAAA
to an object, then do updates to the second block resulting in a logical
ABAA
ACAA
ADAA
etc.
I think, from my very limited understanding and what I heard when I
asked this in standup, that right now the layout in BlueStore for this
will tend to be something like
AAAA
A[A]AA...B
A[A]AA...[B]...C
A[A]AA...[B]...[C]...D
where the brackets indicate a deallocated [hole]. I expect that to
happen (certainly for the first overwrite) as long as the incoming IO
is large enough to trigger an immediate write to disk and then an
update to the metadata, rather than stuffing the data in the WAL and
then doing a write-in-place.
So I wonder: is there any optimization to try and place incoming data
so that it closes up holes and allows merging the extents/blobs
(sorry, I forget the BlueStore internal terms)? If not, is this a
feasible optimization to try and apply at some point?
That way we could get an on disk layout pattern more like
AAAA
A[A]AA...B
ACAA...[B]
A[C]AA...D
I don't know what the full value of something like this would actually
be, but I was in some discussion recently where it came up that RBD
causes much larger RocksDB usage than RGW does, thanks to the
fragmented layouts it provokes. Cutting that down might be very good
for our long-term performance?
-Greg