Re: BlueStore: minimizing blobs in overwrite cases?

Hey Gregory,

First of all, please note that BlueStore doesn't allocate chunks less than min_alloc_size, which is 16K for SSD and 64K for HDD by default.
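
To illustrate, here is a minimal sketch (not BlueStore's actual code, just the arithmetic implied above): any allocation request is rounded up to the next min_alloc_size boundary, so a 4K write still consumes a full 16K/64K chunk.

  #include <cstdint>
  #include <iostream>

  // hypothetical constants mirroring the defaults mentioned above
  constexpr uint64_t MIN_ALLOC_SSD = 16 * 1024;
  constexpr uint64_t MIN_ALLOC_HDD = 64 * 1024;

  // round an allocation request up to a multiple of min_alloc_size
  uint64_t round_up_to_alloc(uint64_t len, uint64_t min_alloc_size) {
    return ((len + min_alloc_size - 1) / min_alloc_size) * min_alloc_size;
  }

  int main() {
    std::cout << round_up_to_alloc(4096, MIN_ALLOC_SSD) << "\n";  // 16384
    std::cout << round_up_to_alloc(4096, MIN_ALLOC_HDD) << "\n";  // 65536
  }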

Depending on the incoming block size there are two different write procedures:

1) big writes (block size is aligned with min_alloc_size)

2) small writes (block size is less than min_alloc_size). May be performed via a deferred write or directly, see below

Large but unaligned blocks are split and pass through a combination of 1) and 2), as in the sketch below.
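
A rough sketch of such a split (assumed, simplified logic; not the actual BlueStore code): the min_alloc_size-aligned middle part becomes a "big" write, any unaligned head/tail becomes "small" writes.

  #include <cstdint>
  #include <vector>

  struct Piece { uint64_t off; uint64_t len; bool big; };

  // split [off, off+len) into aligned "big" and unaligned "small" pieces
  std::vector<Piece> split_write(uint64_t off, uint64_t len, uint64_t min_alloc_size) {
    std::vector<Piece> out;
    uint64_t end = off + len;
    uint64_t mid_begin = (off + min_alloc_size - 1) / min_alloc_size * min_alloc_size;  // align up
    uint64_t mid_end = end / min_alloc_size * min_alloc_size;                            // align down
    if (mid_begin >= mid_end) {              // no aligned middle: everything is a small write
      out.push_back({off, len, false});
      return out;
    }
    if (off < mid_begin)
      out.push_back({off, mid_begin - off, false});          // unaligned head -> small write
    out.push_back({mid_begin, mid_end - mid_begin, true});   // aligned middle -> big write
    if (mid_end < end)
      out.push_back({mid_end, end - mid_end, false});        // unaligned tail -> small write
    return out;
  }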


1) always triggers allocation of a new blob and writing to a different location.

2) If we're overwriting existing data, the block passes through the deferred write procedure, which prepares a 4K-aligned block (by merging the incoming data, padding, and reading in non-overlapped data), puts it into the DB and then performs the disk block overwrite at exactly the same location. This way a sort of WAL is provided (not to be confused with the one in RocksDB).
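
Roughly like this (an assumed sketch of the merge step; read_existing() is a hypothetical stand-in, not a real BlueStore function):

  #include <cstdint>
  #include <cstring>
  #include <vector>

  constexpr uint64_t BLOCK = 4096;  // deferred writes operate on 4K-aligned blocks

  // hypothetical stand-in: the real code would read the current on-disk
  // contents of the aligned region (the non-overlapped data)
  std::vector<char> read_existing(uint64_t /*aligned_off*/, uint64_t aligned_len) {
    return std::vector<char>(aligned_len, 0);  // pretend the extent is zeroed
  }

  // build the 4K-aligned buffer that is staged in the KV DB and later
  // replayed to exactly the same disk location
  std::vector<char> prepare_deferred(uint64_t off, const std::vector<char>& data) {
    uint64_t aligned_off = off / BLOCK * BLOCK;
    uint64_t end = off + data.size();
    uint64_t aligned_len = (end - aligned_off + BLOCK - 1) / BLOCK * BLOCK;
    std::vector<char> buf = read_existing(aligned_off, aligned_len);           // existing bytes + padding
    std::memcpy(buf.data() + (off - aligned_off), data.data(), data.size());   // merge incoming data
    return buf;
  }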

If the write goes to an unused extent and the block size is <= bluestore_prefer_deferred_size (16K for HDD and 0 for SSD), then the deferred procedure is applied as well.

Otherwise the block is written directly to disk, with allocation performed if needed.
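
So the small-write decision boils down to something like the following (again a simplified sketch of the rules above, not the actual implementation):

  #include <cstdint>

  enum class Path { Deferred, Direct };

  Path choose_small_write_path(bool overwrites_existing_data,
                               uint64_t len,
                               uint64_t prefer_deferred_size)  // 16K HDD, 0 SSD by default
  {
    if (overwrites_existing_data)
      return Path::Deferred;      // WAL-style: stage in DB, then overwrite in place
    if (len <= prefer_deferred_size)
      return Path::Deferred;      // small write to unused extent, still deferred
    return Path::Direct;          // write directly to disk, allocating if needed
  }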


Hence in your case (4K overwrites) the scenario looks like:

AAAA

ABAA + B temporarily put into RocksDB

ACAA + C temporarily put into RocksDB

etc.

And the proposed optimization makes no sense.

Indeed, one might observe higher DB load this way.

Your original scenario:

AAAA
A[A]AA...B
A[A]AA...[B]...C
A[A]AA...[B]...[C]...D

is rather about big (16K/64K) writes. I'm not sure any optimization is required here either, except maybe for the case where we want data to be less fragmented (for HDD?). But I doubt this is feasible.


Hope this helps.

Thanks,

Igor



On 2/15/2019 3:42 AM, Gregory Farnum wrote:
Hey Igor,
I don't know much about the BlueStore allocator pattern, so I don't
have a clear idea how difficult this is.
But I *believe* we have a common pattern in RBD that might be worth
optimizing for: the repeated-overwrites case. Often this would be some
kind of journal header — either for the FS stored on top, a database,
or whatever, that results in the same 4KB logical block getting
overwritten repeatedly.

For instance, librbd might write out
AAAA
to an object, then do updates to the second block resulting in a logical
ABAA
ACAA
ADAA
etc.

I think, from my very limited understanding and what I heard when I
asked this in standup, that right now the layout in BlueStore for this
will tend to be something like
AAAA
A[A]AA...B
A[A]AA...[B]...C
A[A]AA...[B]...[C]...D
where the brackets indicate a deallocated [hole]. I expect that to
happen (certainly for the first overwrite) as long as the incoming IO
is large enough to trigger an immediate write to disk and then an
update to the metadata, rather than stuffing the data in the WAL and
then doing a write-in-place.

So I wonder: is there any optimization to try and place incoming data
so that it closes up holes and allows merging the extents/blobs
(sorry, I forget the BlueStore internal terms)? If not, is this a
feasible optimization to try and apply at some point?
That way we could get an on disk layout pattern more like
AAAA
A[A]AA...B
ACAA...[B]
A[C]AA...D

I don't know what the full value of something like this would actually
be, but I was in some discussion recently where it came up that RBD
causes much larger RocksDB usage than RGW does, thanks to the
fragmented layouts it provokes. Cutting that down might be very good
for our long-term performance?
-Greg


