On 11/22/2019 12:50 AM, Sage Weil wrote:
The key difference (at the bluestore level) between RGW and RBD/CephFS
writes is that RGW passes down the CEPH_OSD_ALLOC_HINT_FLAG_IMMUTABLE |
CEPH_OSD_ALLOC_HINT_FLAG_APPEND_ONLY hints. The immutable one in
particular is what we really care about, since it's the mutable objects
that get overwrites that lead to (most) fragmentation. We should use this
to decide whether to create minimal (min_alloc_size) blobs or whether we
should keep the blobs larger to limit fragmentation.
I'm not sure what we would call the config option that isn't super
confusing... maybe bluestore_mutable_min_blob_size?
bluestore_baseline_min_blob_size?
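
A minimal sketch of the decision described above, not actual BlueStore code. The flag bit values are placeholders for illustration only (the authoritative definitions live in the Ceph headers), and MIN_ALLOC_SIZE, MUTABLE_MIN_BLOB_SIZE and min_blob_size() are hypothetical names with assumed values:

```python
# Placeholder bit values for illustration; check the Ceph headers for the
# real definitions before relying on them.
CEPH_OSD_ALLOC_HINT_FLAG_APPEND_ONLY = 1 << 4
CEPH_OSD_ALLOC_HINT_FLAG_IMMUTABLE = 1 << 5

MIN_ALLOC_SIZE = 4 * 1024          # assumed allocation granularity
MUTABLE_MIN_BLOB_SIZE = 64 * 1024  # hypothetical value for the new option


def min_blob_size(alloc_hint_flags: int) -> int:
    """Pick the smallest blob size to create for an object.

    Immutable objects are never overwritten, so minimal blobs do not lead
    to fragmentation. Anything else keeps larger blobs.
    """
    if alloc_hint_flags & CEPH_OSD_ALLOC_HINT_FLAG_IMMUTABLE:
        return MIN_ALLOC_SIZE
    return MUTABLE_MIN_BLOB_SIZE


# RGW-style hints versus no hints at all (RBD/CephFS-style).
rgw_hints = (CEPH_OSD_ALLOC_HINT_FLAG_IMMUTABLE
             | CEPH_OSD_ALLOC_HINT_FLAG_APPEND_ONLY)
print(min_blob_size(rgw_hints), min_blob_size(0))
```

Only the IMMUTABLE bit drives the choice here, matching the remark that it is the flag that really matters.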
I'm thinking about introducing the following behavior for
spinner-based stores (preliminary notes for now):
1) Unconditionally pin the allocator's granularity to 4K (block
size), i.e. untie it from the min_alloc_size setting.
This will probably result in additional overhead in the allocator
when one needs to allocate contiguous 64K blobs (see 3) below).
But it looks like the new avl allocator is great at dealing with
that.
2) Allocate space using 4K block size for small (<=32K or
<=48K or <60K?) objects tagged with the IMMUTABLE flag
3) Use existing min_alloc_size defaults (64K) for all(?) other
objects
4) Tag object with alloc size it's using.
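
A rough sketch of steps 2) to 4), just to make the selection rule concrete. All names here (ObjectMeta, pick_alloc_size, tag_object) are hypothetical, and the 32K threshold is only one of the candidates mentioned in 2):

```python
from dataclasses import dataclass

BLOCK_SIZE = 4 * 1024               # allocator granularity from step 1)
DEFAULT_MIN_ALLOC_SIZE = 64 * 1024  # existing spinner default from step 3)
SMALL_OBJECT_THRESHOLD = 32 * 1024  # assumed; 48K or <60K are also candidates


@dataclass
class ObjectMeta:
    size: int
    immutable: bool
    alloc_size: int = 0  # step 4): the alloc size this object is using


def pick_alloc_size(size: int, immutable: bool) -> int:
    """Step 2): small immutable objects get 4K units. Step 3): the rest
    keep the existing 64K default."""
    if immutable and size <= SMALL_OBJECT_THRESHOLD:
        return BLOCK_SIZE
    return DEFAULT_MIN_ALLOC_SIZE


def tag_object(size: int, immutable: bool) -> ObjectMeta:
    """Step 4): record the chosen alloc size on the object itself."""
    return ObjectMeta(size, immutable, pick_alloc_size(size, immutable))


small_rgw = tag_object(10 * 1024, immutable=True)
small_rbd = tag_object(10 * 1024, immutable=False)
print(small_rgw.alloc_size, small_rbd.alloc_size)
```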
The above is enough to deal with RGW space amplification for small
objects, which is the worst case for now IMO.
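
To put numbers on the amplification, here is a small arithmetic sketch. It only rounds the object size up to the allocation unit and ignores metadata, compression and EC overhead. The helper names are made up for illustration:

```python
def allocated_bytes(size: int, alloc_size: int) -> int:
    """Round size up to a whole number of allocation units."""
    return -(-size // alloc_size) * alloc_size


def amplification(size: int, alloc_size: int) -> float:
    """Allocated space divided by logical size (size must be > 0)."""
    return allocated_bytes(size, alloc_size) / size


KIB = 1024
for size in (4 * KIB, 20 * KIB, 100 * KIB):
    print(size,
          amplification(size, 64 * KIB),  # current spinner default
          amplification(size, 4 * KIB))   # proposed small-object unit
```

With these assumptions a 4K object occupies 64K under the current default (16x) and exactly 4K with a 4K unit (1x).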
Additional points to consider:
5) Add a per-object (more flexible and complex) hint or a per-store
setting to optimize onode keeping for size vs. for speed. Maybe
even split the latter into optimize for reads vs. optimize for
overwrites?
One can read 'per-pool' instead of 'per-store' above.
EC and/or other concerned entries to indicate
"OPTIMIZE_FOR_SPACE". Unconditionally or using some logic?
6) Apply the appropriate alloc size depending on the strategy
determined at 5)
7) An even more versatile approach would be to vary the alloc size at
the onode's blob level. Hence tag the blob rather than the object with
the applied alloc size (needs a single bit, in fact).
This may be useful to deal with space amplification for tail blobs
or sparse objects.
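
A sketch of what 7) could look like for tail blobs, under the assumption that full 64K blobs keep the large unit and only the tail is allocated in 4K units. split_into_blobs is a hypothetical helper and the boolean stands in for the single per-blob bit. Sparse objects are not modelled here:

```python
LARGE_ALLOC = 64 * 1024
SMALL_ALLOC = 4 * 1024


def split_into_blobs(size: int) -> list[tuple[int, bool]]:
    """Return (allocated_bytes, uses_small_alloc) for each blob of an object.

    Full 64K blobs use the large unit. The tail blob is rounded up to 4K
    units and carries the one-bit "small alloc" tag.
    """
    full, tail = divmod(size, LARGE_ALLOC)
    blobs = [(LARGE_ALLOC, False)] * full
    if tail:
        tail_alloc = -(-tail // SMALL_ALLOC) * SMALL_ALLOC
        blobs.append((tail_alloc, tail_alloc < LARGE_ALLOC))
    return blobs


# A 68K object: one full blob plus a 4K tail, instead of two 64K blobs.
blobs = split_into_blobs(68 * 1024)
print(blobs, sum(length for length, _ in blobs))
```

For the 68K example this allocates 68K in total rather than the 128K a uniform 64K unit would need.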
Thoughts?
Thanks,
Igor
sage
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx