Hi Casey,
A while back Igor refactored the code in bluestore to allow us to have
small min_alloc sizes on HDDs without a significant performance penalty
(this was really great work btw Igor!). The default now is a 4k
min_alloc_size on both NVMe and HDD:
https://github.com/ceph/ceph/blob/master/src/common/options/global.yaml.in#L4254-L4284
There was a bug causing part of this change to increase write
amplification dramatically on the DB/WAL device, but this has been
(mostly) fixed as of last week. Write amplification will likely still be
somewhat higher than in Nautilus (it's not clear yet how much of this is
due to more metadata vs unnecessary deferred write flushing/compaction),
but the space amplification benefit is very much worth it.
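To make the space amplification point concrete, here is a rough sketch of
the arithmetic, assuming that on-disk allocation simply rounds each object
up to a whole number of min_alloc_size units. It ignores metadata,
compression, replication and erasure coding, and the 64 KiB figure is only
an illustrative "large" allocation unit, not a measured result. The
function names are made up for this example.

```python
def allocated_bytes(object_size: int, min_alloc_size: int) -> int:
    """Round object_size up to a whole number of allocation units."""
    if object_size == 0:
        return 0
    units = -(-object_size // min_alloc_size)  # ceiling division
    return units * min_alloc_size


def space_amplification(object_size: int, min_alloc_size: int) -> float:
    """Ratio of allocated bytes to logical bytes for one object."""
    return allocated_bytes(object_size, min_alloc_size) / object_size


if __name__ == "__main__":
    one_kib = 1024
    for alloc in (4 * 1024, 64 * 1024):  # 4 KiB vs an illustrative 64 KiB
        print(alloc, allocated_bytes(one_kib, alloc),
              space_amplification(one_kib, alloc))
```

Under these assumptions a 1 KiB object occupies 4 KiB with a 4k
min_alloc_size and 64 KiB with a 64k one, which is why small allocation
units matter for small-object workloads.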
Mark
On 8/18/21 2:38 PM, Casey Bodley wrote:
in the rgw refactoring meetings, we've been discussing ways to improve
space utilization for workloads of mixed object sizes
i think it's worth bringing this up in Mark's performance call as well,
to explore other options from the osd/librados perspective
most of our discussion so far has centered around ways to use s3's
storage classes (which rgw maps to different rados pools) as a way to
direct object uploads to an appropriately-configured pool depending on
the object's size. for example, all objects under 1M would be assigned
to a SMALL storage class, while the rest go to LARGE. doing this
directly is tricky, because http requests don't always tell us the
full object size up front. this strategy could also lead to confusion
in s3 applications, because the storage class is a visible part of the
protocol and clients expect to have control over it
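a minimal sketch of the size-based routing described above, assuming the
1M cutoff means 1 MiB and that the only size hint is an optional
Content-Length. the function name and the None return for the unknown-size
case are made up for illustration; this is not rgw code

```python
from typing import Optional

SMALL_THRESHOLD = 1024 * 1024  # assumed reading of the "1M" cutoff


def choose_storage_class(content_length: Optional[int]) -> Optional[str]:
    """Pick SMALL or LARGE from the declared size, if there is one.

    Returns None when the request does not declare its full size up
    front, which is the case that makes direct routing tricky.
    """
    if content_length is None:
        return None
    return "SMALL" if content_length < SMALL_THRESHOLD else "LARGE"


if __name__ == "__main__":
    print(choose_storage_class(4096))             # small upload
    print(choose_storage_class(8 * 1024 * 1024))  # large upload
    print(choose_storage_class(None))             # size not known up front
```

the None branch is where this approach gets stuck: the pool has to be
chosen before the first write, but the size may only be known at the end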
you can read more about storage classes and rgw pool placement in
https://docs.ceph.com/en/latest/radosgw/placement/. essentially, each
bucket chooses a 'placement target' on creation, and that placement
target defines which storage classes are available for its object
uploads. each storage class defines the rados pool to use for the
object data. each placement target has a default storage class called
STANDARD which is used for object uploads that don't specify a storage
class. this STANDARD pool is also used to store all of the bucket's
head objects, regardless of their storage class. objects uploaded to
the STANDARD storage class store up to 4MB of data in the head object,
and the rest in tail objects of the same pool. objects uploaded to
other storage classes only store metadata in the head object - all of
their data goes in tail objects in their own pool
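the existing placement rules above can be modeled roughly like this. it's
only a sketch of the rules as described in this mail: the pool names, the
Layout type and place_object() are hypothetical, and striping of tail
objects is ignored

```python
from dataclasses import dataclass
from typing import Dict

HEAD_MAX_DATA = 4 * 1024 * 1024  # up to 4MB of data in a STANDARD head


@dataclass
class Layout:
    head_pool: str
    head_data: int  # bytes of object data stored in the head object
    tail_pool: str
    tail_data: int  # bytes of object data stored in tail objects


def place_object(size: int, storage_class: str,
                 pools: Dict[str, str]) -> Layout:
    """Model where an object's data lands under the existing scheme."""
    head_pool = pools["STANDARD"]  # all head objects live here
    if storage_class == "STANDARD":
        head_data = min(size, HEAD_MAX_DATA)
        tail_pool = head_pool  # tails share the head's pool
    else:
        head_data = 0  # head holds metadata only
        tail_pool = pools[storage_class]
    return Layout(head_pool, head_data, tail_pool, size - head_data)


if __name__ == "__main__":
    pools = {"STANDARD": "data.standard", "COLD": "data.cold"}  # made up
    print(place_object(6 * 1024 * 1024, "STANDARD", pools))
    print(place_object(6 * 1024 * 1024, "COLD", pools))
```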
in today's call, Yehuda made the observation that for this use case,
it would be ideal to put all head objects in a pool with small
min_alloc_size and all tails in larger-sized pools. this way, even
though we don't necessarily know the full object size up front, we'd
still place all small objects in the correctly-sized pool, with larger
objects spilling over into their own tail pools
this doesn't quite match up with our existing implementation though,
because we put the STANDARD storage class' tail objects in the same
pool as the head objects, and other storage classes only store data in
the tails
so i suggested an additional option to specify a 'head object pool' in
the placement target that's independent of its storage classes. when
specified, all head objects would be written to that pool instead,
along with a configurable amount of data. benefits of this strategy
would be that it preserves the storage class behavior that clients
expect, and enables an optional configuration for a space-optimized
head object pool
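a rough sketch of how the proposed option might behave, assuming a
placement target gains two optional settings. the field names head_pool
and head_data_limit, the pool names and the place() helper are all
hypothetical and only meant to illustrate the idea, not a design

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

STANDARD_HEAD_MAX = 4 * 1024 * 1024  # existing STANDARD head data limit


@dataclass
class PlacementTarget:
    storage_class_pools: Dict[str, str]  # storage class -> data pool
    head_pool: Optional[str] = None      # hypothetical new option
    head_data_limit: int = 0             # hypothetical: bytes kept in head


def place(target: PlacementTarget, storage_class: str,
          size: int) -> Tuple[str, int, str, int]:
    """Return (head_pool, head_bytes, tail_pool, tail_bytes)."""
    tail_pool = target.storage_class_pools[storage_class]
    if target.head_pool is not None:
        # proposed: every head goes to the head pool with some data
        head_pool = target.head_pool
        head_bytes = min(size, target.head_data_limit)
    else:
        # existing: heads in the STANDARD pool, data only for STANDARD
        head_pool = target.storage_class_pools["STANDARD"]
        head_bytes = (min(size, STANDARD_HEAD_MAX)
                      if storage_class == "STANDARD" else 0)
    return head_pool, head_bytes, tail_pool, size - head_bytes


if __name__ == "__main__":
    pools = {"STANDARD": "data.standard", "COLD": "data.cold"}
    target = PlacementTarget(pools, head_pool="head.small",
                             head_data_limit=64 * 1024)
    print(place(target, "STANDARD", 10 * 1024))  # fits in the head pool
    print(place(target, "COLD", 1024 * 1024))    # spills into COLD tails
```

in this model a small object lands entirely in the head pool whatever its
storage class, and the client-visible storage class still decides where
the tails go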
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx