On Wed, Aug 18, 2021 at 4:20 PM Mark Nelson <mnelson@xxxxxxxxxx> wrote:
>
> Hi Casey,
>
> A while back Igor refactored the code in bluestore to allow us to have
> small min_alloc sizes on HDDs without a significant performance penalty
> (this was really great work btw Igor!). The default now is a 4k
> min_alloc_size on both NVMe and HDD:
>
> https://github.com/ceph/ceph/blob/master/src/common/options/global.yaml.in#L4254-L4284
>
> There was a bug causing part of this change to increase write
> amplification dramatically on the DB/WAL device, but this has been
> (mostly) fixed as of last week. It will still likely be somewhat higher
> than in Nautilus (it's not yet clear how much of this is due to more
> metadata vs. unnecessary deferred write flushing/compaction), but the
> space amplification benefit is very much worth it.
>
> Mark

thanks Mark! sorry i didn't capture much of the background here. we've
been working on this with Anthony from Intel (cc'ed), who summarized it
this way:

* Coarse IU QLC SSDs are an appealing alternative to HDDs for Ceph,
  notably for RGW bucket data
* BlueStore’s min_alloc_size is best aligned to the IU for performance
  and endurance; today that means 16KB or 64KB depending on the drive
  model
* That means that small RGW objects can waste a significant amount of
  space, especially when EC is used
* Multiple bucket data pools with appropriate media can house small vs
  large objects via StorageClasses, but today this requires consistent
  user action, which is often infeasible

so the goal isn't to reduce the alloc size for small objects, but to
increase it for the large objects
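
for reference, the per-bucket StorageClass setup from Anthony's last
bullet is wired up with radosgw-admin along these lines today (the LARGE
class and the pool name below are just examples; the data pool would
live on the coarse-IU media):

  # zonegroup: advertise a LARGE storage class under the default
  # placement target
  radosgw-admin zonegroup placement add \
        --rgw-zonegroup default \
        --placement-id default-placement \
        --storage-class LARGE

  # zone: back it with a data pool created on the coarse-IU drives
  radosgw-admin zone placement add \
        --rgw-zone default \
        --placement-id default-placement \
        --storage-class LARGE \
        --data-pool default.rgw.large.data

the catch is that clients then have to request LARGE explicitly on each
upload (or rely on lifecycle transitions), which is exactly the
consistent user action that's hard to count on
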
> On 8/18/21 2:38 PM, Casey Bodley wrote:
> > in the rgw refactoring meetings, we've been discussing ways to
> > improve space utilization for workloads of mixed object sizes
> >
> > i think it's worth bringing this up in Mark's performance call as
> > well, to explore other options from the osd/librados perspective
> >
> > most of our discussion so far has centered around using s3's storage
> > classes (which rgw maps to different rados pools) as a way to direct
> > object uploads to an appropriately-configured pool depending on the
> > object's size. for example, all objects under 1M would be assigned
> > to a SMALL storage class, while the rest go to LARGE. doing this
> > directly is tricky, because http requests don't always tell us the
> > full object size up front. this strategy could also lead to
> > confusion in s3 applications, because the storage class is a visible
> > part of the protocol and clients expect to have control over it
> >
> > you can read more about storage classes and rgw pool placement in
> > https://docs.ceph.com/en/latest/radosgw/placement/. essentially,
> > each bucket chooses a 'placement target' on creation, and that
> > placement target defines which storage classes are available for its
> > object uploads. each storage class defines the rados pool to use for
> > the object data. each placement target has a default storage class
> > called STANDARD, which is used for object uploads that don't specify
> > a storage class. this STANDARD pool is also used to store all of the
> > bucket's head objects, regardless of their storage class. objects
> > uploaded to the STANDARD storage class store up to 4MB of data in
> > the head object, and the rest in tail objects of the same pool.
> > objects uploaded to other storage classes only store metadata in the
> > head object - all of their data goes in tail objects in their own
> > pool
> >
> > in today's call, Yehuda made the observation that for this use case,
> > it would be ideal to put all head objects in a pool with a small
> > min_alloc_size and all tails in larger-sized pools. this way, even
> > though we don't necessarily know the full object size up front, we'd
> > still place all small objects in the correctly-sized pool, with
> > larger objects spilling over into their own tail pools
> >
> > this doesn't quite match up with our existing implementation though,
> > because we put the STANDARD storage class' tail objects in the same
> > pool as the head objects, and other storage classes only store data
> > in the tails
> >
> > so i suggested an additional option to specify a 'head object pool'
> > in the placement target that's independent of its storage classes.
> > when specified, all head objects would be written to that pool
> > instead, along with a configurable amount of data. the benefits of
> > this strategy would be that it preserves the storage class behavior
> > that clients expect, and enables an optional configuration for a
> > space-optimized head object pool
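
to make that a bit more concrete, here's a rough sketch of how such an
option might surface in a zone's placement target config. the head_pool
and max_head_size fields are hypothetical, just to illustrate the shape
of the idea; the rest follows the existing placement_pools layout:

  {
      "key": "default-placement",
      "val": {
          "index_pool": "default.rgw.buckets.index",
          "head_pool": "default.rgw.buckets.head",
          "max_head_size": 65536,
          "storage_classes": {
              "STANDARD": {
                  "data_pool": "default.rgw.buckets.data"
              },
              "LARGE": {
                  "data_pool": "default.rgw.large.data"
              }
          },
          "data_extra_pool": "default.rgw.buckets.non-ec",
          "index_type": 0
      }
  }

the idea being that every head object, plus the first max_head_size
bytes of data, lands in head_pool on small-min_alloc_size media, while
each storage class keeps its own data_pool on the coarse-IU drives, and
clients see no change in storage class behavior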