Re: rgw: matching small objects to pools with small min_alloc_size

Hi all,

some rough thoughts ...

- not from a code perspective, but from usability:

Distributing objects across different pools based on their (perceived) size in an opaque way may break data placement decisions, or at least make them more complicated to configure properly in medium-sized clusters. It may also complicate load distribution (and its assessment and design), because the distribution scheme is unknown. While I could imagine that somebody could easily live with placing head objects into a special pool that can be designed specifically for them, distribution based on object size might also complicate the configuration of object tiering and/or the placement of objects of different purposes at designated costs.
While respecting the different capabilities of devices and leveraging knowledge about them seems viable, I would rather see the duty of handling them properly placed in the OSD layer.

- for the code perspective and handling:

Furthermore, we would then need to start distinguishing not only between HDD and flash in general, but also between different kinds of flash. In this regard, how would HDD-backed pools be used in such a scheme of determining data placement inside RGW? Is an HDD-based pool, with its nowadays 4K min_alloc_size, then also better suited for small objects? I see this complicating future development with little payoff over time, since there is no good way to predict which use cases will hit a given hardware configuration.
Also, I see a need for additional reporting capabilities to properly reflect the utilization of such specialized pools (and of the sets of hardware behind them in the CRUSH map), so that needed capacity upgrades can be alerted on. We already have no good differentiation for the existing device classes and CRUSH map structure; by introducing these specialized pools we only get more complex here.
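
To make the reporting idea concrete, here is a minimal sketch of what such an alert could look like (the pool name, the 85% threshold, and the exact JSON fields of "ceph df --format json" are assumptions to verify against the release in use):

    #!/usr/bin/env python3
    # Rough sketch: warn when designated "specialized" pools (e.g. a
    # QLC-backed bucket data pool) cross a utilization threshold.
    # Pool names and JSON field names are assumptions, not a reference.
    import json
    import subprocess

    SPECIALIZED_POOLS = {"default.rgw.buckets.data.qlc"}   # hypothetical pool
    THRESHOLD = 0.85                                        # example threshold

    def pool_utilization():
        out = subprocess.check_output(["ceph", "df", "--format", "json"])
        for pool in json.loads(out).get("pools", []):
            yield pool.get("name"), pool.get("stats", {}).get("percent_used", 0.0)

    for name, used in pool_utilization():
        if name in SPECIALIZED_POOLS and used >= THRESHOLD:
            print(f"WARNING: pool {name} at {used:.0%}, plan a capacity upgrade")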

- for the handling of the devices from an OSD perspective:

Introducing special handling for special devices might be tricky for future device types and would require some kind of automation, at least to detect them. It introduces devices that are special not only in terms of performance but also in the handling they require; the question is how to handle future hardware that is specialized for something else. Do we want to rely on the availability of such devices in the future, or wouldn't it be better to have a generic way of adapting to them?
Manual tuning of OSDs towards such special hardware features might be an acceptable way to capitalize on them: it acknowledges that these different behaviors exist and can be respected, but wouldn't necessarily require special handling in code tied to this particular hardware.
Introducing manual tuning of min_alloc_size per OSD at configuration time could help avoid such specialized handling, but it would also put the burden of setting it appropriately on the admin who introduces the special device for a reason. Setting min_alloc_size to something unusual would also change the performance requirements for RocksDB/WAL depending on the configuration chosen; with manual tuning this remains an admin decision, whereas automation would need extra code to detect the device, which requires additional maintenance. If tied to the device type, I also see a lot of special tuning and investigation needed purely to support those devices. Although manual tuning opens up its own set of performance-tuning challenges, it allows us to base settings on best practices (to be documented) and to tweak them without immediate code changes when they don't match well.
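
Just to illustrate what such manual tuning could look like in practice, here is a rough sketch using the existing bluestore_min_alloc_size option per CRUSH device class (the class names and values are made-up examples, and note that min_alloc_size only takes effect when an OSD is created, so this has to happen before deploying the OSDs):

    #!/usr/bin/env python3
    # Rough sketch: centrally set bluestore_min_alloc_size for OSDs of a
    # given CRUSH device class before they are (re)deployed. The value is
    # baked in at OSD mkfs time; changing it later does not affect
    # existing OSDs. Class names and sizes below are illustrative only.
    import subprocess

    MIN_ALLOC_BY_CLASS = {
        "qlc64": "65536",   # hypothetical class for drives with a 64K IU
        "qlc16": "16384",   # hypothetical class for drives with a 16K IU
    }

    for device_class, value in MIN_ALLOC_BY_CLASS.items():
        # "osd/class:<name>" is the config mask that applies a setting to
        # all OSDs of that device class.
        subprocess.check_call([
            "ceph", "config", "set", f"osd/class:{device_class}",
            "bluestore_min_alloc_size", value,
        ])
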
We changed the allocation behavior for HDDs in favor of better capacity utilization on the devices. In mixed environments, however, this changes the usability of HDDs for fully concurrent mid-sized IOs, which is not always a good match, especially when upgrading existing clusters. Manual tuning could mitigate this as well, for any device with special data-handling capabilities such as the new drives discussed here.

Thanks,
-matt
On 19.08.21 09:48, Mark Nelson wrote:

On 8/18/21 6:50 PM, Curt Bruns wrote:
Hi Mark / all,

I'm working with Anthony on this QLC optimization idea at Intel and plan on doing the testing/dev work, if needed. Thanks for playing devil's advocate - we're brainstorming all ideas right now.

The problem as I understand it is having the data pool on QLC drives with a larger indirection unit (e.g. 64k): when a user PUTs a lot of small objects (<< 64k) into this pool, it leads to a lot of wasted space.  Ideally we would know an object's size on upload so we could steer objects >= IU to the QLC pool and smaller objects to a TLC pool.  But since we don't know the object size early enough, the suggestion is to put HEAD objects into a TLC-backed pool, which can handle the smaller objects, and to put tails, which imply a larger object, into the QLC-backed pool.  Ideally this would not require the end S3 user to configure anything.
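
For context, the per-object configuration this is trying to avoid looks roughly like this on the client side. This is only a sketch: it assumes a storage class named SMALL_OBJECTS has already been defined in the zone placement and mapped to a TLC-backed data pool, and the 64 KiB cutoff is just an example mirroring the drive's indirection unit:

    #!/usr/bin/env python3
    # Rough sketch of what clients would have to do today: pick a storage
    # class per PUT based on object size. "SMALL_OBJECTS" is a hypothetical
    # RGW storage class assumed to map to a TLC-backed data pool.
    import boto3

    IU_BYTES = 64 * 1024
    s3 = boto3.client("s3", endpoint_url="http://rgw.example.com:8080")

    def put_object(bucket, key, body):
        extra = {}
        if len(body) < IU_BYTES:
            extra["StorageClass"] = "SMALL_OBJECTS"   # steer to the TLC pool
        s3.put_object(Bucket=bucket, Key=key, Body=body, **extra)

    put_object("mybucket", "tiny.txt", b"hello")           # small -> TLC pool
    put_object("mybucket", "big.bin", b"\0" * (4 << 20))   # large -> QLC pool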


Yep, with larger min_alloc sizes you'll be increasing space-amp for small objects.  Definitely a cause for concern!  So regarding the idea:  We've been getting requests lately to increase the overall pool limits in ceph.  Given that these devices are designed to store vast quantities of data, I don't think we should take the 2x pool multiplier requirement lightly.  It seems like the people that want to use these huge QLC drives may be the same people that want lots of pools.  Generally speaking I'm hoping we can actually reduce the number of pools RGW needs rather than increasing it.  I'm also a little concerned that splitting data across pools at the RGW layer is introducing a lot of complexity higher up that could be handled down closer to the disk. Sometimes it's inevitable that details like device indirection unit get exposed higher up the stack, but in this case I'm not sure we really need to.
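
To put a rough number on that space amplification, here is a quick back-of-the-envelope sketch (the replication and EC profiles are example values, and the model only accounts for rounding each chunk up to min_alloc_size; it ignores compression, metadata, and real EC striping details):

    #!/usr/bin/env python3
    # Rough model: space amplification of a small object when every
    # allocation is rounded up to min_alloc_size. EC chunking is simplified
    # to "split the object across k data chunks, add m coding chunks".
    import math

    def rounded(size, min_alloc):
        # bytes actually allocated for `size` bytes of data
        return math.ceil(size / min_alloc) * min_alloc

    def amp_replicated(obj_size, min_alloc):
        # replication multiplies stored and ideal bytes equally, so it
        # cancels out of the amplification factor
        return rounded(obj_size, min_alloc) / obj_size

    def amp_ec(obj_size, min_alloc, k=4, m=2):
        chunk = math.ceil(obj_size / k)
        return ((k + m) * rounded(chunk, min_alloc)) / ((k + m) * chunk)

    for min_alloc in (4096, 16384, 65536):
        print(f"min_alloc={min_alloc // 1024:>2}K: 4K object, "
              f"3x repl -> {amp_replicated(4096, min_alloc):.0f}x, "
              f"EC 4+2 -> {amp_ec(4096, min_alloc):.0f}x")

In this simplistic model a 4K object on a 64K min_alloc_size pool costs roughly 16x its ideal footprint replicated and about 64x under EC 4+2, which is the wasted-space problem described above.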


Your devil's advocate idea is along those lines, I believe: placing the small objects into cache and the larger objects onto the backing storage device? I don't know the exact implementation of Open-CAS, but caching to me implies temporary storage, and if it's the small objects getting cached, they'll eventually get evicted/flushed out to the QLC pool, which then causes our small-object problem of wasting QLC space?


One of the questions I had for the opencas guys a while back was whether or not you could provide hints to pin specific data to the fast device (say rocksdb L0, if we just wanted to have opencas handle the entire device instead of having bluefs use separate partitions for DB/WAL).  I believe Intel said this is possible, potentially by tagging a specific region of the block device as a pinned area.  Maybe one of the opencas folks from Intel can chime in here.  So potentially the way this could work is that the OSD looks at the size of the object, or gets a hint from RGW or something; bluestore sees that and, via bluefs, writes the object to specific regions of the opencas-backed storage so that it stays pinned on optane or pdimm just like the associated rocksdb metadata.  Alternatively, we could teach bluefs itself to write this kind of data directly to the fast device, and dm-cache/opencas would not strictly be necessary.

One of the things I like about this approach is that bluefs already has to think about how to manage space between fast and slow devices (i.e. what to do when there's not enough space on the fast device to store SST files).  It seems to me that the problem of small objects on QLC is very similar.  There may be cases where you want to prioritize pinning small objects on the fast device over seldom-accessed SST files (say for L2+).  We can prioritize that better by handling it at the bluefs layer, rather than having bluefs and rgw use different approaches and fight over who gets to put data on the fast devices.
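
Purely to illustrate the shape of that idea - none of this is an existing BlueStore/BlueFS interface, and every name below is hypothetical - the decision described above boils down to a small routing policy keyed on a size hint:

    #!/usr/bin/env python3
    # Hypothetical sketch of the routing policy described above: per write,
    # decide whether data should stay pinned on the fast device (alongside
    # rocksdb metadata) or go to the bulk QLC device. There is no such
    # BlueStore/BlueFS API today; this is pseudo-policy only.
    from dataclasses import dataclass

    @dataclass
    class FastDevice:
        capacity: int      # bytes reserved on the fast device
        used: int = 0

        def try_reserve(self, nbytes):
            if self.used + nbytes <= self.capacity:
                self.used += nbytes
                return True
            return False

    def choose_target(size_hint, iu_bytes, fast, cold_sst_bytes):
        """Return 'fast' or 'slow' for a write of size_hint bytes."""
        if size_hint >= iu_bytes:
            return "slow"                  # large object: QLC space-amp is fine
        if fast.try_reserve(size_hint):
            return "fast"                  # small object pinned on fast device
        if cold_sst_bytes >= size_hint:
            # this is where seldom-accessed L2+ SST files could be demoted
            # to make room for small objects (the prioritization knob)
            return "fast"
        return "slow"                      # last resort: accept the space-amp

    fast = FastDevice(capacity=8 << 30)
    print(choose_target(16 * 1024, 64 * 1024, fast, cold_sst_bytes=0))  # fast
    print(choose_target(4 << 20, 64 * 1024, fast, cold_sst_bytes=0))    # slow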



There was also some talk in the meeting about using the Lua scripting framework to steer objects without any user intervention - that's something interesting too that I'll look at.

Any other suggestions or pointers are appreciated!

Thanks,
- Curt

On Wed, Aug 18, 2021 at 2:10 PM Mark Nelson <mnelson@xxxxxxxxxx> wrote:


    On 8/18/21 3:52 PM, Casey Bodley wrote:
    > On Wed, Aug 18, 2021 at 4:20 PM Mark Nelson <mnelson@xxxxxxxxxx> wrote:
    >> Hi Casey,
    >>
    >>
    >> A while back Igor refactored the code in bluestore to allow us to
    >> have small min_alloc sizes on HDDs without a significant performance
    >> penalty (this was really great work btw Igor!).  The default now is
    >> a 4k min_alloc_size on both NVMe and HDD:
    >>
    >>
    >> https://github.com/ceph/ceph/blob/master/src/common/options/global.yaml.in#L4254-L4284
    >>
    >>
    >> There was a bug causing part of this change to increase write
    >> amplification dramatically on the DB/WAL device, but this has been
    >> (mostly) fixed as of last week.  It will still likely be somewhat
    >> higher than in Nautilus (not clear yet how much this is due to more
    >> metadata vs unnecessary deferred write flushing/compaction), but the
    >> space amplification benefit is very much worth it.
    >>
    >>
    >> Mark
    > thanks Mark! sorry i didn't capture much of the background here.
    > we've been working on this with Anthony from Intel (cc'ed), who
    > summarized it this way:
    >
    > * Coarse IU QLC SSDs are an appealing alternative to HDDs for Ceph,
    > notably for RGW bucket data
    > * BlueStore’s min_alloc_size is best aligned to the IU for performance
    > and endurance; today that means 16KB or 64KB depending on the drive
    > model
    > * That means that small RGW objects can waste a significant amount of
    > space, especially when EC is used
    > * Multiple bucket data pools with appropriate media can house small vs
    > large objects via StorageClasses, but today this requires consistent
    > user action, which is often infeasible.
    >
    > so the goal isn't to reduce the alloc size for small objects, but to
    > increase it for the large objects


    Ah!  That makes sense.  So to play the devil's advocate: If you have
    some combination of bulk QLC and a smaller amount of fast, high
    endurance storage for WAL/DB, could something like dm-cache or opencas
    (or, if necessary, modifications to bluefs) potentially serve the same
    purpose without doubling the number of pools required?




-- 
——————————————————
Matthias Muench
Principal Specialist Solution Architect
EMEA Storage Specialist
matthias.muench@xxxxxxxxxx
Phone: +49-160-92654111

Red Hat GmbH
Werner-von-Siemens-Ring 14
85630 Grasbrunn
Germany
_______________________________________________________________________
Red Hat GmbH, http://www.de.redhat.com · Registered seat: Grasbrunn, Commercial register: Amtsgericht Muenchen
HRB 153243 · Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx
