Re: [ceph-users] jj's "improved" ceph balancer

Hi!
How would you balance the workload? We could distribute PGs independently of the OSD sizes, assuming that an HDD can handle a low and constant number of IOPS, say 250, no matter how big it is. If we distribute PGs just by predicted device IOPS, we would optimize better for the workload.

My balancer (and the mgr balancer) calculates the ideal PGs-per-size ratio for a pool and multiplies it by the device size.
This PGs-per-size value is the main criterion for movements in the mgr balancer; for the jj-balancer it's just another constraint that has to be fulfilled.

PGs-per-IOPS and PGs-per-size are probably more or less the same for NVMe devices, but definitely not for HDDs, so different optimization strategies would be needed.
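
A rough back-of-the-envelope sketch of how differently the two weightings spread the same pool (hypothetical sizes and a flat 250 IOPS per HDD, purely for illustration):

    # Hypothetical: three HDDs of different sizes, one pool with 256 PGs.
    osd_size_tb = {"osd.0": 4, "osd.1": 8, "osd.2": 16}        # capacity varies 4x
    osd_iops    = {"osd.0": 250, "osd.1": 250, "osd.2": 250}   # IOPS roughly constant

    def pg_targets(weights, pool_pgs=256):
        # distribute the pool's PGs proportionally to the given per-OSD weight
        total = sum(weights.values())
        return {osd: round(pool_pgs * w / total, 1) for osd, w in weights.items()}

    print("by size:", pg_targets(osd_size_tb))  # osd.2 gets 4x the PGs of osd.0
    print("by iops:", pg_targets(osd_iops))     # every OSD gets the same PG count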

-- Jonas


On 20/10/2021 20.44, Josh Salomon wrote:
Just another point of view:
The current balancer balances capacity, but this is not enough. The balancer should also balance the workload, and we plan on adding primary balancing for Quincy. In order to balance the workload you should work pool by pool, because pools have different workloads. So while the observation about the +1 PGs is correct, I believe the correct solution should take this into consideration while still balancing capacity pool by pool. Capacity balancing is a functional requirement, while workload balancing is a performance requirement, so it is important only for very loaded systems (loaded in terms of high IOPS, not nearly-full systems).

I would appreciate comments on this thought.

On Wed, 20 Oct 2021, 20:57 Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:

    Hi Jonas,

     From your readme:

    "the best possible solution is some OSDs having an offset of 1 PG to the ideal count. As a
    PG-distribution-optimization is done per pool, without checking other pool's distribution at all, some devices will
    be the +1 more often than others. At worst one OSD is the +1 for each pool in the cluster."

    That's an interesting observation/flaw which hadn't occurred to me before. I think we don't ever see it in practice
    in our clusters because we do not have multiple large pools on the same osds.
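
    A quick back-of-the-envelope sketch of how those +1s could stack up in such a setup (made-up numbers, not from a real cluster):

        # 8 pools with 32 PGs each on 24 equal-sized OSDs (replication ignored for simplicity).
        osds, pools, pgs_per_pool = 24, 8, 32
        base, extra = divmod(pgs_per_pool, osds)   # base=1, extra=8: eight OSDs are "+1" per pool
        print(f"per pool: {extra} OSDs hold {base + 1} PGs, {osds - extra} hold {base}")
        # Each pool is balanced on its own, so nothing stops the same OSDs from
        # being the "+1" OSDs in every pool:
        print("always +1:", pools * (base + 1), "PGs    never +1:", pools * base, "PGs")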

    How large are the variances in your real clusters? I hope the example in your readme isn't from real life??

    Cheers, Dan

    On Wed, 20 Oct 2021, 15:11 Jonas Jelten <jelten@xxxxxxxxx> wrote:

        Hi!

        I've been working on this for quite some time now and I think it's ready for some broader testing and feedback.

        https://github.com/TheJJ/ceph-balancer

        It's an alternative standalone balancer implementation, optimizing for equal OSD storage utilization and PG
        placement across all pools.

        It doesn't change your cluster in any way; it just prints the commands you can run to apply the PG movements.
        Please play around with it :)

        Quickstart example: generate 10 PG movements on hdd to stdout

             ./placementoptimizer.py -v balance --max-pg-moves 10 --only-crushclass hdd | tee /tmp/balance-upmaps

        When there are remapped PGs (e.g. after applying the above upmaps), you can inspect progress with:

             ./placementoptimizer.py showremapped
             ./placementoptimizer.py showremapped --by-osd

        And you can get a nice Pool and OSD usage overview:

             ./placementoptimizer.py show --osds --per-pool-count --sort-utilization


        Of course there are many more features and optimizations to be added,
        but it has already served us very well in reclaiming terabytes of previously unavailable storage
        where the `mgr balancer` could no longer optimize.

        What do you think?

        Cheers
           -- Jonas
        _______________________________________________
        ceph-users mailing list -- ceph-users@xxxxxxx
        To unsubscribe send an email to ceph-users-leave@xxxxxxx

    _______________________________________________
    Dev mailing list -- dev@xxxxxxx
    To unsubscribe send an email to dev-leave@xxxxxxx


_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx
