> Based on our observation of the impact of the balancer on the
> performance of the entire cluster, we have drawn conclusions that we
> would like to discuss with you.
>
> - A newly created pool should be balanced before being handed over
> to the user. This, I believe, is quite evident.

I think this question contains a lot of hidden assumptions, so it is
hard to give a single correct answer. Using rgw means you get some 7,
10 or 13 different pools, depending on whether you use swift, s3, or
both at the same time. In that case only one or a few of those pools
need care before doing bulk work; the rest are quite fine being very
small and .. "unbalanced".

> - When replacing a disk, it is advisable to exchange it directly
> for a new one. As soon as the OSD replacement occurs, the balancer
> should be invoked to realign any improperly placed PGs during the
> disk outage and disk recovery.

Not that I think the default behaviours are optimal in any way, but
the above text seems to describe what actually does happen. Even if
the balancer is not involved, the normal CRUSH "repair" of an
imbalanced cluster will even the data out once the new OSD is in
place.

> Perhaps an even better method is to pause recovery and backfilling
> before removing the disk, remove the disk itself, promptly add a new
> one, and then resume recovery and backfilling. It's essential to
> perform all of this as quickly as possible (using a script).

Here I would just say: set norebalance (and noout if you must stop
the whole OSD host) before removing the old OSD and adding the new
one. Then, once the new OSD is created and started, unset those flags
and let the cluster repair back onto the newly added OSD. A sketch of
that flag dance is at the end of this mail.

> Ad. We are using a community balancer developed by Jonas Jelten
> because the built-in one does not meet our requirements.

We sometimes use the python or go upmap remapper scripts/programs to
have the cluster be less sad while moving a small number of PGs at a
time, but that is more or less just for convenience, and to let
scrubs keep running on the non-moving PGs when the data movements are
expected to take a long calendar time. The second sketch below shows
the primitive those tools are built on.
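
For the disk replacement flow above, a minimal sketch of the flag
dance could look like this. The OSD id 17 and device /dev/sdX are
made-up examples, and whether you also want nobackfill/norecover
depends on how long the swap takes:

  # quiesce data movement before touching anything
  ceph osd set norebalance
  ceph osd set nobackfill    # optional, also pauses backfill
  ceph osd set norecover     # optional, also pauses recovery
  ceph osd set noout         # only if the whole OSD host goes down

  # swap the disk; osd id 17 and /dev/sdX are examples only
  ceph osd destroy 17 --yes-i-really-mean-it
  ceph-volume lvm create --osd-id 17 --data /dev/sdX

  # new OSD is up and in: let the cluster repair onto it
  ceph osd unset noout
  ceph osd unset norecover
  ceph osd unset nobackfill
  ceph osd unset norebalance

Reusing the same OSD id means the CRUSH placement barely changes, so
most of the traffic is plain recovery onto the new disk rather than a
cluster-wide reshuffle.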
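
And for completeness: the upmap remapper tools mentioned above are,
roughly speaking, loops around the pg-upmap-items commands (which
need require-min-compat-client at luminous or newer). A hand-rolled
illustration of the idea, where the pg id 1.2f and osd ids 4 and 9
are invented:

  # CRUSH wants this PG on osd.9, but the data currently sits on
  # osd.4; pinning 9 -> 4 makes the mapping match reality, so no
  # backfill starts for this PG
  ceph osd pg-upmap-items 1.2f 9 4

  # later, release the exceptions a few PGs at a time and let them
  # backfill calmly
  ceph osd rm-pg-upmap-items 1.2f

Removing the exceptions in small batches is what keeps most PGs
active+clean, so scrubs can carry on there while only a handful of
PGs move at any one time.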