Re: What's the best way to add numerous OSDs?

Since they’re 20TB, I’m going to assume that these are HDDs.

There are a number of approaches.  One common theme is to avoid rebalancing until all of the new OSDs have been added to the cluster and are up / in; otherwise you can end up with a storm of map updates and superfluous rebalancing.


One strategy is to set osd_crush_initial_weight = 0 temporarily, so that the OSDs won’t take any data when added.  Then, when you’re ready, you can raise their CRUSH weights to where they otherwise would be and unset osd_crush_initial_weight so you don’t wonder what the heck is going on six months down the road.
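For concreteness, a minimal sketch of that flow; the OSD ID and weight are illustrative (a 20TB drive comes out to roughly 18.19 when CRUSH weights are expressed in TiB):

  # Keep new OSDs empty when created
  ceph config set osd osd_crush_initial_weight 0

  # ... create all the new OSDs ...

  # When ready, bring each one up to its natural weight
  ceph osd crush reweight osd.540 18.19

  # Clean up so future additions behave normally
  ceph config rm osd osd_crush_initial_weight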

Another strategy is to add a staging CRUSH root.  If the new OSDs are all on new hosts, you can create CRUSH host buckets for them in advance, so that when you create the OSDs they land there and, again, won’t immediately take data.  Then you can move the host buckets into the production root in quick succession.
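Roughly, with made-up bucket names, and assuming your production tree lives under the usual `default` root:

  # Create a staging root and a host bucket outside the production tree
  ceph osd crush add-bucket staging root
  ceph osd crush add-bucket newhost1 host
  ceph osd crush move newhost1 root=staging

  # ... create the OSDs on newhost1; they land under staging ...

  # Once everything is up / in, move the hosts into production back to back
  ceph osd crush move newhost1 root=default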

Either way, if you do want to add them to the cluster all at once, with HDDs you’ll want to limit the rate of backfill so you don’t DoS your clients.  One strategy is to leverage pg-upmap with a tool like https://gitlab.cern.ch/ceph/ceph-scripts/blob/master/tools/upmap/upmap-remapped.py
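The usual shape of that workflow, as I understand it (the values here are just a conservative starting point):

  # Keep PGs from moving while the new OSDs come in
  ceph osd set norebalance

  # Throttle backfill for HDDs
  ceph config set osd osd_max_backfills 1

  # ... add the OSDs ...

  # Pin every remapped PG back to its current location with upmaps
  ./upmap-remapped.py | sh

  # Allow movement again; the balancer then removes the upmaps
  # incrementally, so backfill proceeds at a controlled pace
  ceph osd unset norebalance

I’d review the script’s output before piping it to sh.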

Note that to use pg-upmap safely, you will need to ensure that your clients are all at Luminous or later; in the case of CephFS kernel clients, I *think* that means kernel 4.13 or later.  `ceph features` should give you that information.
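Something like:

  # Summarize connected client releases; look for anything pre-luminous
  ceph features

  # Required before the cluster will accept pg-upmap entries
  ceph osd set-require-min-compat-client luminous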

An older method of spreading out the backfill thundering herd was to use a for loop to weight the OSDs up in increments of, say, 0.1, letting the cluster settle between rounds.  That strategy results in at least some data moving twice, so it’s less efficient.  Similarly, you might add, say, one OSD per host at a time and let the cluster settle between iterations, which would also be less than ideally efficient.
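For reference, that older approach is just a shell loop along these lines; the OSD IDs, target weight, and step size are all made up:

  target=18.19
  for w in $(seq 0.1 0.1 "$target") "$target"; do
      for id in $(seq 540 719); do
          ceph osd crush reweight "osd.$id" "$w"
      done
      # crude settle check: wait until nothing is backfilling
      while ceph pg stat | grep -q backfill; do sleep 60; done
  done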

— aad

> On Aug 6, 2024, at 11:08 AM, Fabien Sirjean <fsirjean@xxxxxxxxxxxx> wrote:
> 
> Hello everyone,
> 
> We need to add 180 20TB OSDs to our Ceph cluster, which currently consists of 540 OSDs of identical size (replicated size 3).
> 
> I'm not sure, though: is it a good idea to add all the OSDs at once? Or is it better to add them gradually?
> 
> The idea is to minimize the impact of rebalancing on the performance of CephFS, which is used in production.
> 
> Thanks in advance for your opinions and feedback 🙂
> 
> Wishing you a great summer,
> 
> Fabien
