Hi,
It would be nice if we could just copy the content to the new drive
and go from there.
that's exactly what we usually do: we add a new drive and 'pvmove' the
contents of the failing drive onto it. The worst thing so far is that
the orchestrator still thinks it's /dev/sd{previous_letter}, but I
don't care too much about that. Note that we don't use the
--all-available-devices option, so we keep full control over OSD
deployment. But you could pause the orchestrator to prevent it from
building a new OSD on that drive until the pvmove has finished.
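Roughly, the LVM side of that looks like this (device names and the
ceph VG name are placeholders for your environment):

  ceph orch pause                      # optional: keep the orchestrator quiet meanwhile
  pvcreate /dev/sdNEW
  vgextend ceph-<vg-uuid> /dev/sdNEW   # the VG backing the OSD on the failing drive
  pvmove /dev/sdOLD /dev/sdNEW         # copy the extents off the failing drive
  vgreduce ceph-<vg-uuid> /dev/sdOLD
  pvremove /dev/sdOLD
  ceph orch resume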
Regards,
Eugen
Quoting Anthony D'Atri <anthony.datri@xxxxxxxxx>:
1. Pulled failed drive ( after troubleshooting of course )
2. Cephadm GUI - find OSD, purge OSD
3. Wait for rebalance
4. Insert new drive ( let cluster rebalance after it automatically adds
the drive as an OSD ) ( yes, we have auto-add on in the clusters )
I imagine with an existing good drive, we would use delete instead of purge,
but the process would be similar, except the drive swap would happen
after the data was moved.
You don’t have to wait for the rebalance / backfill / recovery, at
least if you do one drive (or failure domain) at a time.
In fact you can be more efficient by not waiting, as deploying the
new OSD will short-circuit some of that data movement from the
deletion.
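For reference, the CLI counterparts of those GUI actions, as I
understand them (<osd-id> is a placeholder):

  ceph osd purge <osd-id> --yes-i-really-mean-it     # remove the OSD, its CRUSH entry and auth key
  ceph osd destroy <osd-id> --yes-i-really-mean-it   # mark it destroyed but keep the id for reuse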
Would the replace flag ( or keep OSD option in the GUI ) allow us to
avoid the initial rebalance by backfilling the new drive
with the old drive's content?
Only if you set the new drive’s CRUSH weight artificially low, to
match the old drive’s weight exactly. But when you weight it up
fully, data will move anyway.
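Something like this, with illustrative weights (CRUSH weights are in
TiB by convention):

  ceph osd crush reweight osd.<id> 3.64    # pin the new OSD to the old 4TB drive's weight
  ceph osd crush reweight osd.<id> 18.19   # later, weight it up to the 20TB drive's full size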
It would be nice if we could just copy the content to the new drive
and go from there.
I get your drift, but there’s a nuance. Because of how CRUSH works,
the data that the 20TB OSD will eventually hold will not be a proper
superset of what’s on the 4TB OSD today. Data will also shuffle on
other OSDs.
Be careful that you only delete / destroy OSDs in a single failure
domain at a time, and wait for full recovery before proceeding to
the next failure domain. If you are short on capacity, you might
want to do a small number of drives in one failure domain, wait for
recovery, then move to the next failure domain, as you will only
realize additional cluster capacity once you’ve added CRUSH weight
to at least 3 failure domains.
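A quick sanity check before each batch, if it helps:

  ceph osd tree   # confirm which failure domain (host / rack) each OSD you're touching lives in
  ceph -s         # and wait for all PGs to be active+clean before moving to the next domain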
We would like to avoid lots of read/write cluster recovery activity
if possible since we could be replacing
40+ drives in each cluster.
Part of the false economy of HDDs :-/. But again, attend to your
failure domains. If you are only replacing *some* of the smaller
drives, spread that across failure domains or you won’t gain any
actual capacity. And be prepared for the larger drives to get a
proportionally larger fraction of workload, which can be somewhat
mitigated with primary affinity but that’s a bit of an advanced topic.
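If you do experiment with that, it's just something along the lines
of (OSD id and value illustrative):

  ceph osd primary-affinity osd.<id> 0.5   # fewer PGs elect this OSD as primary, shifting reads away

but measure before committing to values.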
Related advice: as you add OSDs that are 5x the size of existing
OSDs, you run the risk of hitting mon_max_pg_per_osd on the larger
OSDs. This defaults to 250. I suggest setting it to 1000 before
starting this project, to avoid larger OSDs that won’t activate.
https://github.com/ceph/ceph/pull/60492
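e.g.:

  ceph config set global mon_max_pg_per_osd 1000

and consider dialing it back once the replacements and the PG
redistribution have settled.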
Also, you might temporarily disable the balancer and use pgremapper or
https://gitlab.cern.ch/ceph/ceph-scripts/blob/master/tools/upmap/upmap-remapped.py
to minimize extra backfill and control its rate. You would use the tool
to effectively freeze all PG mappings, then destroy / redeploy as many
OSDs as you like *within a single failure domain*, and gradually remove
the manual upmaps at a rate informed by how much backfill your spinners
can handle. There are lots of articles and list posts about this
strategy. This lets you leapfrog the transient churn as multiple OSDs
are removed / added, and control the thundering herd of recovery that
can DoS spinners.
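A rough sketch of that sequence, hedged since the exact invocation
depends on the version of the script you pick up:

  ceph balancer off
  ./upmap-remapped.py | sh     # prints pg-upmap-items commands that pin PGs to their current OSDs
  # destroy / redeploy the OSDs within one failure domain, then
  ceph balancer on             # let the balancer peel the pins off at a controlled pace

If memory serves, the balancer's target_max_misplaced_ratio setting
governs how fast those pins come off.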
If you’re using the pg autoscaler, I might ensure that all affected
pools have the ‘bulk’ flag set in advance, so that you don’t have PG
splitting / merging and backfill/recovery going on at the same time.
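i.e., something like this for each pool that will be growing (pool
name is a placeholder):

  ceph osd pool set <pool> bulk true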
US Production(HDD): Reef 18.2.4 Cephadm with 11 osd servers, 5 mons, 4 rgw,
2 iscsigw, 2 mds
UK Production(HDD): Reef 18.2.4 Cephadm with 20 osd servers, 5 mons, 4 rgw,
2 iscsigw, 2 mds
US Production(SSD): Reef 18.2.4 Cephadm with 6 osd servers, 5 mons, 4 rgw, 2
mds
UK Production(SSD): Reef 18.2.4 Cephadm with 6 osd servers, 5 mons, 4 rgw, 2
mds
I suspect FWIW that you would benefit from running an RGW on more
servers - any of them that have enough CPU / RAM.
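With cephadm that can be as simple as widening the placement, e.g.
(service name and label are placeholders; a YAML service spec works
just as well):

  ceph orch apply rgw <svc-id> --placement="label:rgw"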
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx