This has been my modus operandi when replacing drives. With only ~50 OSDs per drive type/pool, rebalancing can be a lengthy process, and in the case of SSDs, shuffling data adds unnecessary write wear to the disks. When migrating from filestore to bluestore, I would actually forklift an entire failure domain using the script below, together with the noout, norebalance and norecover flags. This kept CRUSH from pushing data around until I had all of the drives replaced, and then kept the cluster from trying to recover until I was ready.

> # $1 = numeric OSD id (e.g. 12, not osd.12)
> # $2 = data device name, e.g. sdX (used as /dev/$2)
> # $3 = NVMe DB/journal partition name, e.g. nvmeXnXpX (used as /dev/$3)
>
> sudo systemctl stop ceph-osd@$1.service
> sudo ceph-osd -i $1 --flush-journal
> sudo umount /var/lib/ceph/osd/ceph-$1
> sudo ceph-volume lvm zap /dev/$2
> ceph osd crush remove osd.$1
> ceph auth del osd.$1
> ceph osd rm osd.$1
> sudo ceph-volume lvm create --bluestore --data /dev/$2 --block.db /dev/$3

For a single drive, this stops it, removes it from CRUSH, and creates a new one (letting it retake the old/existing osd.id); after I unset the norebalance/norecover flags, it backfills from the other copies to the replaced drive and doesn't move data around. The script is somewhat specific to the filestore-to-bluestore migration, as the flush-journal command is no longer used with bluestore.

Hope that's helpful.

Reed
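Roughly, the flag handling around that script looks like this (a sketch only, using the standard ceph osd set/unset flags; the per-drive script above runs in between):

    # Freeze data movement before touching anything in the failure domain.
    ceph osd set noout
    ceph osd set norebalance
    ceph osd set norecover

    # ... run the per-drive replacement script for each OSD being swapped ...

    # Once every drive is back in, let the cluster backfill to the new OSDs.
    ceph osd unset norecover
    ceph osd unset norebalance
    ceph osd unset noout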
> On Aug 6, 2018, at 9:30 AM, Richard Hesketh <richard.hesketh@xxxxxxxxxxxx> wrote:
>
> Waiting for rebalancing is considered the safest way, since it ensures
> you retain your normal full number of replicas at all times. If you take
> the disk out before rebalancing is complete, you will be causing some
> PGs to lose a replica. That is a risk to your data redundancy, but it
> might be an acceptable one if you prefer to just get the disk replaced
> quickly.
>
> Personally, if running at 3+ replicas, briefly losing one isn't the end
> of the world; you'd still need two more simultaneous disk failures to
> actually lose data, though one further failure would cause inactive PGs
> (because you are running with min_size >= 2, right?). If running pools
> with only two replicas at size = 2, I absolutely would not remove a disk
> without waiting for rebalancing unless that disk was actively failing so
> badly that it was making rebalancing impossible.
>
> Rich
>
> On 06/08/18 15:20, Josef Zelenka wrote:
>> Hi, our procedure is usually (assuming the cluster was OK before the
>> failure, with 2 replicas in the CRUSH rule):
>>
>> 1. Stop the OSD process (to keep it from coming up and down and putting
>> load on the cluster).
>>
>> 2. Wait for the "reweight" to come to 0 (happens after 5 min I think -
>> it can be set manually, but I let it happen by itself).
>>
>> 3. Remove the OSD from the cluster (ceph auth del, ceph osd crush
>> remove, ceph osd rm).
>>
>> 4. Note down the journal partitions if needed.
>>
>> 5. Unmount the drive and replace the disk with the new one.
>>
>> 6. Ensure permissions are set to ceph:ceph in /dev.
>>
>> 7. mklabel gpt on the new drive.
>>
>> 8. Create the new OSD with ceph-disk prepare (this automatically adds
>> it to the CRUSH map).
>>
>> Your procedure sounds reasonable to me; as far as I'm concerned you
>> shouldn't have to wait for rebalancing after you remove the OSD. All
>> this might not be 100% per the Ceph books, but it works for us :)
>>
>> Josef
>>
>> On 06/08/18 16:15, Iztok Gregori wrote:
>>> Hi Everyone,
>>>
>>> Which is the best way to replace a failing (SMART Health Status:
>>> HARDWARE IMPENDING FAILURE) OSD hard disk?
>>>
>>> Normally I will:
>>>
>>> 1. set the OSD as out
>>> 2. wait for rebalancing
>>> 3. stop the OSD on the osd-server (unmount if needed)
>>> 4. purge the OSD from Ceph
>>> 5. physically replace the disk with the new one
>>> 6. with ceph-deploy:
>>>    6a. zap the new disk (just in case)
>>>    6b. create the new OSD
>>> 7. add the new OSD to the CRUSH map
>>> 8. wait for rebalancing
>>>
>>> My questions are:
>>>
>>> - Is my procedure reasonable?
>>> - What if I skip #2 and, instead of waiting for rebalancing, directly
>>> purge the OSD?
>>> - Is it better to reweight the OSD before taking it out?
>>>
>>> I'm running a Luminous (12.2.2) cluster with 332 OSDs; the failure
>>> domain is host.
>>>
>>> Thanks,
>>> Iztok
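As a rough sketch of the single-disk procedure discussed in this thread (Luminous-era commands, using ceph-volume directly rather than ceph-deploy; the OSD id, device and pool name are placeholders):

    # Confirm the pool can tolerate a missing replica (pool name is a placeholder).
    ceph osd pool get mypool min_size

    # Steps 1-2: mark the OSD out and wait until all PGs are active+clean again.
    ceph osd out osd.12
    ceph -s

    # Steps 3-4: stop the daemon and remove the OSD from the cluster.
    sudo systemctl stop ceph-osd@12.service
    sudo umount /var/lib/ceph/osd/ceph-12
    ceph osd crush remove osd.12
    ceph auth del osd.12
    ceph osd rm osd.12

    # Steps 5-8: physically swap the disk, then zap it and create the new OSD;
    # it is placed under its host in the CRUSH map and backfills automatically.
    sudo ceph-volume lvm zap /dev/sdX
    sudo ceph-volume lvm create --bluestore --data /dev/sdX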