This has been my modus operandi when replacing drives. With only ~50 OSDs per drive type/pool, rebalancing can be a lengthy process, and in the case of SSDs, shuffling data adds unnecessary write wear to the disks. When migrating from filestore to bluestore, I would actually forklift an entire failure domain using the script below, together with the noout, norebalance and norecover flags. This kept CRUSH from pushing data around until I had all of the drives replaced, and then kept the cluster from trying to recover until I was ready.

> # $1 = numeric OSD id (e.g. 12, not osd.12)
> # $2 = data device name, e.g. sdX (used as /dev/$2)
> # $3 = NVMe DB/journal partition name, e.g. nvmeXnXpX (used as /dev/$3)
>
> sudo systemctl stop ceph-osd@$1.service
> sudo ceph-osd -i $1 --flush-journal
> sudo umount /var/lib/ceph/osd/ceph-$1
> sudo ceph-volume lvm zap /dev/$2
> ceph osd crush remove osd.$1
> ceph auth del osd.$1
> ceph osd rm osd.$1
> sudo ceph-volume lvm create --bluestore --data /dev/$2 --block.db /dev/$3

For a single drive, this stops it, removes it from CRUSH, and creates a new one (letting it retake the old/existing osd.id); after I unset the norebalance/norecover flags, it backfills from the other copies to the replaced drive and doesn't move data around. The script is somewhat specific to the filestore-to-bluestore migration, as the flush-journal command is no longer used with bluestore.

Hope that's helpful.

Reed
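Roughly, the flag handling around that script looks like this (a sketch only, using the standard ceph osd set/unset flags; the per-drive script above runs in between):

    # Freeze data movement before touching anything in the failure domain.
    ceph osd set noout
    ceph osd set norebalance
    ceph osd set norecover

    # ... run the per-drive replacement script for each OSD being swapped ...

    # Once every drive is back in, let the cluster backfill to the new OSDs.
    ceph osd unset norecover
    ceph osd unset norebalance
    ceph osd unset noout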
> On Aug 6, 2018, at 9:30 AM, Richard Hesketh <richard.hesketh@xxxxxxxxxxxx> wrote:
>
> Waiting for rebalancing is considered the safest way, since it ensures
> you retain your normal full number of replicas at all times. If you take
> the disk out before rebalancing is complete, you will be causing some
> PGs to lose a replica. That is a risk to your data redundancy, but it
> might be an acceptable one if you prefer to just get the disk replaced
> quickly.
>
> Personally, if running at 3+ replicas, briefly losing one isn't the end
> of the world; you'd still need two more simultaneous disk failures to
> actually lose data, though one further failure would cause inactive PGs
> (because you are running with min_size >= 2, right?). If running pools
> with only two replicas at size = 2, I absolutely would not remove a disk
> without waiting for rebalancing unless that disk was actively failing so
> badly that it was making rebalancing impossible.
>
> Rich
>
> On 06/08/18 15:20, Josef Zelenka wrote:
>> Hi, our procedure is usually (assuming the cluster was OK before the
>> failure, with 2 replicas in the CRUSH rule):
>>
>> 1. Stop the OSD process (to keep it from coming up and down and putting
>> load on the cluster).
>>
>> 2. Wait for the "reweight" to come to 0 (happens after 5 min I think -
>> it can be set manually, but I let it happen by itself).
>>
>> 3. Remove the OSD from the cluster (ceph auth del, ceph osd crush
>> remove, ceph osd rm).
>>
>> 4. Note down the journal partitions if needed.
>>
>> 5. Unmount the drive and replace the disk with the new one.
>>
>> 6. Ensure permissions are set to ceph:ceph in /dev.
>>
>> 7. mklabel gpt on the new drive.
>>
>> 8. Create the new OSD with ceph-disk prepare (this automatically adds
>> it to the CRUSH map).
>>
>> Your procedure sounds reasonable to me; as far as I'm concerned you
>> shouldn't have to wait for rebalancing after you remove the OSD. All
>> this might not be 100% per the Ceph books, but it works for us :)
>>
>> Josef
>>
>> On 06/08/18 16:15, Iztok Gregori wrote:
>>> Hi Everyone,
>>>
>>> Which is the best way to replace a failing (SMART Health Status:
>>> HARDWARE IMPENDING FAILURE) OSD hard disk?
>>>
>>> Normally I will:
>>>
>>> 1. set the OSD as out
>>> 2. wait for rebalancing
>>> 3. stop the OSD on the osd-server (unmount if needed)
>>> 4. purge the OSD from Ceph
>>> 5. physically replace the disk with the new one
>>> 6. with ceph-deploy:
>>>    6a. zap the new disk (just in case)
>>>    6b. create the new OSD
>>> 7. add the new OSD to the CRUSH map
>>> 8. wait for rebalancing
>>>
>>> My questions are:
>>>
>>> - Is my procedure reasonable?
>>> - What if I skip #2 and, instead of waiting for rebalancing, directly
>>> purge the OSD?
>>> - Is it better to reweight the OSD before taking it out?
>>>
>>> I'm running a Luminous (12.2.2) cluster with 332 OSDs; the failure
>>> domain is host.
>>>
>>> Thanks,
>>> Iztok
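As a rough sketch of the single-disk procedure discussed in this thread (Luminous-era commands, using ceph-volume directly rather than ceph-deploy; the OSD id, device and pool name are placeholders):

    # Confirm the pool can tolerate a missing replica (pool name is a placeholder).
    ceph osd pool get mypool min_size

    # Steps 1-2: mark the OSD out and wait until all PGs are active+clean again.
    ceph osd out osd.12
    ceph -s

    # Steps 3-4: stop the daemon and remove the OSD from the cluster.
    sudo systemctl stop ceph-osd@12.service
    sudo umount /var/lib/ceph/osd/ceph-12
    ceph osd crush remove osd.12
    ceph auth del osd.12
    ceph osd rm osd.12

    # Steps 5-8: physically swap the disk, then zap it and create the new OSD;
    # it is placed under its host in the CRUSH map and backfills automatically.
    sudo ceph-volume lvm zap /dev/sdX
    sudo ceph-volume lvm create --bluestore --data /dev/sdX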