Re: Substitute a predicted failure (not yet failed) osd

Hi Christian


If we go by the subject line, your data is still all there and valid (or
at least mostly valid).
Also, is that an actual RAID0, with multiple drives?
If so, why?

It's a RAID 0 of one disk. The controller we use offers single-disk RAID 0 as the only way to expose individual drives.

Cheers
G.

That just massively increases your failure probabilities AND the amount of
affected data when it fails.

Anyway, if that OSD is still working (a rough command sketch follows the steps below):

1. noout
2. stop osd
3. copy the data 100% off (dd, cp -a, rsync -a)
4. replace disk(s)
5. copy the data back in
6. start osd
7. unset noout
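
A rough sketch of those steps, assuming a systemd-managed FileStore OSD with the hypothetical id 12 mounted at /var/lib/ceph/osd/ceph-12, and a spare location /mnt/backup with enough free space (rsync's -X flag is added on top of -a so the xattrs FileStore keeps on its files survive the copy):

# ceph osd set noout
# systemctl stop ceph-osd@12
# rsync -aX /var/lib/ceph/osd/ceph-12/ /mnt/backup/osd-12/
# (replace the drive, recreate the filesystem and mount it again at /var/lib/ceph/osd/ceph-12)
# rsync -aX /mnt/backup/osd-12/ /var/lib/ceph/osd/ceph-12/
# systemctl start ceph-osd@12
# ceph osd unset noout

A dd of the whole partition onto the new disk also works, as long as the new disk is at least as large as the old one.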

Christian

On Mon, 15 Aug 2016 02:50:31 +0000 David Turner wrote:

If you are trying to reduce extra data movement, set and unset the nobackfill and norecover flags when you do the same for noout. You will want to follow the instructions to fully remove the osd from the cluster, including outing the osd, removing it from the crush map, removing its auth from the cluster, and finally removing the osd from the cluster. After that, adding the osd back in should give it the same osd id that the former one had. If you make sure that the id is the same and the weight in the crush map is the same (you can do this by saving your crush map before you remove the osd and uploading the same crush map after you add it back in with the same id), then the only data movement will be onto the re-added osd and nothing else.
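
For reference, a hedged sketch of that full remove-and-re-add sequence, using a hypothetical osd.12 and Jewel-era command names (adjust to whatever tooling created your OSDs):

# ceph osd set noout
# ceph osd set nobackfill
# ceph osd set norecover
# ceph osd getcrushmap -o crushmap.before
# systemctl stop ceph-osd@12
# ceph osd out 12
# ceph osd crush remove osd.12
# ceph auth del osd.12
# ceph osd rm 12
# (replace the drive and recreate the OSD; with the old entry fully removed it should come back as osd.12)
# ceph osd setcrushmap -i crushmap.before
# ceph osd unset norecover
# ceph osd unset nobackfill
# ceph osd unset noout

Restoring the saved crush map puts osd.12 back with exactly the weight it had before, so only the backfill onto the re-added OSD remains.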

________________________________

David Turner | Cloud Operations Engineer | StorageCraft Technology Corporation <https://storagecraft.com>
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2760 | Mobile: 385.224.2943

________________________________

If you are not the intended recipient of this message or received it erroneously, please notify the sender and delete it, together with any attachments, and be advised that any dissemination or copying of this message is prohibited.

________________________________

________________________________________
From: ceph-users [ceph-users-bounces@xxxxxxxxxxxxxx] on behalf of Goncalo Borges [goncalo.borges@xxxxxxxxxxxxx]
Sent: Sunday, August 14, 2016 5:47 AM
To: ceph-users@xxxxxxxx
Subject:  Substitute a predicted failure (not yet failed) osd

Hi cephfers

I have a really simple question: the documentation always describes the procedure for substituting disks that have already failed. Currently I have a predicted failure on a RAID 0 OSD, and I would like to substitute the drive before it fails, without having to replicate PGs once the OSD is removed from the crush map and then replicate again once I add the new drive.

Can I safely perform the following actions to achieve my goal? (A concrete command sketch with a hypothetical OSD id follows the list.)

# ceph osd set noout
# stop the osd
# unmount the osd
# remove it from crush map
# substitute the drive
# recreate the osd
# ceph osd unset noout
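
In concrete commands, assuming a hypothetical osd.12 mounted at /var/lib/ceph/osd/ceph-12 on a systemd host with Jewel-era ceph-disk tooling, those steps would look roughly like this:

# ceph osd set noout
# systemctl stop ceph-osd@12
# umount /var/lib/ceph/osd/ceph-12
# ceph osd crush remove osd.12
# (substitute the drive)
# ceph-disk prepare /dev/sdX        (or however your deployment tooling recreates OSDs)
# ceph osd unset noout

Note that removing the OSD from the crush map is what triggers the first round of data movement this question is trying to avoid.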

Cheers
Goncalo




--
Goncalo Borges
Research Computing
ARC Centre of Excellence for Particle Physics at the Terascale
School of Physics A28 | University of Sydney, NSW  2006
T: +61 2 93511937

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


