Hello,

If we go by the subject line, your data is still all there and valid (or
at least mostly valid).

Also, is that an actual RAID0, with multiple drives? If so, why? That
massively increases both your failure probability AND the amount of data
affected when it fails.

Anyway, if that OSD is still working:

1. ceph osd set noout
2. stop the OSD
3. copy the data 100% off (dd, cp -a, rsync -a)
4. replace the disk(s)
5. copy the data back in
6. start the OSD
7. ceph osd unset noout

Christian

On Mon, 15 Aug 2016 02:50:31 +0000 David Turner wrote:

> If you are trying to reduce extra data movement, set and unset the
> nobackfill and norecover flags when you do the same for noout. You will
> want to follow the instructions to fully remove the OSD from the
> cluster: out the OSD, remove it from the CRUSH map, remove its auth
> from the cluster, and finally remove the OSD from the cluster. After
> that, adding the OSD back in should give it the same OSD id that the
> former one had. If you make sure the id is the same and the weight in
> the CRUSH map is the same (you can do this by saving your CRUSH map
> before you remove the OSD and uploading the same CRUSH map after you
> add it back in with the same id), then the only data movement will be
> onto the re-added OSD and nothing else.
>
> David Turner | Cloud Operations Engineer | StorageCraft Technology Corporation <https://storagecraft.com>
> 380 Data Drive Suite 300 | Draper | Utah | 84020
> Office: 801.871.2760 | Mobile: 385.224.2943
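The copy-off-and-back procedure above could look roughly like the
following shell session. This is only a sketch: the OSD id (12), the
mount point and the backup path are made-up examples, and the service
names assume systemd; adjust everything to your own setup. The -X flag
on rsync matters because OSD data relies on extended attributes.

```
# Hypothetical example: osd.12 mounted at /var/lib/ceph/osd/ceph-12

# Keep the cluster from marking the OSD out and rebalancing while we work
ceph osd set noout

# Stop the OSD daemon (systemd syntax; older setups use /etc/init.d/ceph)
systemctl stop ceph-osd@12

# Copy the data 100% off, preserving permissions and extended attributes
rsync -aX /var/lib/ceph/osd/ceph-12/ /backup/osd-12/

# ... physically replace the disk(s), create a new filesystem and mount
# it back at /var/lib/ceph/osd/ceph-12 ...

# Copy the data back in
rsync -aX /backup/osd-12/ /var/lib/ceph/osd/ceph-12/

# Start the OSD again and re-enable normal out/rebalance behaviour
systemctl start ceph-osd@12
ceph osd unset noout
```

Because the OSD comes back with exactly the same data, id and CRUSH
weight, no PGs need to move anywhere else.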
> From: ceph-users [ceph-users-bounces@xxxxxxxxxxxxxx] on behalf of Goncalo Borges [goncalo.borges@xxxxxxxxxxxxx]
> Sent: Sunday, August 14, 2016 5:47 AM
> To: ceph-users@xxxxxxxx
> Subject: Substitute a predicted failure (not yet failed) osd
>
> Hi cephers
>
> I have a really simple question: the documentation always describes the
> procedure for substituting disks that have already failed. I currently
> have a predicted failure in a RAID0 OSD, and I would like to substitute
> the drive before it fails, without having PGs replicated once the OSD
> is removed from the CRUSH map and then replicated again once I add the
> new drive.
>
> Can I perform the following actions safely to achieve my goal?
>
> # ceph osd set noout
> # stop the osd
> # unmount the osd
> # remove it from crush map
> # substitute the drive
> # recreate the osd
> # ceph osd unset noout
>
> Cheers
> Goncalo
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
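For reference, the remove-and-re-add variant David describes (keeping
the same OSD id and CRUSH weight so only the re-added OSD receives
data) could be sketched as follows. Again, osd.12 is a made-up example
id, the crushmap.bin filename is arbitrary, and the exact recreation
step depends on your Ceph release (ceph-disk on older releases,
ceph-volume on newer ones), so treat this as an outline rather than a
recipe.

```
# Suppress out-marking, backfill and recovery while the OSD is swapped
ceph osd set noout
ceph osd set nobackfill
ceph osd set norecover

# Save the current CRUSH map so the old id/weight can be restored later
ceph osd getcrushmap -o crushmap.bin

# Fully remove the OSD from the cluster
systemctl stop ceph-osd@12
ceph osd out 12
ceph osd crush remove osd.12
ceph auth del osd.12
ceph osd rm 12

# ... replace the drive and recreate the OSD; since osd ids are reused,
# with no other gaps in the id space it should come back as osd.12 ...

# Restore the saved CRUSH map so the weight matches the former OSD
ceph osd setcrushmap -i crushmap.bin

# Re-enable recovery, backfill and normal out behaviour
ceph osd unset norecover
ceph osd unset nobackfill
ceph osd unset noout
```

With the id and weight unchanged, the only backfill that happens is the
data flowing onto the re-added OSD itself.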