Hi,
this morning osd.77 in my Ceph Nautilus cluster (144 OSDs on 9 hosts)
seemed not to be working correctly; it was causing slow ops:
ceph -s
cluster:
id: 7397a0cf-bfc6-4d25-aabb-be9f6564a13b
health: HEALTH_WARN
Reduced data availability: 6 pgs inactive, 8 pgs peering
62 slow ops, oldest one blocked for 2703 sec, osd.77 has
slow ops
Before that I had installed Ubuntu security updates and then rebooted the
host with osd.77. Even before rebooting I saw some read errors on the
console, so the disk behind osd.77 is probably dying. I saw similar
cluster behaviour some months ago: another OSD had slow ops and half a
day later its disk died, so I replaced it.
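Before replacing anything I still plan to double-check the disk itself on
the OSD host, probably with something like this (assuming /dev/sdX is the
data disk behind osd.77):

# dmesg -T | grep sdX
# smartctl -a /dev/sdX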
With this history I wanted to take osd.77 down and out of the cluster to
replace the disk, but I was unsure how to do this. I thought the
following should be correct:
# ceph osd down 77
# ceph osd out 77
# ceph osd destroy 77
Would this in general be the right way to prepare for a disk replacement?
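For the record, the fuller sequence I had pieced together from the docs
looks roughly like this (the device name /dev/sdX, and that the
systemctl/ceph-volume steps are run on the OSD host, are my assumptions,
so please correct me if this is wrong):

# ceph osd out 77
# while ! ceph osd safe-to-destroy osd.77 ; do sleep 60 ; done
# systemctl stop ceph-osd@77                          # on the OSD host
# ceph osd destroy 77 --yes-i-really-mean-it
# ceph-volume lvm zap /dev/sdX --destroy              # on the OSD host, old disk
  ... physically swap the disk ...
# ceph-volume lvm create --osd-id 77 --data /dev/sdX  # on the OSD host, new disk

As far as I understand, "destroy" keeps the OSD id and its CRUSH position,
so the new disk should come back as osd.77 and only that one OSD's data
has to be backfilled.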
Then something strange but good happened after the "ceph osd down 77":
the command ran without an error, but "ceph -s" still showed all OSDs up
and in. I had expected one OSD to be down now, but it wasn't.
Even stranger, the slow ops from osd.77 are also gone for the moment and
the cluster is completely healthy again.
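For now I am just keeping an eye on osd.77 and the cluster with things
like:

# ceph health detail
# ceph osd tree
# ceph osd find 77

in case the slow ops come back.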
Thanks for your help
Rainer
--
Rainer Krienke, Uni Koblenz, Rechenzentrum, A22, Universitaetsstrasse 1
56070 Koblenz, Tel: +49261287 1312 Fax +49261287 100 1312
Web: http://userpages.uni-koblenz.de/~krienke
PGP: http://userpages.uni-koblenz.de/~krienke/mypgp.html