It seems that I have been able to work around my issues. I attempted to reproduce by rebooting nodes and by stopping all OSDs, waiting a bit, and starting them again. At this time, no OSDs are crashing like before, and the OSDs have no problems starting either. What I did was remove the OSDs completely, one at a time, and redeploy them, allowing Ceph 14.2.1 to rebuild them.
Remove a disk:
1.) See which OSD is on which disk: sudo ceph-volume lvm list
2.) Mark the OSD out: ceph osd out X
     EX: synergy@synergy3:~$ ceph osd out 21
         marked out osd.21.
2.a) Mark it down: ceph osd down osd.X
     EX: ceph osd down osd.21
2.b) Stop the OSD daemon: sudo systemctl stop ceph-osd@X
     EX: sudo systemctl stop ceph-osd@21
2.c) Remove the OSD: ceph osd rm osd.X
     EX: ceph osd rm osd.21
3.) Check status: ceph -s
4.) Observe data migration: ceph -w
5.) Remove from CRUSH: ceph osd crush remove {name}
     EX: ceph osd crush remove osd.21
5.b) Delete its auth key: ceph auth del osd.21
6.) Find info on the disk: sudo hdparm -I /dev/sdd
7.) See SATA ports: lsscsi --verbose
8.) Go pull the disk and replace it, or keep it and do the following steps to re-use it.

Additional steps to remove and reuse a disk (without ejecting, as ejecting and replacing handles this for us; do this last, after following the Ceph docs for removing a disk):
9.) Zap the partition table: sudo gdisk /dev/sdX (expert menu: x, z, Y, Y)
9.a) Remove the leftover LVM mapping: lsblk, then
     dmsetup remove ceph--e36dc03d--bf0d--462a--b4e6--8e49819bec0b-osd--block--d5574ac1--f72f--4942--8f4a--ac24891b2ee6
10.) Deploy a /dev/sdX disk: from 216.106.44.209 (ceph-mon0), you must be in the "my_cluster" folder:
     EX: Synergy@Ceph-Mon0:~/my_cluster$ ceph-deploy osd create --data /dev/sdd synergy1

(A consolidated sketch of these commands follows after this list.)
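For what it's worth, here is a rough shell sketch that strings those removal steps together for one node. The OSD id, disk, and hostname are just the placeholders from my examples above; this is only how one might script it, not something I ran as-is, and in practice I run the pieces by hand and let the cluster settle between the "out" and the stop.

    #!/bin/bash
    # Sketch only: remove one OSD on the node that hosts it, then wipe the disk for reuse.
    # OSD_ID and DISK are placeholders; confirm the mapping first with: sudo ceph-volume lvm list
    OSD_ID=21
    DISK=/dev/sdd

    ceph osd out ${OSD_ID}                  # mark out, then wait for the reweight/backfill (watch ceph -s)
    ceph osd down osd.${OSD_ID}
    sudo systemctl stop ceph-osd@${OSD_ID}  # stop the daemon
    ceph osd rm osd.${OSD_ID}
    ceph osd crush remove osd.${OSD_ID}     # remove from the CRUSH map
    ceph auth del osd.${OSD_ID}             # delete its auth key

    # To reuse the disk without pulling it, zap the GPT and the leftover LVM mapping:
    sudo gdisk ${DISK}                      # expert menu: x, z, Y, Y
    lsblk                                   # note the leftover ceph--*-osd--block--* dm device
    # sudo dmsetup remove <ceph--...-osd--block--...>    # name differs per OSD, fill it in by hand

    # Then redeploy from the admin node (ceph-mon0), inside the my_cluster folder:
    # ceph-deploy osd create --data ${DISK} synergy1

I leave the dmsetup and ceph-deploy lines commented out because the dm device name and the admin-node context differ every time.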
I have attached the doc I use to accomplish this. *Before I do it, I mark the OSD as "out" via the GUI or CLI and allow it to reweight to 0%, monitored via ceph -s. I do this so that there is not an actual disk failure, which would put me into a dual-disk failure while I'm rebuilding an OSD. -Edward Kalk
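Concretely, that check looks something like the following (osd.21 is just an example id; ceph osd df is one way to see the reweight, and ceph -s shows the recovery progress):

    ceph osd out 21     # or mark it out in the dashboard
    ceph osd df         # REWEIGHT for osd.21 reads 0 once it is out
    watch ceph -s       # wait for recovery/backfill to finish before stopping or rebuilding the OSD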