Hi,

Have you checked the output of "ceph-disk list" on the nodes where the OSDs are not coming back up? This should give you a hint about what's going on. Also use dmesg to search for any error messages, and finally inspect /var/log/ceph/ceph-osd.${id}.log to see the messages produced by the OSD itself when it starts.

Regards,
JC

> On Oct 19, 2017, at 12:11, Josy <josy@xxxxxxxxxxxxxxxxxxxxx> wrote:
>
> Hi,
>
> I am not able to start some of the OSDs in the cluster.
>
> This is a test cluster with 8 OSDs. One node was taken out for maintenance. I set the noout flag, and after the server came back up I unset it.
>
> Suddenly a couple of OSDs went down.
>
> I can start the OSDs manually from each node, but their status is still "down":
>
> $ ceph osd stat
> 8 osds: 2 up, 5 in
>
> $ ceph osd tree
> ID  CLASS WEIGHT  TYPE NAME       STATUS REWEIGHT PRI-AFF
> -1        7.97388 root default
> -3        1.86469     host a1-osd
>  1    ssd 1.86469         osd.1     down        0 1.00000
> -5        0.87320     host a2-osd
>  2    ssd 0.87320         osd.2     down        0 1.00000
> -7        0.87320     host a3-osd
>  4    ssd 0.87320         osd.4     down  1.00000 1.00000
> -9        0.87320     host a4-osd
>  8    ssd 0.87320         osd.8       up  1.00000 1.00000
> -11       0.87320     host a5-osd
> 12    ssd 0.87320         osd.12    down  1.00000 1.00000
> -13       0.87320     host a6-osd
> 17    ssd 0.87320         osd.17      up  1.00000 1.00000
> -15       0.87320     host a7-osd
> 21    ssd 0.87320         osd.21    down  1.00000 1.00000
> -17       0.87000     host a8-osd
> 28    ssd 0.87000         osd.28    down        0 1.00000
>
> I can also see this error on each OSD node:
>
> # systemctl status ceph-osd@1
> ● ceph-osd@1.service - Ceph object storage daemon osd.1
>    Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; enabled; vendor preset: disabled)
>    Active: failed (Result: start-limit) since Thu 2017-10-19 11:35:18 PDT; 19min ago
>   Process: 4163 ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id %i --setuser ceph --setgroup ceph (code=killed, signal=ABRT)
>   Process: 4158 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id %i (code=exited, status=0/SUCCESS)
>  Main PID: 4163 (code=killed, signal=ABRT)
>
> Oct 19 11:34:58 ceph-las1-a1-osd systemd[1]: Unit ceph-osd@1.service entered failed state.
> Oct 19 11:34:58 ceph-las1-a1-osd systemd[1]: ceph-osd@1.service failed.
> Oct 19 11:35:18 ceph-las1-a1-osd systemd[1]: ceph-osd@1.service holdoff time over, scheduling restart.
> Oct 19 11:35:18 ceph-las1-a1-osd systemd[1]: start request repeated too quickly for ceph-osd@1.service
> Oct 19 11:35:18 ceph-las1-a1-osd systemd[1]: Failed to start Ceph object storage daemon osd.1.
> Oct 19 11:35:18 ceph-las1-a1-osd systemd[1]: Unit ceph-osd@1.service entered failed state.
> Oct 19 11:35:18 ceph-las1-a1-osd systemd[1]: ceph-osd@1.service failed.
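To make the checks above concrete, here is a minimal sketch as it might be run on the a1 node, using osd.1 from your systemctl output. The device name sdb in the dmesg filter is a hypothetical example; substitute the actual data disk backing the OSD:

    # List the disks/partitions ceph-disk knows about and their OSD roles
    ceph-disk list

    # Search the kernel ring buffer for I/O errors on the OSD's data disk
    # (sdb is a placeholder device name)
    dmesg | grep -iE 'error|sdb'

    # signal=ABRT in the unit status usually means the daemon hit an assert;
    # the assert message and backtrace land at the end of the OSD log
    tail -n 100 /var/log/ceph/ceph-osd.1.log

One more point: "start request repeated too quickly" means the unit has hit systemd's start-limit, so it will refuse to start again even after the underlying problem is fixed. Clear the failed state first, then retry and watch the journal:

    systemctl reset-failed ceph-osd@1
    systemctl start ceph-osd@1
    journalctl -u ceph-osd@1 --since "10 minutes ago"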