On Fri, Oct 20, 2017 at 6:32 AM, Josy <josy@xxxxxxxxxxxxxxxxxxxxx> wrote:
> Hi,
>
>>> have you checked the output of "ceph-disk list" on the nodes where the
>>> OSDs are not coming back on?
>
> Yes, it shows all the disks correctly mounted.
>
>>> And finally inspect /var/log/ceph/ceph-osd.${id}.log to see messages
>>> produced by the OSD itself when it starts.
>
> These are the error messages seen in one of the OSD log files. Even though
> the service is starting, the status still shows as down.
>
>
> =============================
>
> -7> 2017-10-19 13:16:15.589465 7efefcda4d00 5 osd.28 pg_epoch: 4312
> pg[33.11( empty local-lis/les=4270/4271 n=0 ec=4270/4270 lis/c 4270/4270
> les/c/f 4271/4271/0 4270/4270/4270) [1,28,12] r=1 lpr=0 crt=0'0 unknown
> NOTIFY] enter Reset
> -6> 2017-10-19 13:16:15.589476 7efefcda4d00 5 write_log_and_missing
> with: dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615,
> writeout_from: 4294967295'18446744073709551615, trimmed: , trimmed_dups: ,
> clear_divergent_priors: 0
> -5> 2017-10-19 13:16:15.591629 7efefcda4d00 5 osd.28 pg_epoch: 4312
> pg[33.10(unlocked)] enter Initial
> -4> 2017-10-19 13:16:15.591759 7efefcda4d00 5 osd.28 pg_epoch: 4312
> pg[33.10( empty local-lis/les=4270/4271 n=0 ec=4270/4270 lis/c 4270/4270
> les/c/f 4271/4271/0 4270/4270/4270) [8,17,28] r=2 lpr=0 crt=0'0 unknown
> NOTIFY] exit Initial 0.000130 0 0.000000
> -3> 2017-10-19 13:16:15.591786 7efefcda4d00 5 osd.28 pg_epoch: 4312
> pg[33.10( empty local-lis/les=4270/4271 n=0 ec=4270/4270 lis/c 4270/4270
> les/c/f 4271/4271/0 4270/4270/4270) [8,17,28] r=2 lpr=0 crt=0'0 unknown
> NOTIFY] enter Reset
> -2> 2017-10-19 13:16:15.591799 7efefcda4d00 5 write_log_and_missing
> with: dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615,
> writeout_from: 4294967295'18446744073709551615, trimmed: , trimmed_dups: ,
> clear_divergent_priors: 0
> -1> 2017-10-19 13:16:15.594757 7efefcda4d00 5 osd.28 pg_epoch: 4306
> pg[32.ds0(unlocked)] enter Initial
> 0> 2017-10-19 13:16:15.598295 7efefcda4d00 -1
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.1/rpm/el7/BUILD/ceph-12.2.1/src/osd/ECUtil.h:
> In function 'ECUtil::stripe_info_t::stripe_info_t(uint64_t, uint64_t)'
> thread 7efefcda4d00 time 2017-10-19 13:16:15.594821
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.1/rpm/el7/BUILD/ceph-12.2.1/src/osd/ECUtil.h:
> 38: FAILED assert(stripe_width % stripe_size == 0)

What does your erasure code profile look like for pool 32? (See the command
sketch appended at the end of this message.)

>
>
>
> On 20-10-2017 01:05, Jean-Charles Lopez wrote:
>>
>> Hi,
>>
>> have you checked the output of "ceph-disk list" on the nodes where the
>> OSDs are not coming back on?
>>
>> This should give you a hint on what's going on.
>>
>> Also use dmesg to search for any error messages.
>>
>> And finally inspect /var/log/ceph/ceph-osd.${id}.log to see messages
>> produced by the OSD itself when it starts.
>>
>> Regards,
>> JC
>>
>>> On Oct 19, 2017, at 12:11, Josy <josy@xxxxxxxxxxxxxxxxxxxxx> wrote:
>>>
>>> Hi,
>>>
>>> I am not able to start some of the OSDs in the cluster.
>>>
>>> This is a test cluster with 8 OSDs. One node was taken out for
>>> maintenance. I set the noout flag, and after the server came back up I
>>> unset the noout flag.
>>>
>>> Suddenly a couple of OSDs went down.
>>>
>>> And now I can start the OSDs manually from each node, but the status is
>>> still "down".
>>>
>>> $ ceph osd stat
>>> 8 osds: 2 up, 5 in
>>>
>>>
>>> $ ceph osd tree
>>> ID  CLASS  WEIGHT   TYPE NAME        STATUS  REWEIGHT  PRI-AFF
>>>  -1        7.97388  root default
>>>  -3        1.86469      host a1-osd
>>>   1   ssd  1.86469          osd.1      down         0  1.00000
>>>  -5        0.87320      host a2-osd
>>>   2   ssd  0.87320          osd.2      down         0  1.00000
>>>  -7        0.87320      host a3-osd
>>>   4   ssd  0.87320          osd.4      down   1.00000  1.00000
>>>  -9        0.87320      host a4-osd
>>>   8   ssd  0.87320          osd.8        up   1.00000  1.00000
>>> -11        0.87320      host a5-osd
>>>  12   ssd  0.87320          osd.12     down   1.00000  1.00000
>>> -13        0.87320      host a6-osd
>>>  17   ssd  0.87320          osd.17       up   1.00000  1.00000
>>> -15        0.87320      host a7-osd
>>>  21   ssd  0.87320          osd.21     down   1.00000  1.00000
>>> -17        0.87000      host a8-osd
>>>  28   ssd  0.87000          osd.28     down         0  1.00000
>>>
>>> I can also see this error on each OSD node.
>>>
>>> # systemctl status ceph-osd@1
>>> ● ceph-osd@1.service - Ceph object storage daemon osd.1
>>>    Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; enabled;
>>> vendor preset: disabled)
>>>    Active: failed (Result: start-limit) since Thu 2017-10-19 11:35:18
>>> PDT; 19min ago
>>>   Process: 4163 ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id
>>> %i --setuser ceph --setgroup ceph (code=killed, signal=ABRT)
>>>   Process: 4158 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh
>>> --cluster ${CLUSTER} --id %i (code=exited, status=0/SUCCESS)
>>>  Main PID: 4163 (code=killed, signal=ABRT)
>>>
>>> Oct 19 11:34:58 ceph-las1-a1-osd systemd[1]: Unit ceph-osd@1.service
>>> entered failed state.
>>> Oct 19 11:34:58 ceph-las1-a1-osd systemd[1]: ceph-osd@1.service failed.
>>> Oct 19 11:35:18 ceph-las1-a1-osd systemd[1]: ceph-osd@1.service holdoff
>>> time over, scheduling restart.
>>> Oct 19 11:35:18 ceph-las1-a1-osd systemd[1]: start request repeated too
>>> quickly for ceph-osd@1.service
>>> Oct 19 11:35:18 ceph-las1-a1-osd systemd[1]: Failed to start Ceph object
>>> storage daemon osd.1.
>>> Oct 19 11:35:18 ceph-las1-a1-osd systemd[1]: Unit ceph-osd@1.service
>>> entered failed state.
>>> Oct 19 11:35:18 ceph-las1-a1-osd systemd[1]: ceph-osd@1.service failed.

--
Cheers,
Brad
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
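
A minimal sketch of the commands that should answer the erasure-code-profile
question above and let a failed OSD produce its full error output again. The
pool name and profile name are placeholders to be read off the pool listing,
the cluster name is assumed to be the default "ceph", and osd.1 is used as the
example unit:

$ ceph osd pool ls detail
$ ceph osd pool get <pool-name> erasure_code_profile
$ ceph osd erasure-code-profile get <profile-name>

The first command maps pool id 32 to a pool name; the last one dumps the
profile (plugin, k, m, stripe_unit), which is presumably where the values
behind the failed assert (stripe_width % stripe_size == 0) come from.

To clear systemd's "start request repeated too quickly" state and retry, or to
run the daemon in the foreground so the assert and backtrace land on the
terminal (same invocation as the unit's ExecStart with the placeholders
filled in):

# systemctl reset-failed ceph-osd@1
# systemctl start ceph-osd@1

# /usr/bin/ceph-osd -f --cluster ceph --id 1 --setuser ceph --setgroup ceph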