On Fri, Oct 20, 2017 at 7:35 PM, Josy <josy@xxxxxxxxxxxxxxxxxxxxx> wrote:
> Hi,
>
>>> What does your erasure code profile look like for pool 32?
>
> $ ceph osd erasure-code-profile get myprofile
> crush-device-class=
> crush-failure-domain=host
> crush-root=default
> jerasure-per-chunk-alignment=false
> k=5
> m=3
> plugin=jerasure
> technique=reed_sol_van
> w=8

Sorry, can you post the output of 'ceph osd dump' as well please?

> On 20-10-2017 06:52, Brad Hubbard wrote:
>> On Fri, Oct 20, 2017 at 6:32 AM, Josy <josy@xxxxxxxxxxxxxxxxxxxxx> wrote:
>>> Hi,
>>>
>>>>> Have you checked the output of "ceph-disk list" on the nodes where the
>>>>> OSDs are not coming back on?
>>>
>>> Yes, it shows all the disks correctly mounted.
>>>
>>>>> And finally inspect /var/log/ceph/ceph-osd.${id}.log to see messages
>>>>> produced by the OSD itself when it starts.
>>>
>>> These are the error messages seen in one of the OSD log files. Even
>>> though the service is starting, the status still shows as down.
>>>
>>> =============================
>>>
>>>     -7> 2017-10-19 13:16:15.589465 7efefcda4d00  5 osd.28 pg_epoch: 4312
>>> pg[33.11( empty local-lis/les=4270/4271 n=0 ec=4270/4270 lis/c 4270/4270
>>> les/c/f 4271/4271/0 4270/4270/4270) [1,28,12] r=1 lpr=0 crt=0'0 unknown
>>> NOTIFY] enter Reset
>>>     -6> 2017-10-19 13:16:15.589476 7efefcda4d00  5 write_log_and_missing
>>> with: dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615,
>>> writeout_from: 4294967295'18446744073709551615, trimmed: ,
>>> trimmed_dups: , clear_divergent_priors: 0
>>>     -5> 2017-10-19 13:16:15.591629 7efefcda4d00  5 osd.28 pg_epoch: 4312
>>> pg[33.10(unlocked)] enter Initial
>>>     -4> 2017-10-19 13:16:15.591759 7efefcda4d00  5 osd.28 pg_epoch: 4312
>>> pg[33.10( empty local-lis/les=4270/4271 n=0 ec=4270/4270 lis/c 4270/4270
>>> les/c/f 4271/4271/0 4270/4270/4270) [8,17,28] r=2 lpr=0 crt=0'0 unknown
>>> NOTIFY] exit Initial 0.000130 0 0.000000
>>>     -3> 2017-10-19 13:16:15.591786 7efefcda4d00  5 osd.28 pg_epoch: 4312
>>> pg[33.10( empty local-lis/les=4270/4271 n=0 ec=4270/4270 lis/c 4270/4270
>>> les/c/f 4271/4271/0 4270/4270/4270) [8,17,28] r=2 lpr=0 crt=0'0 unknown
>>> NOTIFY] enter Reset
>>>     -2> 2017-10-19 13:16:15.591799 7efefcda4d00  5 write_log_and_missing
>>> with: dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615,
>>> writeout_from: 4294967295'18446744073709551615, trimmed: ,
>>> trimmed_dups: , clear_divergent_priors: 0
>>>     -1> 2017-10-19 13:16:15.594757 7efefcda4d00  5 osd.28 pg_epoch: 4306
>>> pg[32.ds0(unlocked)] enter Initial
>>>      0> 2017-10-19 13:16:15.598295 7efefcda4d00 -1
>>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.1/rpm/el7/BUILD/ceph-12.2.1/src/osd/ECUtil.h:
>>> In function 'ECUtil::stripe_info_t::stripe_info_t(uint64_t, uint64_t)'
>>> thread 7efefcda4d00 time 2017-10-19 13:16:15.594821
>>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.1/rpm/el7/BUILD/ceph-12.2.1/src/osd/ECUtil.h:
>>> 38: FAILED assert(stripe_width % stripe_size == 0)
>>
>> What does your erasure code profile look like for pool 32?
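For reference while you grab that: as far as I can read the 12.2.1 source,
that assert fires in ECUtil::stripe_info_t when the pool's stripe_width is
not an exact multiple of the data chunk count (k) that ECBackend hands in.
With k=5, the stripe_width recorded on pool 32 therefore has to be divisible
by 5. You can check it straight from the dump; something along these lines
(pool id 32 taken from the pg[32.ds0...] line in your log):

$ ceph osd dump | grep "^pool 32 "
$ echo $((4096 * 5))   # 20480 -- the stripe_width a k=5 pool with the
                       # default 4 KiB stripe unit should carry
$ echo $((20480 % 5))  # 0 -- passes the assert; a stripe_width of 4096
                       # would abort, since 4096 % 5 = 1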
>>>
>>> On 20-10-2017 01:05, Jean-Charles Lopez wrote:
>>>> Hi,
>>>>
>>>> Have you checked the output of "ceph-disk list" on the nodes where the
>>>> OSDs are not coming back on?
>>>>
>>>> This should give you a hint on what's going on.
>>>>
>>>> Also use dmesg to search for any error messages.
>>>>
>>>> And finally inspect /var/log/ceph/ceph-osd.${id}.log to see messages
>>>> produced by the OSD itself when it starts.
>>>>
>>>> Regards
>>>> JC
>>>>
>>>>> On Oct 19, 2017, at 12:11, Josy <josy@xxxxxxxxxxxxxxxxxxxxx> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I am not able to start some of the OSDs in the cluster.
>>>>>
>>>>> This is a test cluster with 8 OSDs. One node was taken out for
>>>>> maintenance. I set the noout flag, and after the server came back up
>>>>> I unset the noout flag.
>>>>>
>>>>> Suddenly a couple of OSDs went down.
>>>>>
>>>>> I can now start the OSDs manually from each node, but the status is
>>>>> still "down":
>>>>>
>>>>> $ ceph osd stat
>>>>> 8 osds: 2 up, 5 in
>>>>>
>>>>> $ ceph osd tree
>>>>> ID  CLASS WEIGHT  TYPE NAME       STATUS REWEIGHT PRI-AFF
>>>>>  -1       7.97388 root default
>>>>>  -3       1.86469     host a1-osd
>>>>>   1   ssd 1.86469         osd.1     down        0 1.00000
>>>>>  -5       0.87320     host a2-osd
>>>>>   2   ssd 0.87320         osd.2     down        0 1.00000
>>>>>  -7       0.87320     host a3-osd
>>>>>   4   ssd 0.87320         osd.4     down  1.00000 1.00000
>>>>>  -9       0.87320     host a4-osd
>>>>>   8   ssd 0.87320         osd.8       up  1.00000 1.00000
>>>>> -11       0.87320     host a5-osd
>>>>>  12   ssd 0.87320         osd.12    down  1.00000 1.00000
>>>>> -13       0.87320     host a6-osd
>>>>>  17   ssd 0.87320         osd.17      up  1.00000 1.00000
>>>>> -15       0.87320     host a7-osd
>>>>>  21   ssd 0.87320         osd.21    down  1.00000 1.00000
>>>>> -17       0.87000     host a8-osd
>>>>>  28   ssd 0.87000         osd.28    down        0 1.00000
>>>>>
>>>>> I can also see this error on each OSD node:
>>>>>
>>>>> # systemctl status ceph-osd@1
>>>>> ● ceph-osd@1.service - Ceph object storage daemon osd.1
>>>>>    Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; enabled;
>>>>> vendor preset: disabled)
>>>>>    Active: failed (Result: start-limit) since Thu 2017-10-19 11:35:18
>>>>> PDT; 19min ago
>>>>>   Process: 4163 ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER}
>>>>> --id %i --setuser ceph --setgroup ceph (code=killed, signal=ABRT)
>>>>>   Process: 4158 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh
>>>>> --cluster ${CLUSTER} --id %i (code=exited, status=0/SUCCESS)
>>>>>  Main PID: 4163 (code=killed, signal=ABRT)
>>>>>
>>>>> Oct 19 11:34:58 ceph-las1-a1-osd systemd[1]: Unit ceph-osd@1.service
>>>>> entered failed state.
>>>>> Oct 19 11:34:58 ceph-las1-a1-osd systemd[1]: ceph-osd@1.service failed.
>>>>> Oct 19 11:35:18 ceph-las1-a1-osd systemd[1]: ceph-osd@1.service holdoff
>>>>> time over, scheduling restart.
>>>>> Oct 19 11:35:18 ceph-las1-a1-osd systemd[1]: start request repeated too
>>>>> quickly for ceph-osd@1.service
>>>>> Oct 19 11:35:18 ceph-las1-a1-osd systemd[1]: Failed to start Ceph
>>>>> object storage daemon osd.1.
>>>>> Oct 19 11:35:18 ceph-las1-a1-osd systemd[1]: Unit ceph-osd@1.service
>>>>> entered failed state.
>>>>> Oct 19 11:35:18 ceph-las1-a1-osd systemd[1]: ceph-osd@1.service failed.

--
Cheers,
Brad
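PS: the "start request repeated too quickly" above is just systemd's start
limit kicking in because the daemon keeps aborting (signal=ABRT is the failed
assert), so restarting on its own won't get you anywhere. Once the
stripe_width problem is sorted out, something like the following should clear
the failed state and bring a daemon back up (osd.1 shown; repeat per OSD):

# systemctl reset-failed ceph-osd@1
# systemctl start ceph-osd@1

Until the assert itself stops firing, a restarted OSD will simply crash-loop
back into the same state.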