On Fri, Oct 20, 2017 at 7:35 PM, Josy <josy@xxxxxxxxxxxxxxxxxxxxx> wrote:
> Hi,
>
>>> What does your erasure code profile look like for pool 32?
>
> $ ceph osd erasure-code-profile get myprofile
> crush-device-class=
> crush-failure-domain=host
> crush-root=default
> jerasure-per-chunk-alignment=false
> k=5
> m=3
> plugin=jerasure
> technique=reed_sol_van
> w=8

Sorry, can you post the output of 'ceph osd dump' as well please?

> On 20-10-2017 06:52, Brad Hubbard wrote:
>> On Fri, Oct 20, 2017 at 6:32 AM, Josy <josy@xxxxxxxxxxxxxxxxxxxxx> wrote:
>>> Hi,
>>>
>>>>> Have you checked the output of "ceph-disk list" on the nodes where the
>>>>> OSDs are not coming back on?
>>>
>>> Yes, it shows all the disks correctly mounted.
>>>
>>>>> And finally inspect /var/log/ceph/ceph-osd.${id}.log to see messages
>>>>> produced by the OSD itself when it starts.
>>>
>>> These are the error messages seen in one of the OSD log files. Even
>>> though the service is starting, the status still shows as down.
>>>
>>> =============================
>>>
>>>     -7> 2017-10-19 13:16:15.589465 7efefcda4d00  5 osd.28 pg_epoch: 4312
>>> pg[33.11( empty local-lis/les=4270/4271 n=0 ec=4270/4270 lis/c 4270/4270
>>> les/c/f 4271/4271/0 4270/4270/4270) [1,28,12] r=1 lpr=0 crt=0'0 unknown
>>> NOTIFY] enter Reset
>>>     -6> 2017-10-19 13:16:15.589476 7efefcda4d00  5 write_log_and_missing
>>> with: dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615,
>>> writeout_from: 4294967295'18446744073709551615, trimmed: ,
>>> trimmed_dups: , clear_divergent_priors: 0
>>>     -5> 2017-10-19 13:16:15.591629 7efefcda4d00  5 osd.28 pg_epoch: 4312
>>> pg[33.10(unlocked)] enter Initial
>>>     -4> 2017-10-19 13:16:15.591759 7efefcda4d00  5 osd.28 pg_epoch: 4312
>>> pg[33.10( empty local-lis/les=4270/4271 n=0 ec=4270/4270 lis/c 4270/4270
>>> les/c/f 4271/4271/0 4270/4270/4270) [8,17,28] r=2 lpr=0 crt=0'0 unknown
>>> NOTIFY] exit Initial 0.000130 0 0.000000
>>>     -3> 2017-10-19 13:16:15.591786 7efefcda4d00  5 osd.28 pg_epoch: 4312
>>> pg[33.10( empty local-lis/les=4270/4271 n=0 ec=4270/4270 lis/c 4270/4270
>>> les/c/f 4271/4271/0 4270/4270/4270) [8,17,28] r=2 lpr=0 crt=0'0 unknown
>>> NOTIFY] enter Reset
>>>     -2> 2017-10-19 13:16:15.591799 7efefcda4d00  5 write_log_and_missing
>>> with: dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615,
>>> writeout_from: 4294967295'18446744073709551615, trimmed: ,
>>> trimmed_dups: , clear_divergent_priors: 0
>>>     -1> 2017-10-19 13:16:15.594757 7efefcda4d00  5 osd.28 pg_epoch: 4306
>>> pg[32.ds0(unlocked)] enter Initial
>>>      0> 2017-10-19 13:16:15.598295 7efefcda4d00 -1
>>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.1/rpm/el7/BUILD/ceph-12.2.1/src/osd/ECUtil.h:
>>> In function 'ECUtil::stripe_info_t::stripe_info_t(uint64_t, uint64_t)'
>>> thread 7efefcda4d00 time 2017-10-19 13:16:15.594821
>>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.1/rpm/el7/BUILD/ceph-12.2.1/src/osd/ECUtil.h:
>>> 38: FAILED assert(stripe_width % stripe_size == 0)
>>
>> What does your erasure code profile look like for pool 32?
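For reference while you grab that: as far as I can read the 12.2.1 source,
that assert fires in ECUtil::stripe_info_t when the pool's stripe_width is
not an exact multiple of the data chunk count (k) that ECBackend hands in.
With k=5, the stripe_width recorded on pool 32 therefore has to be divisible
by 5. You can check it straight from the dump; something along these lines
(pool id 32 taken from the pg[32.ds0...] line in your log):

$ ceph osd dump | grep "^pool 32 "
$ echo $((4096 * 5))   # 20480 -- the stripe_width a k=5 pool with the
                       # default 4 KiB stripe unit should carry
$ echo $((20480 % 5))  # 0 -- passes the assert; a stripe_width of 4096
                       # would abort, since 4096 % 5 = 1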
>>>
>>> On 20-10-2017 01:05, Jean-Charles Lopez wrote:
>>>> Hi,
>>>>
>>>> Have you checked the output of "ceph-disk list" on the nodes where the
>>>> OSDs are not coming back on?
>>>>
>>>> This should give you a hint on what's going on.
>>>>
>>>> Also use dmesg to search for any error messages.
>>>>
>>>> And finally inspect /var/log/ceph/ceph-osd.${id}.log to see messages
>>>> produced by the OSD itself when it starts.
>>>>
>>>> Regards
>>>> JC
>>>>
>>>>> On Oct 19, 2017, at 12:11, Josy <josy@xxxxxxxxxxxxxxxxxxxxx> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I am not able to start some of the OSDs in the cluster.
>>>>>
>>>>> This is a test cluster with 8 OSDs. One node was taken out for
>>>>> maintenance. I set the noout flag, and after the server came back up
>>>>> I unset the noout flag.
>>>>>
>>>>> Suddenly a couple of OSDs went down.
>>>>>
>>>>> I can now start the OSDs manually from each node, but the status is
>>>>> still "down":
>>>>>
>>>>> $ ceph osd stat
>>>>> 8 osds: 2 up, 5 in
>>>>>
>>>>> $ ceph osd tree
>>>>> ID  CLASS WEIGHT  TYPE NAME       STATUS REWEIGHT PRI-AFF
>>>>>  -1       7.97388 root default
>>>>>  -3       1.86469     host a1-osd
>>>>>   1   ssd 1.86469         osd.1     down        0 1.00000
>>>>>  -5       0.87320     host a2-osd
>>>>>   2   ssd 0.87320         osd.2     down        0 1.00000
>>>>>  -7       0.87320     host a3-osd
>>>>>   4   ssd 0.87320         osd.4     down  1.00000 1.00000
>>>>>  -9       0.87320     host a4-osd
>>>>>   8   ssd 0.87320         osd.8       up  1.00000 1.00000
>>>>> -11       0.87320     host a5-osd
>>>>>  12   ssd 0.87320         osd.12    down  1.00000 1.00000
>>>>> -13       0.87320     host a6-osd
>>>>>  17   ssd 0.87320         osd.17      up  1.00000 1.00000
>>>>> -15       0.87320     host a7-osd
>>>>>  21   ssd 0.87320         osd.21    down  1.00000 1.00000
>>>>> -17       0.87000     host a8-osd
>>>>>  28   ssd 0.87000         osd.28    down        0 1.00000
>>>>>
>>>>> I can also see this error on each OSD node:
>>>>>
>>>>> # systemctl status ceph-osd@1
>>>>> ● ceph-osd@1.service - Ceph object storage daemon osd.1
>>>>>    Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; enabled;
>>>>> vendor preset: disabled)
>>>>>    Active: failed (Result: start-limit) since Thu 2017-10-19 11:35:18
>>>>> PDT; 19min ago
>>>>>   Process: 4163 ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER}
>>>>> --id %i --setuser ceph --setgroup ceph (code=killed, signal=ABRT)
>>>>>   Process: 4158 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh
>>>>> --cluster ${CLUSTER} --id %i (code=exited, status=0/SUCCESS)
>>>>>  Main PID: 4163 (code=killed, signal=ABRT)
>>>>>
>>>>> Oct 19 11:34:58 ceph-las1-a1-osd systemd[1]: Unit ceph-osd@1.service
>>>>> entered failed state.
>>>>> Oct 19 11:34:58 ceph-las1-a1-osd systemd[1]: ceph-osd@1.service failed.
>>>>> Oct 19 11:35:18 ceph-las1-a1-osd systemd[1]: ceph-osd@1.service holdoff
>>>>> time over, scheduling restart.
>>>>> Oct 19 11:35:18 ceph-las1-a1-osd systemd[1]: start request repeated too
>>>>> quickly for ceph-osd@1.service
>>>>> Oct 19 11:35:18 ceph-las1-a1-osd systemd[1]: Failed to start Ceph
>>>>> object storage daemon osd.1.
>>>>> Oct 19 11:35:18 ceph-las1-a1-osd systemd[1]: Unit ceph-osd@1.service
>>>>> entered failed state.
>>>>> Oct 19 11:35:18 ceph-las1-a1-osd systemd[1]: ceph-osd@1.service failed.

--
Cheers,
Brad
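PS: the "start request repeated too quickly" above is just systemd's start
limit kicking in because the daemon keeps aborting (signal=ABRT is the failed
assert), so restarting on its own won't get you anywhere. Once the
stripe_width problem is sorted out, something like the following should clear
the failed state and bring a daemon back up (osd.1 shown; repeat per OSD):

# systemctl reset-failed ceph-osd@1
# systemctl start ceph-osd@1

Until the assert itself stops firing, a restarted OSD will simply crash-loop
back into the same state.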