On Fri, Oct 20, 2017 at 6:32 AM, Josy <josy@xxxxxxxxxxxxxxxxxxxxx> wrote:
> Hi,
>
>>> have you checked the output of "ceph-disk list" on the nodes where the
>>> OSDs are not coming back on?
>
> Yes, it shows all the disks correctly mounted.
>
>>> And finally inspect /var/log/ceph/ceph-osd.${id}.log to see messages
>>> produced by the OSD itself when it starts.
>
> These are the error messages seen in one of the OSD log files. Even though
> the service is starting, the status still shows as down.
>
>
> =============================
>
> -7> 2017-10-19 13:16:15.589465 7efefcda4d00 5 osd.28 pg_epoch: 4312
> pg[33.11( empty local-lis/les=4270/4271 n=0 ec=4270/4270 lis/c 4270/4270
> les/c/f 4271/4271/0 4270/4270/4270) [1,28,12] r=1 lpr=0 crt=0'0 unknown
> NOTIFY] enter Reset
> -6> 2017-10-19 13:16:15.589476 7efefcda4d00 5 write_log_and_missing
> with: dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615,
> writeout_from: 4294967295'18446744073709551615, trimmed: , trimmed_dups: ,
> clear_divergent_priors: 0
> -5> 2017-10-19 13:16:15.591629 7efefcda4d00 5 osd.28 pg_epoch: 4312
> pg[33.10(unlocked)] enter Initial
> -4> 2017-10-19 13:16:15.591759 7efefcda4d00 5 osd.28 pg_epoch: 4312
> pg[33.10( empty local-lis/les=4270/4271 n=0 ec=4270/4270 lis/c 4270/4270
> les/c/f 4271/4271/0 4270/4270/4270) [8,17,28] r=2 lpr=0 crt=0'0 unknown
> NOTIFY] exit Initial 0.000130 0 0.000000
> -3> 2017-10-19 13:16:15.591786 7efefcda4d00 5 osd.28 pg_epoch: 4312
> pg[33.10( empty local-lis/les=4270/4271 n=0 ec=4270/4270 lis/c 4270/4270
> les/c/f 4271/4271/0 4270/4270/4270) [8,17,28] r=2 lpr=0 crt=0'0 unknown
> NOTIFY] enter Reset
> -2> 2017-10-19 13:16:15.591799 7efefcda4d00 5 write_log_and_missing
> with: dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615,
> writeout_from: 4294967295'18446744073709551615, trimmed: , trimmed_dups: ,
> clear_divergent_priors: 0
> -1> 2017-10-19 13:16:15.594757 7efefcda4d00 5 osd.28 pg_epoch: 4306
> pg[32.ds0(unlocked)] enter Initial
> 0> 2017-10-19 13:16:15.598295 7efefcda4d00 -1
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.1/rpm/el7/BUILD/ceph-12.2.1/src/osd/ECUtil.h:
> In function 'ECUtil::stripe_info_t::stripe_info_t(uint64_t, uint64_t)'
> thread 7efefcda4d00 time 2017-10-19 13:16:15.594821
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.1/rpm/el7/BUILD/ceph-12.2.1/src/osd/ECUtil.h:
> 38: FAILED assert(stripe_width % stripe_size == 0)

What does your erasure code profile look like for pool 32? (See the command
sketch appended at the end of this message.)

>
>
>
> On 20-10-2017 01:05, Jean-Charles Lopez wrote:
>>
>> Hi,
>>
>> have you checked the output of "ceph-disk list" on the nodes where the
>> OSDs are not coming back on?
>>
>> This should give you a hint on what's going on.
>>
>> Also use dmesg to search for any error messages.
>>
>> And finally inspect /var/log/ceph/ceph-osd.${id}.log to see messages
>> produced by the OSD itself when it starts.
>>
>> Regards,
>> JC
>>
>>> On Oct 19, 2017, at 12:11, Josy <josy@xxxxxxxxxxxxxxxxxxxxx> wrote:
>>>
>>> Hi,
>>>
>>> I am not able to start some of the OSDs in the cluster.
>>>
>>> This is a test cluster with 8 OSDs. One node was taken out for
>>> maintenance. I set the noout flag, and after the server came back up I
>>> unset the noout flag.
>>>
>>> Suddenly a couple of OSDs went down.
>>>
>>> And now I can start the OSDs manually from each node, but the status is
>>> still "down".
>>>
>>> $ ceph osd stat
>>> 8 osds: 2 up, 5 in
>>>
>>>
>>> $ ceph osd tree
>>> ID  CLASS  WEIGHT   TYPE NAME        STATUS  REWEIGHT  PRI-AFF
>>>  -1        7.97388  root default
>>>  -3        1.86469      host a1-osd
>>>   1   ssd  1.86469          osd.1      down         0  1.00000
>>>  -5        0.87320      host a2-osd
>>>   2   ssd  0.87320          osd.2      down         0  1.00000
>>>  -7        0.87320      host a3-osd
>>>   4   ssd  0.87320          osd.4      down   1.00000  1.00000
>>>  -9        0.87320      host a4-osd
>>>   8   ssd  0.87320          osd.8        up   1.00000  1.00000
>>> -11        0.87320      host a5-osd
>>>  12   ssd  0.87320          osd.12     down   1.00000  1.00000
>>> -13        0.87320      host a6-osd
>>>  17   ssd  0.87320          osd.17       up   1.00000  1.00000
>>> -15        0.87320      host a7-osd
>>>  21   ssd  0.87320          osd.21     down   1.00000  1.00000
>>> -17        0.87000      host a8-osd
>>>  28   ssd  0.87000          osd.28     down         0  1.00000
>>>
>>> I can also see this error on each OSD node.
>>>
>>> # systemctl status ceph-osd@1
>>> ● ceph-osd@1.service - Ceph object storage daemon osd.1
>>>    Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; enabled;
>>> vendor preset: disabled)
>>>    Active: failed (Result: start-limit) since Thu 2017-10-19 11:35:18
>>> PDT; 19min ago
>>>   Process: 4163 ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id
>>> %i --setuser ceph --setgroup ceph (code=killed, signal=ABRT)
>>>   Process: 4158 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh
>>> --cluster ${CLUSTER} --id %i (code=exited, status=0/SUCCESS)
>>>  Main PID: 4163 (code=killed, signal=ABRT)
>>>
>>> Oct 19 11:34:58 ceph-las1-a1-osd systemd[1]: Unit ceph-osd@1.service
>>> entered failed state.
>>> Oct 19 11:34:58 ceph-las1-a1-osd systemd[1]: ceph-osd@1.service failed.
>>> Oct 19 11:35:18 ceph-las1-a1-osd systemd[1]: ceph-osd@1.service holdoff
>>> time over, scheduling restart.
>>> Oct 19 11:35:18 ceph-las1-a1-osd systemd[1]: start request repeated too
>>> quickly for ceph-osd@1.service
>>> Oct 19 11:35:18 ceph-las1-a1-osd systemd[1]: Failed to start Ceph object
>>> storage daemon osd.1.
>>> Oct 19 11:35:18 ceph-las1-a1-osd systemd[1]: Unit ceph-osd@1.service
>>> entered failed state.
>>> Oct 19 11:35:18 ceph-las1-a1-osd systemd[1]: ceph-osd@1.service failed.

--
Cheers,
Brad
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
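
A minimal sketch of the commands that should answer the erasure-code-profile
question above and let a failed OSD produce its full error output again. The
pool name and profile name are placeholders to be read off the pool listing,
the cluster name is assumed to be the default "ceph", and osd.1 is used as the
example unit:

$ ceph osd pool ls detail
$ ceph osd pool get <pool-name> erasure_code_profile
$ ceph osd erasure-code-profile get <profile-name>

The first command maps pool id 32 to a pool name; the last one dumps the
profile (plugin, k, m, stripe_unit), which is presumably where the values
behind the failed assert (stripe_width % stripe_size == 0) come from.

To clear systemd's "start request repeated too quickly" state and retry, or to
run the daemon in the foreground so the assert and backtrace land on the
terminal (same invocation as the unit's ExecStart with the placeholders
filled in):

# systemctl reset-failed ceph-osd@1
# systemctl start ceph-osd@1

# /usr/bin/ceph-osd -f --cluster ceph --id 1 --setuser ceph --setgroup ceph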