Hi,
>> What does your erasure code profile look like for pool 32?
$ ceph osd erasure-code-profile get myprofile
crush-device-class=
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=5
m=3
plugin=jerasure
technique=reed_sol_van
w=8
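In case it helps narrow this down: as far as I can tell, the assert in the
log below is checking that the pool's stripe_width is an exact multiple of
k, the number of data chunks (5 here). The stripe_width that pool 32
actually carries should show up in its pool line, e.g.:

$ ceph osd dump | grep "^pool 32 "

With k=5 and the default 4096-byte stripe unit
(osd_pool_erasure_code_stripe_unit, if I have the option name right) that
would be 5 * 4096 = 20480, which divides evenly by 5, so a non-default
stripe_width or stripe unit on that pool would be worth ruling out. That is
just my reading of the assert, not a confirmed cause.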
On 20-10-2017 06:52, Brad Hubbard wrote:
On Fri, Oct 20, 2017 at 6:32 AM, Josy <josy@xxxxxxxxxxxxxxxxxxxxx> wrote:
Hi,
have you checked the output of "ceph-disk list" on the nodes where the
OSDs are not coming back on?
Yes, it shows all the disks correctly mounted.
And finally inspect /var/log/ceph/ceph-osd.${id}.log to see messages
produced by the OSD itself when it starts.
These are the error messages seen in one of the OSD log files. Even though
the service starts, the status still shows as down.
=============================
-7> 2017-10-19 13:16:15.589465 7efefcda4d00 5 osd.28 pg_epoch: 4312
pg[33.11( empty local-lis/les=4270/4271 n=0 ec=4270/4270 lis/c 4270/4270
les/c/f 4271/4271/0 4270/4270/4270) [1,28,12] r=1 lpr=0 crt=0'0 unknown
NOTIFY] enter Reset
-6> 2017-10-19 13:16:15.589476 7efefcda4d00 5 write_log_and_missing
with: dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615,
writeout_from: 4294967295'18446744073709551615, trimmed: , trimmed_dups: ,
clear_divergent_priors: 0
-5> 2017-10-19 13:16:15.591629 7efefcda4d00 5 osd.28 pg_epoch: 4312
pg[33.10(unlocked)] enter Initial
-4> 2017-10-19 13:16:15.591759 7efefcda4d00 5 osd.28 pg_epoch: 4312
pg[33.10( empty local-lis/les=4270/4271 n=0 ec=4270/4270 lis/c 4270/4270
les/c/f 4271/4271/0 4270/4270/4270) [8,17,28] r=2 lpr=0 crt=0'0 unknown
NOTIFY] exit Initial 0.000130 0 0.000000
-3> 2017-10-19 13:16:15.591786 7efefcda4d00 5 osd.28 pg_epoch: 4312
pg[33.10( empty local-lis/les=4270/4271 n=0 ec=4270/4270 lis/c 4270/4270
les/c/f 4271/4271/0 4270/4270/4270) [8,17,28] r=2 lpr=0 crt=0'0 unknown
NOTIFY] enter Reset
-2> 2017-10-19 13:16:15.591799 7efefcda4d00 5 write_log_and_missing
with: dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615,
writeout_from: 4294967295'18446744073709551615, trimmed: , trimmed_dups: ,
clear_divergent_priors: 0
-1> 2017-10-19 13:16:15.594757 7efefcda4d00 5 osd.28 pg_epoch: 4306
pg[32.ds0(unlocked)] enter Initial
0> 2017-10-19 13:16:15.598295 7efefcda4d00 -1
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.1/rpm/el7/BUILD/ceph-12.2.1/src/osd/ECUtil.h:
In function 'ECUtil::stripe_info_t::stripe_info_t(uint64_t, uint64_t)'
thread 7efefcda4d00 time 2017-10-19 13:16:15.594821
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.1/rpm/el7/BUILD/ceph-12.2.1/src/osd/ECUtil.h:
38: FAILED assert(stripe_width % stripe_size == 0)
What does your erasure code profile look like for pool 32?
On 20-10-2017 01:05, Jean-Charles Lopez wrote:
Hi,
have you checked the output of "ceph-disk list" on the nodes where the
OSDs are not coming back on?
This should give you a hint on what's going on.
Also use dmesg to search for any error messages.
And finally inspect /var/log/ceph/ceph-osd.${id}.log to see messages
produced by the OSD itself when it starts.
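For example, with osd.1 standing in for one of the OSDs that won't come up
(adjust the id as needed):

# ceph-disk list
# dmesg -T | grep -i -e error -e fail
# tail -n 100 /var/log/ceph/ceph-osd.1.log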
Regards
JC
On Oct 19, 2017, at 12:11, Josy <josy@xxxxxxxxxxxxxxxxxxxxx> wrote:
Hi,
I am not able to start some of the OSDs in the cluster.
This is a test cluster with 8 OSDs. One node was taken out for
maintenance. I set the noout flag, and after the server came back up I
unset it.
Suddenly a couple of OSDs went down.
I can now start the OSDs manually on each node, but their status is
still "down".
$ ceph osd stat
8 osds: 2 up, 5 in
$ ceph osd tree
ID  CLASS WEIGHT  TYPE NAME        STATUS REWEIGHT PRI-AFF
 -1       7.97388 root default
 -3       1.86469     host a1-osd
  1   ssd 1.86469         osd.1      down        0 1.00000
 -5       0.87320     host a2-osd
  2   ssd 0.87320         osd.2      down        0 1.00000
 -7       0.87320     host a3-osd
  4   ssd 0.87320         osd.4      down  1.00000 1.00000
 -9       0.87320     host a4-osd
  8   ssd 0.87320         osd.8        up  1.00000 1.00000
-11       0.87320     host a5-osd
 12   ssd 0.87320         osd.12     down  1.00000 1.00000
-13       0.87320     host a6-osd
 17   ssd 0.87320         osd.17       up  1.00000 1.00000
-15       0.87320     host a7-osd
 21   ssd 0.87320         osd.21     down  1.00000 1.00000
-17       0.87000     host a8-osd
 28   ssd 0.87000         osd.28     down        0 1.00000
I can also see this error on each OSD node.
# systemctl status ceph-osd@1
● ceph-osd@1.service - Ceph object storage daemon osd.1
Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; enabled;
vendor preset: disabled)
Active: failed (Result: start-limit) since Thu 2017-10-19 11:35:18
PDT; 19min ago
Process: 4163 ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id
%i --setuser ceph --setgroup ceph (code=killed, signal=ABRT)
Process: 4158 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh
--cluster ${CLUSTER} --id %i (code=exited, status=0/SUCCESS)
Main PID: 4163 (code=killed, signal=ABRT)
Oct 19 11:34:58 ceph-las1-a1-osd systemd[1]: Unit ceph-osd@1.service
entered failed state.
Oct 19 11:34:58 ceph-las1-a1-osd systemd[1]: ceph-osd@1.service failed.
Oct 19 11:35:18 ceph-las1-a1-osd systemd[1]: ceph-osd@1.service holdoff
time over, scheduling restart.
Oct 19 11:35:18 ceph-las1-a1-osd systemd[1]: start request repeated too
quickly for ceph-osd@1.service
Oct 19 11:35:18 ceph-las1-a1-osd systemd[1]: Failed to start Ceph object
storage daemon osd.1.
Oct 19 11:35:18 ceph-las1-a1-osd systemd[1]: Unit ceph-osd@1.service
entered failed state.
Oct 19 11:35:18 ceph-las1-a1-osd systemd[1]: ceph-osd@1.service failed.
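For reference, the "start request repeated too quickly" message above just
means systemd gave up retrying after the daemon aborted repeatedly
(signal=ABRT). Clearing that state and running the OSD in the foreground is
one way to capture the full abort, something like the following, assuming
the default cluster name "ceph":

# systemctl reset-failed ceph-osd@1
# /usr/bin/ceph-osd -f --cluster ceph --id 1 --setuser ceph --setgroup ceph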
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com