Hi,
>> have you checked the output of "ceph-disk list" on the nodes where
>> the OSDs are not coming back on?
Yes, it shows all the disks correctly mounted.
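For reference, the check on each of the affected nodes was along the lines of:

# ceph-disk list
# mount | grep /var/lib/ceph/osd

and the OSD data partitions all show up mounted under /var/lib/ceph/osd/ as expected.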
>> And finally inspect /var/log/ceph/ceph-osd.${id}.log to see messages
>> produced by the OSD itself when it starts.
These are the error messages seen in one of the OSD log files. Even though
the service starts, the OSD status still shows as down.
=============================
-7> 2017-10-19 13:16:15.589465 7efefcda4d00 5 osd.28 pg_epoch: 4312
pg[33.11( empty local-lis/les=4270/4271 n=0 ec=4270/4270 lis/c 4270/4270
les/c/f 4271/4271/0 4270/4270/4270) [1,28,12] r=1 lpr=0 crt=0'0 unknown
NOTIFY] enter Reset
-6> 2017-10-19 13:16:15.589476 7efefcda4d00 5
write_log_and_missing with: dirty_to: 0'0, dirty_from:
4294967295'18446744073709551615, writeout_from:
4294967295'18446744073709551615, trimmed: , trimmed_dups: ,
clear_divergent_priors: 0
-5> 2017-10-19 13:16:15.591629 7efefcda4d00 5 osd.28 pg_epoch:
4312 pg[33.10(unlocked)] enter Initial
-4> 2017-10-19 13:16:15.591759 7efefcda4d00 5 osd.28 pg_epoch:
4312 pg[33.10( empty local-lis/les=4270/4271 n=0 ec=4270/4270 lis/c
4270/4270 les/c/f 4271/4271/0 4270/4270/4270) [8,17,28] r=2 lpr=0
crt=0'0 unknown NOTIFY] exit Initial 0.000130 0 0.000000
-3> 2017-10-19 13:16:15.591786 7efefcda4d00 5 osd.28 pg_epoch:
4312 pg[33.10( empty local-lis/les=4270/4271 n=0 ec=4270/4270 lis/c
4270/4270 les/c/f 4271/4271/0 4270/4270/4270) [8,17,28] r=2 lpr=0
crt=0'0 unknown NOTIFY] enter Reset
-2> 2017-10-19 13:16:15.591799 7efefcda4d00 5
write_log_and_missing with: dirty_to: 0'0, dirty_from:
4294967295'18446744073709551615, writeout_from:
4294967295'18446744073709551615, trimmed: , trimmed_dups: ,
clear_divergent_priors: 0
-1> 2017-10-19 13:16:15.594757 7efefcda4d00 5 osd.28 pg_epoch:
4306 pg[32.ds0(unlocked)] enter Initial
0> 2017-10-19 13:16:15.598295 7efefcda4d00 -1
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.1/rpm/el7/BUILD/ceph-12.2.1/src/osd/ECUtil.h:
In function 'ECUtil::stripe_info_t::stripe_info_t(uint64_t, uint64_t)'
thread 7efefcda4d00 time 2017-10-19 13:16:15.594821
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.1/rpm/el7/BUILD/ceph-12.2.1/src/osd/ECUtil.h:
38: FAILED assert(stripe_width % stripe_size == 0)
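=============================
The failed assert comes from the erasure-coding stripe setup in ECUtil.h
(it requires the stripe width to be an exact multiple of the stripe size),
so this looks related to the erasure-coded pool settings rather than to the
disks themselves. If it helps, I can post the profile and pool details,
presumably something like:

$ ceph osd erasure-code-profile ls
$ ceph osd erasure-code-profile get <profile-name>
$ ceph osd pool ls detail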
On 20-10-2017 01:05, Jean-Charles Lopez wrote:
Hi,
have you checked the output of "ceph-disk list" on the nodes where the OSDs are not coming back on?
This should give you a hint on what's going on.
Also use dmesg to search for any error messages.
And finally inspect /var/log/ceph/ceph-osd.${id}.log to see messages produced by the OSD itself when it starts.
Regards
JC
On Oct 19, 2017, at 12:11, Josy <josy@xxxxxxxxxxxxxxxxxxxxx> wrote:
Hi,
I am not able to start some of the OSDs in the cluster.
This is a test cluster with 8 OSDs. One node was taken out for maintenance; I set the noout flag, and after the server came back up I unset it.
Suddenly a couple of OSDs went down.
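For reference, the flag was set and cleared with the standard commands before and after the maintenance:

$ ceph osd set noout
$ ceph osd unset noout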
I can now start the OSD services manually on each node, but their status still shows as "down":
$ ceph osd stat
8 osds: 2 up, 5 in
$ ceph osd tree
 ID CLASS WEIGHT  TYPE NAME     STATUS REWEIGHT PRI-AFF
 -1       7.97388 root default
 -3       1.86469     host a1-osd
  1   ssd 1.86469         osd.1    down        0 1.00000
 -5       0.87320     host a2-osd
  2   ssd 0.87320         osd.2    down        0 1.00000
 -7       0.87320     host a3-osd
  4   ssd 0.87320         osd.4    down  1.00000 1.00000
 -9       0.87320     host a4-osd
  8   ssd 0.87320         osd.8      up  1.00000 1.00000
-11       0.87320     host a5-osd
 12   ssd 0.87320         osd.12   down  1.00000 1.00000
-13       0.87320     host a6-osd
 17   ssd 0.87320         osd.17     up  1.00000 1.00000
-15       0.87320     host a7-osd
 21   ssd 0.87320         osd.21   down  1.00000 1.00000
-17       0.87000     host a8-osd
 28   ssd 0.87000         osd.28   down        0 1.00000
I can also see this error on each of the OSD nodes:
# systemctl status ceph-osd@1
● ceph-osd@1.service - Ceph object storage daemon osd.1
   Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; enabled; vendor preset: disabled)
   Active: failed (Result: start-limit) since Thu 2017-10-19 11:35:18 PDT; 19min ago
  Process: 4163 ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id %i --setuser ceph --setgroup ceph (code=killed, signal=ABRT)
  Process: 4158 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id %i (code=exited, status=0/SUCCESS)
 Main PID: 4163 (code=killed, signal=ABRT)
Oct 19 11:34:58 ceph-las1-a1-osd systemd[1]: Unit ceph-osd@1.service entered failed state.
Oct 19 11:34:58 ceph-las1-a1-osd systemd[1]: ceph-osd@1.service failed.
Oct 19 11:35:18 ceph-las1-a1-osd systemd[1]: ceph-osd@1.service holdoff time over, scheduling restart.
Oct 19 11:35:18 ceph-las1-a1-osd systemd[1]: start request repeated too quickly for ceph-osd@1.service
Oct 19 11:35:18 ceph-las1-a1-osd systemd[1]: Failed to start Ceph object storage daemon osd.1.
Oct 19 11:35:18 ceph-las1-a1-osd systemd[1]: Unit ceph-osd@1.service entered failed state.
Oct 19 11:35:18 ceph-las1-a1-osd systemd[1]: ceph-osd@1.service failed.
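The start-limit state is just systemd giving up after the repeated aborts; I can clear it and start the daemon again by hand with something like the usual systemd commands, e.g.:

# systemctl reset-failed ceph-osd@1
# systemctl start ceph-osd@1
# journalctl -u ceph-osd@1 --no-pager

but the OSD still ends up marked down again.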
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com