Hi,
>> have you checked the output of "ceph-disk list" on the nodes where
>> the OSDs are not coming back on?
Yes, it shows all the disks correctly mounted.
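For reference, the check on each of the affected nodes was along the lines of:

# ceph-disk list
# mount | grep /var/lib/ceph/osd

and the OSD data partitions all show up mounted under /var/lib/ceph/osd/ as expected.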
>> And finally inspect /var/log/ceph/ceph-osd.${id}.log to see messages
>> produced by the OSD itself when it starts.
These are the error messages seen in one of the OSD log files. Even though
the service starts, the OSD status still shows as down.
=============================
-7> 2017-10-19 13:16:15.589465 7efefcda4d00 5 osd.28 pg_epoch: 4312
pg[33.11( empty local-lis/les=4270/4271 n=0 ec=4270/4270 lis/c 4270/4270
les/c/f 4271/4271/0 4270/4270/4270) [1,28,12] r=1 lpr=0 crt=0'0 unknown
NOTIFY] enter Reset
-6> 2017-10-19 13:16:15.589476 7efefcda4d00 5
write_log_and_missing with: dirty_to: 0'0, dirty_from:
4294967295'18446744073709551615, writeout_from:
4294967295'18446744073709551615, trimmed: , trimmed_dups: ,
clear_divergent_priors: 0
-5> 2017-10-19 13:16:15.591629 7efefcda4d00 5 osd.28 pg_epoch:
4312 pg[33.10(unlocked)] enter Initial
-4> 2017-10-19 13:16:15.591759 7efefcda4d00 5 osd.28 pg_epoch:
4312 pg[33.10( empty local-lis/les=4270/4271 n=0 ec=4270/4270 lis/c
4270/4270 les/c/f 4271/4271/0 4270/4270/4270) [8,17,28] r=2 lpr=0
crt=0'0 unknown NOTIFY] exit Initial 0.000130 0 0.000000
-3> 2017-10-19 13:16:15.591786 7efefcda4d00 5 osd.28 pg_epoch:
4312 pg[33.10( empty local-lis/les=4270/4271 n=0 ec=4270/4270 lis/c
4270/4270 les/c/f 4271/4271/0 4270/4270/4270) [8,17,28] r=2 lpr=0
crt=0'0 unknown NOTIFY] enter Reset
-2> 2017-10-19 13:16:15.591799 7efefcda4d00 5
write_log_and_missing with: dirty_to: 0'0, dirty_from:
4294967295'18446744073709551615, writeout_from:
4294967295'18446744073709551615, trimmed: , trimmed_dups: ,
clear_divergent_priors: 0
-1> 2017-10-19 13:16:15.594757 7efefcda4d00 5 osd.28 pg_epoch:
4306 pg[32.ds0(unlocked)] enter Initial
0> 2017-10-19 13:16:15.598295 7efefcda4d00 -1
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.1/rpm/el7/BUILD/ceph-12.2.1/src/osd/ECUtil.h:
In function 'ECUtil::stripe_info_t::stripe_info_t(uint64_t, uint64_t)'
thread 7efefcda4d00 time 2017-10-19 13:16:15.594821
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.1/rpm/el7/BUILD/ceph-12.2.1/src/osd/ECUtil.h:
38: FAILED assert(stripe_width % stripe_size == 0)
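=============================
The failed assert comes from the erasure-coding stripe setup in ECUtil.h
(it requires the stripe width to be an exact multiple of the stripe size),
so this looks related to the erasure-coded pool settings rather than to the
disks themselves. If it helps, I can post the profile and pool details,
presumably something like:

$ ceph osd erasure-code-profile ls
$ ceph osd erasure-code-profile get <profile-name>
$ ceph osd pool ls detail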
On 20-10-2017 01:05, Jean-Charles Lopez wrote:
Hi,
have you checked the output of "ceph-disk list" on the nodes where the OSDs are not coming back on?
This should give you a hint on what's going on.
Also use dmesg to search for any error messages.
And finally inspect /var/log/ceph/ceph-osd.${id}.log to see messages produced by the OSD itself when it starts.
Regards
JC
On Oct 19, 2017, at 12:11, Josy <josy@xxxxxxxxxxxxxxxxxxxxx> wrote:
Hi,
I am not able to start some of the OSDs in the cluster.
This is a test cluster with 8 OSDs. One node was taken out for maintenance; I set the noout flag, and after the server came back up I unset it.
Suddenly a couple of OSDs went down.
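For reference, the flag was set and cleared with the standard commands before and after the maintenance:

$ ceph osd set noout
$ ceph osd unset noout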
I can now start the OSD services manually on each node, but their status still shows as "down":
$ ceph osd stat
8 osds: 2 up, 5 in
$ ceph osd tree
 ID CLASS WEIGHT  TYPE NAME     STATUS REWEIGHT PRI-AFF
 -1       7.97388 root default
 -3       1.86469     host a1-osd
  1   ssd 1.86469         osd.1    down        0 1.00000
 -5       0.87320     host a2-osd
  2   ssd 0.87320         osd.2    down        0 1.00000
 -7       0.87320     host a3-osd
  4   ssd 0.87320         osd.4    down  1.00000 1.00000
 -9       0.87320     host a4-osd
  8   ssd 0.87320         osd.8      up  1.00000 1.00000
-11       0.87320     host a5-osd
 12   ssd 0.87320         osd.12   down  1.00000 1.00000
-13       0.87320     host a6-osd
 17   ssd 0.87320         osd.17     up  1.00000 1.00000
-15       0.87320     host a7-osd
 21   ssd 0.87320         osd.21   down  1.00000 1.00000
-17       0.87000     host a8-osd
 28   ssd 0.87000         osd.28   down        0 1.00000
I can also see this error on each of the OSD nodes:
# systemctl status ceph-osd@1
● ceph-osd@1.service - Ceph object storage daemon osd.1
   Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; enabled; vendor preset: disabled)
   Active: failed (Result: start-limit) since Thu 2017-10-19 11:35:18 PDT; 19min ago
  Process: 4163 ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id %i --setuser ceph --setgroup ceph (code=killed, signal=ABRT)
  Process: 4158 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id %i (code=exited, status=0/SUCCESS)
 Main PID: 4163 (code=killed, signal=ABRT)
Oct 19 11:34:58 ceph-las1-a1-osd systemd[1]: Unit ceph-osd@1.service entered failed state.
Oct 19 11:34:58 ceph-las1-a1-osd systemd[1]: ceph-osd@1.service failed.
Oct 19 11:35:18 ceph-las1-a1-osd systemd[1]: ceph-osd@1.service holdoff time over, scheduling restart.
Oct 19 11:35:18 ceph-las1-a1-osd systemd[1]: start request repeated too quickly for ceph-osd@1.service
Oct 19 11:35:18 ceph-las1-a1-osd systemd[1]: Failed to start Ceph object storage daemon osd.1.
Oct 19 11:35:18 ceph-las1-a1-osd systemd[1]: Unit ceph-osd@1.service entered failed state.
Oct 19 11:35:18 ceph-las1-a1-osd systemd[1]: ceph-osd@1.service failed.
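The start-limit state is just systemd giving up after the repeated aborts; I can clear it and start the daemon again by hand with something like the usual systemd commands, e.g.:

# systemctl reset-failed ceph-osd@1
# systemctl start ceph-osd@1
# journalctl -u ceph-osd@1 --no-pager

but the OSD still ends up marked down again.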
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com