Hi,
>> What does your erasure code profile look like for pool 32?
$ ceph osd erasure-code-profile get myprofile
crush-device-class=
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=5
m=3
plugin=jerasure
technique=reed_sol_van
w=8
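In case it helps narrow this down: as far as I can tell, the assert in the
log below is checking that the pool's stripe_width is an exact multiple of
k, the number of data chunks (5 here). The stripe_width that pool 32
actually carries should show up in its pool line, e.g.:

$ ceph osd dump | grep "^pool 32 "

With k=5 and the default 4096-byte stripe unit
(osd_pool_erasure_code_stripe_unit, if I have the option name right) that
would be 5 * 4096 = 20480, which divides evenly by 5, so a non-default
stripe_width or stripe unit on that pool would be worth ruling out. That is
just my reading of the assert, not a confirmed cause.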
On 20-10-2017 06:52, Brad Hubbard wrote:
On Fri, Oct 20, 2017 at 6:32 AM, Josy <josy@xxxxxxxxxxxxxxxxxxxxx> wrote:
Hi,
have you checked the output of "ceph-disk list" on the nodes where the
OSDs are not coming back on?
Yes, it shows all the disks correctly mounted.
And finally inspect /var/log/ceph/ceph-osd.${id}.log to see messages
produced by the OSD itself when it starts.
These are the error messages seen in one of the OSD log files. Even though
the service starts, the status still shows as down.
=============================
-7> 2017-10-19 13:16:15.589465 7efefcda4d00 5 osd.28 pg_epoch: 4312
pg[33.11( empty local-lis/les=4270/4271 n=0 ec=4270/4270 lis/c 4270/4270
les/c/f 4271/4271/0 4270/4270/4270) [1,28,12] r=1 lpr=0 crt=0'0 unknown
NOTIFY] enter Reset
-6> 2017-10-19 13:16:15.589476 7efefcda4d00 5 write_log_and_missing
with: dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615,
writeout_from: 4294967295'18446744073709551615, trimmed: , trimmed_dups: ,
clear_divergent_priors: 0
-5> 2017-10-19 13:16:15.591629 7efefcda4d00 5 osd.28 pg_epoch: 4312
pg[33.10(unlocked)] enter Initial
-4> 2017-10-19 13:16:15.591759 7efefcda4d00 5 osd.28 pg_epoch: 4312
pg[33.10( empty local-lis/les=4270/4271 n=0 ec=4270/4270 lis/c 4270/4270
les/c/f 4271/4271/0 4270/4270/4270) [8,17,28] r=2 lpr=0 crt=0'0 unknown
NOTIFY] exit Initial 0.000130 0 0.000000
-3> 2017-10-19 13:16:15.591786 7efefcda4d00 5 osd.28 pg_epoch: 4312
pg[33.10( empty local-lis/les=4270/4271 n=0 ec=4270/4270 lis/c 4270/4270
les/c/f 4271/4271/0 4270/4270/4270) [8,17,28] r=2 lpr=0 crt=0'0 unknown
NOTIFY] enter Reset
-2> 2017-10-19 13:16:15.591799 7efefcda4d00 5 write_log_and_missing
with: dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615,
writeout_from: 4294967295'18446744073709551615, trimmed: , trimmed_dups: ,
clear_divergent_priors: 0
-1> 2017-10-19 13:16:15.594757 7efefcda4d00 5 osd.28 pg_epoch: 4306
pg[32.ds0(unlocked)] enter Initial
0> 2017-10-19 13:16:15.598295 7efefcda4d00 -1
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.1/rpm/el7/BUILD/ceph-12.2.1/src/osd/ECUtil.h:
In function 'ECUtil::stripe_info_t::stripe_info_t(uint64_t, uint64_t)'
thread 7efefcda4d00 time 2017-10-19 13:16:15.594821
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.1/rpm/el7/BUILD/ceph-12.2.1/src/osd/ECUtil.h:
38: FAILED assert(stripe_width % stripe_size == 0)
What does your erasure code profile look like for pool 32?
On 20-10-2017 01:05, Jean-Charles Lopez wrote:
Hi,
have you checked the output of "ceph-disk list" on the nodes where the
OSDs are not coming back on?
This should give you a hint on what's going on.
Also use dmesg to search for any error messages.
And finally inspect /var/log/ceph/ceph-osd.${id}.log to see messages
produced by the OSD itself when it starts.
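For example, with osd.1 standing in for one of the OSDs that won't come up
(adjust the id as needed):

# ceph-disk list
# dmesg -T | grep -i -e error -e fail
# tail -n 100 /var/log/ceph/ceph-osd.1.log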
Regards
JC
On Oct 19, 2017, at 12:11, Josy <josy@xxxxxxxxxxxxxxxxxxxxx> wrote:
Hi,
I am not able to start some of the OSDs in the cluster.
This is a test cluster with 8 OSDs. One node was taken out for
maintenance. I set the noout flag, and after the server came back up I
unset it.
Suddenly a couple of OSDs went down.
I can now start the OSDs manually on each node, but their status is
still "down".
$ ceph osd stat
8 osds: 2 up, 5 in
$ ceph osd tree
ID  CLASS WEIGHT  TYPE NAME        STATUS REWEIGHT PRI-AFF
 -1       7.97388 root default
 -3       1.86469     host a1-osd
  1   ssd 1.86469         osd.1      down        0 1.00000
 -5       0.87320     host a2-osd
  2   ssd 0.87320         osd.2      down        0 1.00000
 -7       0.87320     host a3-osd
  4   ssd 0.87320         osd.4      down  1.00000 1.00000
 -9       0.87320     host a4-osd
  8   ssd 0.87320         osd.8        up  1.00000 1.00000
-11       0.87320     host a5-osd
 12   ssd 0.87320         osd.12     down  1.00000 1.00000
-13       0.87320     host a6-osd
 17   ssd 0.87320         osd.17       up  1.00000 1.00000
-15       0.87320     host a7-osd
 21   ssd 0.87320         osd.21     down  1.00000 1.00000
-17       0.87000     host a8-osd
 28   ssd 0.87000         osd.28     down        0 1.00000
I can also see this error on each OSD node.
# systemctl status ceph-osd@1
● ceph-osd@1.service - Ceph object storage daemon osd.1
Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; enabled;
vendor preset: disabled)
Active: failed (Result: start-limit) since Thu 2017-10-19 11:35:18
PDT; 19min ago
Process: 4163 ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id
%i --setuser ceph --setgroup ceph (code=killed, signal=ABRT)
Process: 4158 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh
--cluster ${CLUSTER} --id %i (code=exited, status=0/SUCCESS)
Main PID: 4163 (code=killed, signal=ABRT)
Oct 19 11:34:58 ceph-las1-a1-osd systemd[1]: Unit ceph-osd@1.service
entered failed state.
Oct 19 11:34:58 ceph-las1-a1-osd systemd[1]: ceph-osd@1.service failed.
Oct 19 11:35:18 ceph-las1-a1-osd systemd[1]: ceph-osd@1.service holdoff
time over, scheduling restart.
Oct 19 11:35:18 ceph-las1-a1-osd systemd[1]: start request repeated too
quickly for ceph-osd@1.service
Oct 19 11:35:18 ceph-las1-a1-osd systemd[1]: Failed to start Ceph object
storage daemon osd.1.
Oct 19 11:35:18 ceph-las1-a1-osd systemd[1]: Unit ceph-osd@1.service
entered failed state.
Oct 19 11:35:18 ceph-las1-a1-osd systemd[1]: ceph-osd@1.service failed.
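For reference, the "start request repeated too quickly" message above just
means systemd gave up retrying after the daemon aborted repeatedly
(signal=ABRT). Clearing that state and running the OSD in the foreground is
one way to capture the full abort, something like the following, assuming
the default cluster name "ceph":

# systemctl reset-failed ceph-osd@1
# /usr/bin/ceph-osd -f --cluster ceph --id 1 --setuser ceph --setgroup ceph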
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com