After upgrading from 0.94.9 to Jewel 10.2.5 on Ubuntu 14.04 OSDs fail to start with a crash dump

Hi Ceph experts,

After upgrading from Ceph 0.94.9 to Ceph 10.2.5 on Ubuntu 14.04, 2 out of 3 OSD processes on one machine are unable to start. The same happened on another machine, but only for 1 out of 3 OSDs.

The upgrade was performed with ceph-deploy 1.5.37.

It shouldn't be a permissions problem, because before upgrading I run chown 64045:64045 on the OSD disks /dev/sd[bcd] and on the (separate) journal partitions on the SSD, /dev/sda[678].
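
For reference, a minimal sketch of how that ownership could be double-checked with Python 3 on the OSD host. The helper names (resolve_user, ownership_mismatches) are hypothetical and not part of Ceph or ceph-deploy; the device patterns are the ones given above. It may be worth comparing against the ids the ceph account actually resolves to on the host, since the chown uses the numeric id 64045 while the log further down shows the daemon switching to 1001:1001 (ceph:ceph). Whether that difference has anything to do with the crash is not established here.

```python
import glob
import os
import pwd


def resolve_user(name):
    """Return (uid, gid) of a local account, or None if it does not exist."""
    try:
        entry = pwd.getpwnam(name)
    except KeyError:
        return None
    return entry.pw_uid, entry.pw_gid


def ownership_mismatches(patterns, uid, gid):
    """List (path, owner_uid, owner_gid) for matched paths not owned by uid:gid.

    Each pattern is expanded with glob, so shell-style classes such as
    /dev/sd[bcd] work. Patterns that match nothing are silently skipped.
    """
    mismatches = []
    for pattern in patterns:
        for path in sorted(glob.glob(pattern)):
            st = os.stat(path)
            if (st.st_uid, st.st_gid) != (uid, gid):
                mismatches.append((path, st.st_uid, st.st_gid))
    return mismatches


# Hypothetical usage on the OSD host (not executed here):
#   ids = resolve_user("ceph")
#   if ids is not None:
#       print(ownership_mismatches(["/dev/sd[bcd]", "/dev/sda[678]"], *ids))
```

An empty list would mean every matched device node is owned by the ids passed in; anything else lists the node together with its current owner.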

When the upgrade procedure completes, the 3 Ceph OSD processes are still running, but if I restart them, some of them refuse to start.

 

The log /var/log/ceph/ceph-osd.271.log is full of errors like this:

 

2017-02-13 09:47:17.590843 7fc57248f800  0 set uid:gid to 1001:1001 (ceph:ceph)
2017-02-13 09:47:17.590859 7fc57248f800  0 ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367), process ceph-osd, pid 187128
2017-02-13 09:47:17.591356 7fc57248f800  0 pidfile_write: ignore empty --pid-file
2017-02-13 09:47:17.601186 7fc57248f800  0 filestore(/var/lib/ceph/osd/ceph-271) backend xfs (magic 0x58465342)
2017-02-13 09:47:17.601530 7fc57248f800  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-271) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2017-02-13 09:47:17.601539 7fc57248f800  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-271) detect_features: SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole' config option
2017-02-13 09:47:17.601553 7fc57248f800  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-271) detect_features: splice is supported
2017-02-13 09:47:17.613611 7fc57248f800  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-271) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2017-02-13 09:47:17.613673 7fc57248f800  0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-271) detect_feature: extsize is disabled by conf
2017-02-13 09:47:17.614454 7fc57248f800  1 leveldb: Recovering log #6754
2017-02-13 09:47:17.672544 7fc57248f800  1 leveldb: Delete type=3 #6753

2017-02-13 09:47:17.672662 7fc57248f800  1 leveldb: Delete type=0 #6754

2017-02-13 09:47:17.673640 7fc57248f800  0 filestore(/var/lib/ceph/osd/ceph-271) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
2017-02-13 09:47:17.684464 7fc57248f800  0 <cls> cls/hello/cls_hello.cc:305: loading cls_hello
2017-02-13 09:47:17.688815 7fc57248f800  0 <cls> cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan
2017-02-13 09:47:17.694483 7fc57248f800 -1 osd/OSD.h: In function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7fc57248f800 time 2017-02-13 09:47:17.692735
osd/OSD.h: 885: FAILED assert(ret)

 ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x55ea51744dab]
 2: (OSDService::get_map(unsigned int)+0x3d) [0x55ea5114debd]
 3: (OSD::init()+0x1ed2) [0x55ea51103872]
 4: (main()+0x29d1) [0x55ea5106ae41]
 5: (__libc_start_main()+0xf5) [0x7fc56f3b0f45]
 6: (()+0x355b17) [0x55ea510b3b17]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

 

--- begin dump of recent events ---
   -29> 2017-02-13 09:47:17.587145 7fc57248f800  5 asok(0x55ea5d1f8280) register_command perfcounters_dump hook 0x55ea5d1d8050
   -28> 2017-02-13 09:47:17.587164 7fc57248f800  5 asok(0x55ea5d1f8280) register_command 1 hook 0x55ea5d1d8050
   -27> 2017-02-13 09:47:17.587166 7fc57248f800  5 asok(0x55ea5d1f8280) register_command perf dump hook 0x55ea5d1d8050
   -26> 2017-02-13 09:47:17.587168 7fc57248f800  5 asok(0x55ea5d1f8280) register_command perfcounters_schema hook 0x55ea5d1d8050
   -25> 2017-02-13 09:47:17.587170 7fc57248f800  5 asok(0x55ea5d1f8280) register_command 2 hook 0x55ea5d1d8050
   -24> 2017-02-13 09:47:17.587172 7fc57248f800  5 asok(0x55ea5d1f8280) register_command perf schema hook 0x55ea5d1d8050
   -23> 2017-02-13 09:47:17.587174 7fc57248f800  5 asok(0x55ea5d1f8280) register_command perf reset hook 0x55ea5d1d8050
   -22> 2017-02-13 09:47:17.587176 7fc57248f800  5 asok(0x55ea5d1f8280) register_command config show hook 0x55ea5d1d8050
   -21> 2017-02-13 09:47:17.587178 7fc57248f800  5 asok(0x55ea5d1f8280) register_command config set hook 0x55ea5d1d8050
   -20> 2017-02-13 09:47:17.587181 7fc57248f800  5 asok(0x55ea5d1f8280) register_command config get hook 0x55ea5d1d8050
   -19> 2017-02-13 09:47:17.587187 7fc57248f800  5 asok(0x55ea5d1f8280) register_command config diff hook 0x55ea5d1d8050
   -18> 2017-02-13 09:47:17.587189 7fc57248f800  5 asok(0x55ea5d1f8280) register_command log flush hook 0x55ea5d1d8050
   -17> 2017-02-13 09:47:17.587191 7fc57248f800  5 asok(0x55ea5d1f8280) register_command log dump hook 0x55ea5d1d8050
   -16> 2017-02-13 09:47:17.587195 7fc57248f800  5 asok(0x55ea5d1f8280) register_command log reopen hook 0x55ea5d1d8050
   -15> 2017-02-13 09:47:17.590843 7fc57248f800  0 set uid:gid to 1001:1001 (ceph:ceph)
   -14> 2017-02-13 09:47:17.590859 7fc57248f800  0 ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367), process ceph-osd, pid 187128
   -13> 2017-02-13 09:47:17.591356 7fc57248f800  0 pidfile_write: ignore empty --pid-file
   -12> 2017-02-13 09:47:17.601186 7fc57248f800  0 filestore(/var/lib/ceph/osd/ceph-271) backend xfs (magic 0x58465342)
   -11> 2017-02-13 09:47:17.601530 7fc57248f800  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-271) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
   -10> 2017-02-13 09:47:17.601539 7fc57248f800  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-271) detect_features: SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole' config option
    -9> 2017-02-13 09:47:17.601553 7fc57248f800  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-271) detect_features: splice is supported
    -8> 2017-02-13 09:47:17.613611 7fc57248f800  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-271) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
    -7> 2017-02-13 09:47:17.613673 7fc57248f800  0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-271) detect_feature: extsize is disabled by conf
    -6> 2017-02-13 09:47:17.614454 7fc57248f800  1 leveldb: Recovering log #6754
    -5> 2017-02-13 09:47:17.672544 7fc57248f800  1 leveldb: Delete type=3 #6753

    -4> 2017-02-13 09:47:17.672662 7fc57248f800  1 leveldb: Delete type=0 #6754

    -3> 2017-02-13 09:47:17.673640 7fc57248f800  0 filestore(/var/lib/ceph/osd/ceph-271) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
    -2> 2017-02-13 09:47:17.684464 7fc57248f800  0 <cls> cls/hello/cls_hello.cc:305: loading cls_hello
    -1> 2017-02-13 09:47:17.688815 7fc57248f800  0 <cls> cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan
     0> 2017-02-13 09:47:17.694483 7fc57248f800 -1 osd/OSD.h: In function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7fc57248f800 time 2017-02-13 09:47:17.692735
osd/OSD.h: 885: FAILED assert(ret)

 ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x55ea51744dab]
 2: (OSDService::get_map(unsigned int)+0x3d) [0x55ea5114debd]
 3: (OSD::init()+0x1ed2) [0x55ea51103872]
 4: (main()+0x29d1) [0x55ea5106ae41]
 5: (__libc_start_main()+0xf5) [0x7fc56f3b0f45]
 6: (()+0x355b17) [0x55ea510b3b17]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

 

--- logging levels ---
   0/ 5 none
   0/ 0 lockdep
   0/ 0 context
   0/ 0 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 0 buffer
   0/ 0 timer
   0/ 0 filer
   0/ 1 striper
   0/ 0 objecter
   0/ 0 rados
   0/ 0 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 0 journaler
   0/ 5 objectcacher
   0/ 0 client
   0/ 0 osd
   0/ 0 optracker
   0/ 0 objclass
   0/ 0 filestore
   0/ 0 journal
   0/ 0 ms
   0/ 0 mon
   0/ 0 monc
   0/ 0 paxos
   0/ 0 tp
   0/ 0 auth
   1/ 5 crypto
   0/ 0 finisher
   0/ 0 heartbeatmap
   0/ 0 perfcounter
   0/ 0 rgw
   1/10 civetweb
   1/ 5 javaclient
   0/ 0 asok
   0/ 0 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 newstore
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   1/ 5 kinetic
   1/ 5 fuse
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.271.log
--- end dump of recent events ---

2017-02-13 09:47:17.696962 7fc57248f800 -1 *** Caught signal (Aborted) **
 in thread 7fc57248f800 thread_name:ceph-osd

 ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
 1: (()+0x8f2d32) [0x55ea51650d32]
 2: (()+0x10330) [0x7fc571366330]
 3: (gsignal()+0x37) [0x7fc56f3c5c37]
 4: (abort()+0x148) [0x7fc56f3c9028]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x265) [0x55ea51744f85]
 6: (OSDService::get_map(unsigned int)+0x3d) [0x55ea5114debd]
 7: (OSD::init()+0x1ed2) [0x55ea51103872]
 8: (main()+0x29d1) [0x55ea5106ae41]
 9: (__libc_start_main()+0xf5) [0x7fc56f3b0f45]
 10: (()+0x355b17) [0x55ea510b3b17]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

 

--- begin dump of recent events ---
     0> 2017-02-13 09:47:17.696962 7fc57248f800 -1 *** Caught signal (Aborted) **
 in thread 7fc57248f800 thread_name:ceph-osd

 ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
 1: (()+0x8f2d32) [0x55ea51650d32]
 2: (()+0x10330) [0x7fc571366330]
 3: (gsignal()+0x37) [0x7fc56f3c5c37]
 4: (abort()+0x148) [0x7fc56f3c9028]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x265) [0x55ea51744f85]
 6: (OSDService::get_map(unsigned int)+0x3d) [0x55ea5114debd]
 7: (OSD::init()+0x1ed2) [0x55ea51103872]
 8: (main()+0x29d1) [0x55ea5106ae41]
 9: (__libc_start_main()+0xf5) [0x7fc56f3b0f45]
 10: (()+0x355b17) [0x55ea510b3b17]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

 

--- end dump of recent events ---
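
To pull the essentials out of a dump like the one above, a rough Python 3 sketch follows. It assumes the layout seen in this log (a FAILED assert(...) line, then numbered frames, then a NOTE: line); the regular expressions and the function name parse_crash are illustrative assumptions, not a Ceph tool.

```python
import re

# "osd/OSD.h: 885: FAILED assert(ret)"
ASSERT_RE = re.compile(
    r"^(?P<file>\S+): (?P<line>\d+): FAILED assert\((?P<expr>.*)\)$"
)
# " 2: (OSDService::get_map(unsigned int)+0x3d) [0x55ea5114debd]"
FRAME_RE = re.compile(
    r"^(?P<idx>\d+): \((?P<sym>.*)\) \[(?P<addr>0x[0-9a-f]+)\]$"
)


def parse_crash(log_text):
    """Return (assertion, frames) for the first failed assertion in log_text.

    assertion is (source_file, line_number, expression) or None.
    frames is a list of (index, symbol_with_offset, address) taken from the
    backtrace that follows the assertion, up to the "NOTE:" line.
    """
    assertion = None
    frames = []
    for raw in log_text.splitlines():
        line = raw.strip()  # also drops the leading space on frame lines
        if assertion is None:
            match = ASSERT_RE.match(line)
            if match:
                assertion = (
                    match.group("file"),
                    int(match.group("line")),
                    match.group("expr"),
                )
            continue
        if line.startswith("NOTE:"):
            break  # end of the first backtrace
        match = FRAME_RE.match(line)
        if match:
            frames.append(
                (int(match.group("idx")), match.group("sym"), match.group("addr"))
            )
    return assertion, frames
```

Only the first assertion and its backtrace are returned, so the repeated copies inside the "dump of recent events" sections are ignored.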

 

 

Removing the OSD disks, zapping them and recreating them fixes the problem, but I don't think it's a good idea to do this for 2/3 of our 300 OSDs.
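
With around 300 OSDs it may help to list which ones hit the assertion before deciding how to proceed. A minimal sketch in Python 3, assuming the logs follow the ceph-osd.<id>.log naming seen above; osds_with_marker is a hypothetical helper, and a match only means the marker string appears somewhere in that log (possibly from an earlier start attempt).

```python
import re
from pathlib import Path

# File names such as "ceph-osd.271.log"; the number is taken as the OSD id.
LOG_NAME_RE = re.compile(r"^ceph-osd\.(\d+)\.log$")


def osds_with_marker(log_dir, marker="FAILED assert(ret)"):
    """Return sorted OSD ids whose log file in log_dir contains marker."""
    hits = []
    for path in Path(log_dir).iterdir():
        match = LOG_NAME_RE.match(path.name)
        if not match or not path.is_file():
            continue  # skip rotated logs, other daemons, directories
        # Read line by line so large logs are not loaded into memory at once.
        with path.open(errors="replace") as handle:
            if any(marker in line for line in handle):
                hits.append(int(match.group(1)))
    return sorted(hits)


# Hypothetical usage on an OSD host (not executed here):
#   print(osds_with_marker("/var/log/ceph"))
```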

 

Any ideas on:

1. How to avoid the problem during the upgrade
2. How to fix the failed disks while reusing their data

 

Thank you!

 

 

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
