Hi Ceph experts,

after updating from Ceph 0.94.9 to Ceph 10.2.5 on Ubuntu 14.04, 2 out of 3 OSD processes on one machine are unable to start. On another machine the same happened, but only to 1 out of 3 OSDs. The update is done via ceph-deploy 1.5.37. It shouldn't be a permissions problem, because before updating I run chown 64045:64045 on the OSD disks /dev/sd[bcd] and on the (separate) journal partitions on the SSD, /dev/sda[678]. When the upgrade procedure completes, the 3 ceph-osd processes are still running, but if I restart them, some of them refuse to start.

/var/log/ceph/ceph-osd.271.log is full of errors like this:
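As a sanity check on the permissions assumption, the numeric IDs passed to chown can be compared with the uid:gid that ceph-osd reports when it drops privileges (the "set uid:gid to ..." line in the OSD log). The following is only a minimal Python sketch: the function names are hypothetical, and it assumes the log line has the same shape as in the excerpt from ceph-osd.271.log.

```python
import re

# Matches the startup line printed by ceph-osd, e.g.
# "... 0 set uid:gid to 1001:1001 (ceph:ceph)"
UID_GID_RE = re.compile(r"set uid:gid to (\d+):(\d+) \((\w+):(\w+)\)")


def runtime_ids(log_text):
    """Return (uid, gid) from the first 'set uid:gid' line, or None."""
    m = UID_GID_RE.search(log_text)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))


def ownership_matches(log_text, chown_uid, chown_gid):
    """True only if the daemon's runtime IDs equal the IDs used in chown."""
    ids = runtime_ids(log_text)
    return ids is not None and ids == (chown_uid, chown_gid)


# Log line taken from the OSD log in this post.
sample = (
    "2017-02-13 09:47:17.590843 7fc57248f800 0 "
    "set uid:gid to 1001:1001 (ceph:ceph)"
)
print(runtime_ids(sample))
print(ownership_matches(sample, 64045, 64045))
```

For the log line in this post, the runtime IDs are 1001:1001 while the chown used 64045:64045, so the two do not match on this host. Whether that is related to the assert is not established here; it is just something that can be ruled in or out.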
2017-02-13 09:47:17.590843 7fc57248f800 0 set uid:gid to 1001:1001 (ceph:ceph)
2017-02-13 09:47:17.590859 7fc57248f800 0 ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367), process ceph-osd, pid 187128
2017-02-13 09:47:17.591356 7fc57248f800 0 pidfile_write: ignore empty --pid-file
2017-02-13 09:47:17.601186 7fc57248f800 0 filestore(/var/lib/ceph/osd/ceph-271) backend xfs (magic 0x58465342)
2017-02-13 09:47:17.601530 7fc57248f800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-271) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2017-02-13 09:47:17.601539 7fc57248f800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-271) detect_features: SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole' config option
2017-02-13 09:47:17.601553 7fc57248f800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-271) detect_features: splice is supported
2017-02-13 09:47:17.613611 7fc57248f800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-271) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2017-02-13 09:47:17.613673 7fc57248f800 0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-271) detect_feature: extsize is disabled by conf
2017-02-13 09:47:17.614454 7fc57248f800 1 leveldb: Recovering log #6754
2017-02-13 09:47:17.672544 7fc57248f800 1 leveldb: Delete type=3 #6753
2017-02-13 09:47:17.672662 7fc57248f800 1 leveldb: Delete type=0 #6754
2017-02-13 09:47:17.673640 7fc57248f800 0 filestore(/var/lib/ceph/osd/ceph-271) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
2017-02-13 09:47:17.684464 7fc57248f800 0 <cls> cls/hello/cls_hello.cc:305: loading cls_hello
2017-02-13 09:47:17.688815 7fc57248f800 0 <cls> cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan
2017-02-13 09:47:17.694483 7fc57248f800 -1 osd/OSD.h: In function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7fc57248f800 time 2017-02-13 09:47:17.692735
osd/OSD.h: 885: FAILED assert(ret)

ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x55ea51744dab]
2: (OSDService::get_map(unsigned int)+0x3d) [0x55ea5114debd]
3: (OSD::init()+0x1ed2) [0x55ea51103872]
4: (main()+0x29d1) [0x55ea5106ae41]
5: (__libc_start_main()+0xf5) [0x7fc56f3b0f45]
6: (()+0x355b17) [0x55ea510b3b17]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
-29> 2017-02-13 09:47:17.587145 7fc57248f800 5 asok(0x55ea5d1f8280) register_command perfcounters_dump hook 0x55ea5d1d8050
-28> 2017-02-13 09:47:17.587164 7fc57248f800 5 asok(0x55ea5d1f8280) register_command 1 hook 0x55ea5d1d8050
-27> 2017-02-13 09:47:17.587166 7fc57248f800 5 asok(0x55ea5d1f8280) register_command perf dump hook 0x55ea5d1d8050
-26> 2017-02-13 09:47:17.587168 7fc57248f800 5 asok(0x55ea5d1f8280) register_command perfcounters_schema hook 0x55ea5d1d8050
-25> 2017-02-13 09:47:17.587170 7fc57248f800 5 asok(0x55ea5d1f8280) register_command 2 hook 0x55ea5d1d8050
-24> 2017-02-13 09:47:17.587172 7fc57248f800 5 asok(0x55ea5d1f8280) register_command perf schema hook 0x55ea5d1d8050
-23> 2017-02-13 09:47:17.587174 7fc57248f800 5 asok(0x55ea5d1f8280) register_command perf reset hook 0x55ea5d1d8050
-22> 2017-02-13 09:47:17.587176 7fc57248f800 5 asok(0x55ea5d1f8280) register_command config show hook 0x55ea5d1d8050
-21> 2017-02-13 09:47:17.587178 7fc57248f800 5 asok(0x55ea5d1f8280) register_command config set hook 0x55ea5d1d8050
-20> 2017-02-13 09:47:17.587181 7fc57248f800 5 asok(0x55ea5d1f8280) register_command config get hook 0x55ea5d1d8050
-19> 2017-02-13 09:47:17.587187 7fc57248f800 5 asok(0x55ea5d1f8280) register_command config diff hook 0x55ea5d1d8050
-18> 2017-02-13 09:47:17.587189 7fc57248f800 5 asok(0x55ea5d1f8280) register_command log flush hook 0x55ea5d1d8050
-17> 2017-02-13 09:47:17.587191 7fc57248f800 5 asok(0x55ea5d1f8280) register_command log dump hook 0x55ea5d1d8050
-16> 2017-02-13 09:47:17.587195 7fc57248f800 5 asok(0x55ea5d1f8280) register_command log reopen hook 0x55ea5d1d8050
-15> 2017-02-13 09:47:17.590843 7fc57248f800 0 set uid:gid to 1001:1001 (ceph:ceph)
-14> 2017-02-13 09:47:17.590859 7fc57248f800 0 ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367), process ceph-osd, pid 187128
-13> 2017-02-13 09:47:17.591356 7fc57248f800 0 pidfile_write: ignore empty --pid-file
-12> 2017-02-13 09:47:17.601186 7fc57248f800 0 filestore(/var/lib/ceph/osd/ceph-271) backend xfs (magic 0x58465342)
-11> 2017-02-13 09:47:17.601530 7fc57248f800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-271) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
-10> 2017-02-13 09:47:17.601539 7fc57248f800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-271) detect_features: SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole' config option
-9> 2017-02-13 09:47:17.601553 7fc57248f800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-271) detect_features: splice is supported
-8> 2017-02-13 09:47:17.613611 7fc57248f800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-271) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
-7> 2017-02-13 09:47:17.613673 7fc57248f800 0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-271) detect_feature: extsize is disabled by conf
-6> 2017-02-13 09:47:17.614454 7fc57248f800 1 leveldb: Recovering log #6754
-5> 2017-02-13 09:47:17.672544 7fc57248f800 1 leveldb: Delete type=3 #6753
-4> 2017-02-13 09:47:17.672662 7fc57248f800 1 leveldb: Delete type=0 #6754
-3> 2017-02-13 09:47:17.673640 7fc57248f800 0 filestore(/var/lib/ceph/osd/ceph-271) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
-2> 2017-02-13 09:47:17.684464 7fc57248f800 0 <cls> cls/hello/cls_hello.cc:305: loading cls_hello
-1> 2017-02-13 09:47:17.688815 7fc57248f800 0 <cls> cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan
0> 2017-02-13 09:47:17.694483 7fc57248f800 -1 osd/OSD.h: In function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7fc57248f800 time 2017-02-13 09:47:17.692735
osd/OSD.h: 885: FAILED assert(ret)

ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x55ea51744dab]
2: (OSDService::get_map(unsigned int)+0x3d) [0x55ea5114debd]
3: (OSD::init()+0x1ed2) [0x55ea51103872]
4: (main()+0x29d1) [0x55ea5106ae41]
5: (__libc_start_main()+0xf5) [0x7fc56f3b0f45]
6: (()+0x355b17) [0x55ea510b3b17]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
0/ 5 none 0/ 0 lockdep 0/ 0 context 0/ 0 crush 1/ 5 mds 1/ 5 mds_balancer 1/ 5 mds_locker 1/ 5 mds_log 1/ 5 mds_log_expire 1/ 5 mds_migrator 0/ 0 buffer 0/ 0 timer 0/ 0 filer 0/ 1 striper 0/ 0 objecter 0/ 0 rados 0/ 0 rbd 0/ 5 rbd_mirror 0/ 5 rbd_replay 0/ 0 journaler 0/ 5 objectcacher 0/ 0 client 0/ 0 osd 0/ 0 optracker 0/ 0 objclass 0/ 0 filestore 0/ 0 journal 0/ 0 ms 0/ 0 mon 0/ 0 monc 0/ 0 paxos 0/ 0 tp 0/ 0 auth 1/ 5 crypto 0/ 0 finisher 0/ 0 heartbeatmap 0/ 0 perfcounter 0/ 0 rgw 1/10 civetweb 1/ 5 javaclient 0/ 0 asok 0/ 0 throttle 0/ 0 refs 1/ 5 xio 1/ 5 compressor 1/ 5 newstore 1/ 5 bluestore 1/ 5 bluefs 1/ 3 bdev 1/ 5 kstore 4/ 5 rocksdb 4/ 5 leveldb 1/ 5 kinetic 1/ 5 fuse
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-osd.271.log
--- end dump of recent events ---
2017-02-13 09:47:17.696962 7fc57248f800 -1 *** Caught signal (Aborted) **
in thread 7fc57248f800 thread_name:ceph-osd

ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
1: (()+0x8f2d32) [0x55ea51650d32]
2: (()+0x10330) [0x7fc571366330]
3: (gsignal()+0x37) [0x7fc56f3c5c37]
4: (abort()+0x148) [0x7fc56f3c9028]
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x265) [0x55ea51744f85]
6: (OSDService::get_map(unsigned int)+0x3d) [0x55ea5114debd]
7: (OSD::init()+0x1ed2) [0x55ea51103872]
8: (main()+0x29d1) [0x55ea5106ae41]
9: (__libc_start_main()+0xf5) [0x7fc56f3b0f45]
10: (()+0x355b17) [0x55ea510b3b17]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
0> 2017-02-13 09:47:17.696962 7fc57248f800 -1 *** Caught signal (Aborted) **
in thread 7fc57248f800 thread_name:ceph-osd

ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
1: (()+0x8f2d32) [0x55ea51650d32]
2: (()+0x10330) [0x7fc571366330]
3: (gsignal()+0x37) [0x7fc56f3c5c37]
4: (abort()+0x148) [0x7fc56f3c9028]
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x265) [0x55ea51744f85]
6: (OSDService::get_map(unsigned int)+0x3d) [0x55ea5114debd]
7: (OSD::init()+0x1ed2) [0x55ea51103872]
8: (main()+0x29d1) [0x55ea5106ae41]
9: (__libc_start_main()+0xf5) [0x7fc56f3b0f45]
10: (()+0x355b17) [0x55ea510b3b17]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
0/ 5 none 0/ 0 lockdep 0/ 0 context 0/ 0 crush 1/ 5 mds 1/ 5 mds_balancer 1/ 5 mds_locker 1/ 5 mds_log 1/ 5 mds_log_expire 1/ 5 mds_migrator 0/ 0 buffer 0/ 0 timer 0/ 0 filer 0/ 1 striper 0/ 0 objecter 0/ 0 rados 0/ 0 rbd 0/ 5 rbd_mirror 0/ 5 rbd_replay 0/ 0 journaler 0/ 5 objectcacher 0/ 0 client 0/ 0 osd 0/ 0 optracker 0/ 0 objclass 0/ 0 filestore 0/ 0 journal 0/ 0 ms 0/ 0 mon 0/ 0 monc 0/ 0 paxos 0/ 0 tp 0/ 0 auth 1/ 5 crypto 0/ 0 finisher 0/ 0 heartbeatmap 0/ 0 perfcounter 0/ 0 rgw 1/10 civetweb 1/ 5 javaclient 0/ 0 asok 0/ 0 throttle 0/ 0 refs 1/ 5 xio 1/ 5 compressor 1/ 5 newstore 1/ 5 bluestore 1/ 5 bluefs 1/ 3 bdev 1/ 5 kstore 4/ 5 rocksdb 4/ 5 leveldb 1/ 5 kinetic 1/ 5 fuse
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-osd.271.log
--- end dump of recent events ---
Removing the osd disks, zapping and recreating them fixes the problem, but I don’t think it’s a good idea to do this for 2/3 of our 300 OSDs.
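With this many OSDs it may help to confirm that every failing daemon hits the same assert before deciding on a common fix. The following is a rough Python sketch for pulling the failed assert and the backtrace symbols out of an OSD log. The function names are hypothetical, and the regular expressions assume the log format shown in the excerpt above; other Ceph versions may format these lines differently.

```python
import re

# "osd/OSD.h: 885: FAILED assert(ret)" -> file, line, expression.
# Note: the lazy match stops at the first ')' so nested parentheses
# inside the assert expression would be cut short.
ASSERT_RE = re.compile(r"(\S+): (\d+): FAILED assert\((.*?)\)")

# "2: (OSDService::get_map(unsigned int)+0x3d) [0x55ea5114debd]"
FRAME_RE = re.compile(r"(\d+): \((.*?)\+0x[0-9a-f]+\) \[0x[0-9a-f]+\]")


def failed_assert(log_text):
    """Return (source_file, line_number, expression) or None."""
    m = ASSERT_RE.search(log_text)
    if m is None:
        return None
    return m.group(1), int(m.group(2)), m.group(3)


def backtrace_symbols(log_text):
    """Return the symbol names of all backtrace frames, in order."""
    return [name for _, name in FRAME_RE.findall(log_text)]


# Excerpt from the log in this post, joined onto one line.
sample = (
    "osd/OSD.h: 885: FAILED assert(ret) "
    "ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367) "
    "1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x55ea51744dab] "
    "2: (OSDService::get_map(unsigned int)+0x3d) [0x55ea5114debd] "
    "3: (OSD::init()+0x1ed2) [0x55ea51103872]"
)
print(failed_assert(sample))
print(backtrace_symbols(sample)[1:])
```

Running the same two functions over each /var/log/ceph/ceph-osd.*.log file and grouping by the returned tuple would show whether all failed OSDs stop in OSDService::get_map during OSD::init, as OSD 271 does.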
Any ideas on:
1. How to avoid the problem during the update?
2. How to fix the failed disks while reusing their data?
Thank you!
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com