Re: After upgrading from 0.94.9 to Jewel 10.2.5 on Ubuntu 14.04 OSDs fail to start with a crash dump

Capture a log with debug_osd at 30 (yes, that's correct, 30) and see
if that sheds more light on the issue.
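
As a rough sketch only (assuming osd.271 from the log quoted below and the
usual /etc/ceph/ceph.conf location; adjust both for your setup), the level
can be raised for a single OSD in ceph.conf before trying to start it again:

```ini
# Assumed file: /etc/ceph/ceph.conf on the host running the failing OSD.
# osd.271 is taken from the quoted log; substitute the id of the OSD that fails.
[osd.271]
    debug osd = 30
```

The extra output should land in the same /var/log/ceph/ceph-osd.271.log.
Remove the setting once the log is captured, since level 30 is very verbose.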

On Tue, Feb 14, 2017 at 6:53 AM, Alfredo Colangelo
<acolangelo1@xxxxxxxxx> wrote:
> Hi Ceph experts,
>
> After updating from Ceph 0.94.9 to Ceph 10.2.5 on Ubuntu 14.04, 2 out of 3
> OSD processes are unable to start. On another machine the same happened, but
> only on 1 out of 3 OSDs.
>
> The update procedure is done via ceph-deploy 1.5.37.
>
> It shouldn’t be a permissions problem, because before updating I run chown
> 64045:64045 on the OSD disks /dev/sd[bcd] and on the (separate) journal
> partitions on the SSD, /dev/sda[678].
>
> When the upgrade procedure completes, the 3 ceph-osd processes are still
> running, but if I restart them, some of them refuse to start.
>
>
>
> The log file /var/log/ceph/ceph-osd.271.log is full of errors like this:
>
>
>
> 2017-02-13 09:47:17.590843 7fc57248f800  0 set uid:gid to 1001:1001
> (ceph:ceph)
>
> 2017-02-13 09:47:17.590859 7fc57248f800  0 ceph version 10.2.5
> (c461ee19ecbc0c5c330aca20f7392c9a00730367), process ceph-osd, pid 187128
>
> 2017-02-13 09:47:17.591356 7fc57248f800  0 pidfile_write: ignore empty
> --pid-file
>
> 2017-02-13 09:47:17.601186 7fc57248f800  0
> filestore(/var/lib/ceph/osd/ceph-271) backend xfs (magic 0x58465342)
>
> 2017-02-13 09:47:17.601530 7fc57248f800  0
> genericfilestorebackend(/var/lib/ceph/osd/ceph-271) detect_features: FIEMAP
> ioctl is disabled via 'filestore fiemap' config option
>
> 2017-02-13 09:47:17.601539 7fc57248f800  0
> genericfilestorebackend(/var/lib/ceph/osd/ceph-271) detect_features:
> SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole' config option
>
> 2017-02-13 09:47:17.601553 7fc57248f800  0
> genericfilestorebackend(/var/lib/ceph/osd/ceph-271) detect_features: splice
> is supported
>
> 2017-02-13 09:47:17.613611 7fc57248f800  0
> genericfilestorebackend(/var/lib/ceph/osd/ceph-271) detect_features:
> syncfs(2) syscall fully supported (by glibc and kernel)
>
> 2017-02-13 09:47:17.613673 7fc57248f800  0
> xfsfilestorebackend(/var/lib/ceph/osd/ceph-271) detect_feature: extsize is
> disabled by conf
>
> 2017-02-13 09:47:17.614454 7fc57248f800  1 leveldb: Recovering log #6754
>
> 2017-02-13 09:47:17.672544 7fc57248f800  1 leveldb: Delete type=3 #6753
>
>
>
> 2017-02-13 09:47:17.672662 7fc57248f800  1 leveldb: Delete type=0 #6754
>
>
>
> 2017-02-13 09:47:17.673640 7fc57248f800  0
> filestore(/var/lib/ceph/osd/ceph-271) mount: enabling WRITEAHEAD journal
> mode: checkpoint is not enabled
>
> 2017-02-13 09:47:17.684464 7fc57248f800  0 <cls> cls/hello/cls_hello.cc:305:
> loading cls_hello
>
> 2017-02-13 09:47:17.688815 7fc57248f800  0 <cls>
> cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan
>
> 2017-02-13 09:47:17.694483 7fc57248f800 -1 osd/OSD.h: In function 'OSDMapRef
> OSDService::get_map(epoch_t)' thread 7fc57248f800 time 2017-02-13
> 09:47:17.692735
>
> osd/OSD.h: 885: FAILED assert(ret)
>
>
>
>  ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
>
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x8b) [0x55ea51744dab]
>
>  2: (OSDService::get_map(unsigned int)+0x3d) [0x55ea5114debd]
>
>  3: (OSD::init()+0x1ed2) [0x55ea51103872]
>
>  4: (main()+0x29d1) [0x55ea5106ae41]
>
>  5: (__libc_start_main()+0xf5) [0x7fc56f3b0f45]
>
>  6: (()+0x355b17) [0x55ea510b3b17]
>
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
> interpret this.
>
>
>
> --- begin dump of recent events ---
>
>    -29> 2017-02-13 09:47:17.587145 7fc57248f800  5 asok(0x55ea5d1f8280)
> register_command perfcounters_dump hook 0x55ea5d1d8050
>
>    -28> 2017-02-13 09:47:17.587164 7fc57248f800  5 asok(0x55ea5d1f8280)
> register_command 1 hook 0x55ea5d1d8050
>
>    -27> 2017-02-13 09:47:17.587166 7fc57248f800  5 asok(0x55ea5d1f8280)
> register_command perf dump hook 0x55ea5d1d8050
>
>    -26> 2017-02-13 09:47:17.587168 7fc57248f800  5 asok(0x55ea5d1f8280)
> register_command perfcounters_schema hook 0x55ea5d1d8050
>
>    -25> 2017-02-13 09:47:17.587170 7fc57248f800  5 asok(0x55ea5d1f8280)
> register_command 2 hook 0x55ea5d1d8050
>
>    -24> 2017-02-13 09:47:17.587172 7fc57248f800  5 asok(0x55ea5d1f8280)
> register_command perf schema hook 0x55ea5d1d8050
>
>    -23> 2017-02-13 09:47:17.587174 7fc57248f800  5 asok(0x55ea5d1f8280)
> register_command perf reset hook 0x55ea5d1d8050
>
>    -22> 2017-02-13 09:47:17.587176 7fc57248f800  5 asok(0x55ea5d1f8280)
> register_command config show hook 0x55ea5d1d8050
>
>    -21> 2017-02-13 09:47:17.587178 7fc57248f800  5 asok(0x55ea5d1f8280)
> register_command config set hook 0x55ea5d1d8050
>
>    -20> 2017-02-13 09:47:17.587181 7fc57248f800  5 asok(0x55ea5d1f8280)
> register_command config get hook 0x55ea5d1d8050
>
>    -19> 2017-02-13 09:47:17.587187 7fc57248f800  5 asok(0x55ea5d1f8280)
> register_command config diff hook 0x55ea5d1d8050
>
>    -18> 2017-02-13 09:47:17.587189 7fc57248f800  5 asok(0x55ea5d1f8280)
> register_command log flush hook 0x55ea5d1d8050
>
>    -17> 2017-02-13 09:47:17.587191 7fc57248f800  5 asok(0x55ea5d1f8280)
> register_command log dump hook 0x55ea5d1d8050
>
>    -16> 2017-02-13 09:47:17.587195 7fc57248f800  5 asok(0x55ea5d1f8280)
> register_command log reopen hook 0x55ea5d1d8050
>
>    -15> 2017-02-13 09:47:17.590843 7fc57248f800  0 set uid:gid to 1001:1001
> (ceph:ceph)
>
>    -14> 2017-02-13 09:47:17.590859 7fc57248f800  0 ceph version 10.2.5
> (c461ee19ecbc0c5c330aca20f7392c9a00730367), process ceph-osd, pid 187128
>
>    -13> 2017-02-13 09:47:17.591356 7fc57248f800  0 pidfile_write: ignore
> empty --pid-file
>
>    -12> 2017-02-13 09:47:17.601186 7fc57248f800  0
> filestore(/var/lib/ceph/osd/ceph-271) backend xfs (magic 0x58465342)
>
>    -11> 2017-02-13 09:47:17.601530 7fc57248f800  0
> genericfilestorebackend(/var/lib/ceph/osd/ceph-271) detect_features: FIEMAP
> ioctl is disabled via 'filestore fiemap' config option
>
>    -10> 2017-02-13 09:47:17.601539 7fc57248f800  0
> genericfilestorebackend(/var/lib/ceph/osd/ceph-271) detect_features:
> SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole' config option
>
>     -9> 2017-02-13 09:47:17.601553 7fc57248f800  0
> genericfilestorebackend(/var/lib/ceph/osd/ceph-271) detect_features: splice
> is supported
>
>     -8> 2017-02-13 09:47:17.613611 7fc57248f800  0
> genericfilestorebackend(/var/lib/ceph/osd/ceph-271) detect_features:
> syncfs(2) syscall fully supported (by glibc and kernel)
>
>     -7> 2017-02-13 09:47:17.613673 7fc57248f800  0
> xfsfilestorebackend(/var/lib/ceph/osd/ceph-271) detect_feature: extsize is
> disabled by conf
>
>     -6> 2017-02-13 09:47:17.614454 7fc57248f800  1 leveldb: Recovering log
> #6754
>
>     -5> 2017-02-13 09:47:17.672544 7fc57248f800  1 leveldb: Delete type=3
> #6753
>
>
>
>     -4> 2017-02-13 09:47:17.672662 7fc57248f800  1 leveldb: Delete type=0
> #6754
>
>
>
>     -3> 2017-02-13 09:47:17.673640 7fc57248f800  0
> filestore(/var/lib/ceph/osd/ceph-271) mount: enabling WRITEAHEAD journal
> mode: checkpoint is not enabled
>
>     -2> 2017-02-13 09:47:17.684464 7fc57248f800  0 <cls>
> cls/hello/cls_hello.cc:305: loading cls_hello
>
>     -1> 2017-02-13 09:47:17.688815 7fc57248f800  0 <cls>
> cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan
>
>      0> 2017-02-13 09:47:17.694483 7fc57248f800 -1 osd/OSD.h: In function
> 'OSDMapRef OSDService::get_map(epoch_t)' thread 7fc57248f800 time 2017-02-13
> 09:47:17.692735
>
> osd/OSD.h: 885: FAILED assert(ret)
>
>
>
>  ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
>
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x8b) [0x55ea51744dab]
>
>  2: (OSDService::get_map(unsigned int)+0x3d) [0x55ea5114debd]
>
>  3: (OSD::init()+0x1ed2) [0x55ea51103872]
>
>  4: (main()+0x29d1) [0x55ea5106ae41]
>
>  5: (__libc_start_main()+0xf5) [0x7fc56f3b0f45]
>
>  6: (()+0x355b17) [0x55ea510b3b17]
>
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
> interpret this.
>
>
>
> --- logging levels ---
>
>    0/ 5 none
>
>    0/ 0 lockdep
>
>    0/ 0 context
>
>    0/ 0 crush
>
>    1/ 5 mds
>
>    1/ 5 mds_balancer
>
>    1/ 5 mds_locker
>
>    1/ 5 mds_log
>
>    1/ 5 mds_log_expire
>
>    1/ 5 mds_migrator
>
>    0/ 0 buffer
>
>    0/ 0 timer
>
>    0/ 0 filer
>
>    0/ 1 striper
>
>    0/ 0 objecter
>
>    0/ 0 rados
>
>    0/ 0 rbd
>
>    0/ 5 rbd_mirror
>
>    0/ 5 rbd_replay
>
>    0/ 0 journaler
>
>    0/ 5 objectcacher
>
>    0/ 0 client
>
>    0/ 0 osd
>
>    0/ 0 optracker
>
>    0/ 0 objclass
>
>    0/ 0 filestore
>
>    0/ 0 journal
>
>    0/ 0 ms
>
>    0/ 0 mon
>
>    0/ 0 monc
>
>    0/ 0 paxos
>
>    0/ 0 tp
>
>    0/ 0 auth
>
>    1/ 5 crypto
>
>    0/ 0 finisher
>
>    0/ 0 heartbeatmap
>
>    0/ 0 perfcounter
>
>    0/ 0 rgw
>
>    1/10 civetweb
>
>    1/ 5 javaclient
>
>    0/ 0 asok
>
>    0/ 0 throttle
>
>    0/ 0 refs
>
>    1/ 5 xio
>
>    1/ 5 compressor
>
>    1/ 5 newstore
>
>    1/ 5 bluestore
>
>    1/ 5 bluefs
>
>    1/ 3 bdev
>
>    1/ 5 kstore
>
>    4/ 5 rocksdb
>
>    4/ 5 leveldb
>
>    1/ 5 kinetic
>
>    1/ 5 fuse
>
>   -2/-2 (syslog threshold)
>
>   -1/-1 (stderr threshold)
>
>   max_recent     10000
>
>   max_new         1000
>
>   log_file /var/log/ceph/ceph-osd.271.log
>
> --- end dump of recent events ---
>
> 2017-02-13 09:47:17.696962 7fc57248f800 -1 *** Caught signal (Aborted) **
>
>  in thread 7fc57248f800 thread_name:ceph-osd
>
>
>
>  ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
>
>  1: (()+0x8f2d32) [0x55ea51650d32]
>
>  2: (()+0x10330) [0x7fc571366330]
>
>  3: (gsignal()+0x37) [0x7fc56f3c5c37]
>
>  4: (abort()+0x148) [0x7fc56f3c9028]
>
>  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x265) [0x55ea51744f85]
>
>  6: (OSDService::get_map(unsigned int)+0x3d) [0x55ea5114debd]
>
>  7: (OSD::init()+0x1ed2) [0x55ea51103872]
>
>  8: (main()+0x29d1) [0x55ea5106ae41]
>
>  9: (__libc_start_main()+0xf5) [0x7fc56f3b0f45]
>
>  10: (()+0x355b17) [0x55ea510b3b17]
>
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
> interpret this.
>
>
>
> --- begin dump of recent events ---
>
>      0> 2017-02-13 09:47:17.696962 7fc57248f800 -1 *** Caught signal
> (Aborted) **
>
>  in thread 7fc57248f800 thread_name:ceph-osd
>
>
>
>  ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
>
>  1: (()+0x8f2d32) [0x55ea51650d32]
>
>  2: (()+0x10330) [0x7fc571366330]
>
>  3: (gsignal()+0x37) [0x7fc56f3c5c37]
>
>  4: (abort()+0x148) [0x7fc56f3c9028]
>
>  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x265) [0x55ea51744f85]
>
>  6: (OSDService::get_map(unsigned int)+0x3d) [0x55ea5114debd]
>
>  7: (OSD::init()+0x1ed2) [0x55ea51103872]
>
>  8: (main()+0x29d1) [0x55ea5106ae41]
>
>  9: (__libc_start_main()+0xf5) [0x7fc56f3b0f45]
>
>  10: (()+0x355b17) [0x55ea510b3b17]
>
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
> interpret this.
>
>
>
> --- logging levels ---
>
>    0/ 5 none
>
>    0/ 0 lockdep
>
>    0/ 0 context
>
>    0/ 0 crush
>
>    1/ 5 mds
>
>    1/ 5 mds_balancer
>
>    1/ 5 mds_locker
>
>    1/ 5 mds_log
>
>    1/ 5 mds_log_expire
>
>    1/ 5 mds_migrator
>
>    0/ 0 buffer
>
>    0/ 0 timer
>
>    0/ 0 filer
>
>    0/ 1 striper
>
>    0/ 0 objecter
>
>    0/ 0 rados
>
>    0/ 0 rbd
>
>    0/ 5 rbd_mirror
>
>    0/ 5 rbd_replay
>
>    0/ 0 journaler
>
>    0/ 5 objectcacher
>
>    0/ 0 client
>
>    0/ 0 osd
>
>    0/ 0 optracker
>
>    0/ 0 objclass
>
>    0/ 0 filestore
>
>    0/ 0 journal
>
>    0/ 0 ms
>
>    0/ 0 mon
>
>    0/ 0 monc
>
>    0/ 0 paxos
>
>    0/ 0 tp
>
>    0/ 0 auth
>
>    1/ 5 crypto
>
>    0/ 0 finisher
>
>    0/ 0 heartbeatmap
>
>    0/ 0 perfcounter
>
>    0/ 0 rgw
>
>    1/10 civetweb
>
>    1/ 5 javaclient
>
>    0/ 0 asok
>
>    0/ 0 throttle
>
>    0/ 0 refs
>
>    1/ 5 xio
>
>    1/ 5 compressor
>
>    1/ 5 newstore
>
>    1/ 5 bluestore
>
>    1/ 5 bluefs
>
>    1/ 3 bdev
>
>    1/ 5 kstore
>
>    4/ 5 rocksdb
>
>    4/ 5 leveldb
>
>    1/ 5 kinetic
>
>    1/ 5 fuse
>
>   -2/-2 (syslog threshold)
>
>   -1/-1 (stderr threshold)
>
>   max_recent     10000
>
>   max_new         1000
>
>   log_file /var/log/ceph/ceph-osd.271.log
>
> --- end dump of recent events ---
>
>
>
>
>
> Removing the osd disks, zapping and recreating them fixes the problem, but I
> don’t think it’s a good idea to do this for 2/3 of our 300 OSDs.
>
>
>
> Any ideas on:
>
> 1. How to avoid the problem during the update
>
> 2. How to fix the failed disks while reusing the data
>
>
>
> Thank you!
>
>
>
>
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Cheers,
Brad