Safely Upgrading OS on a live Ceph Cluster

I am attempting an operating system upgrade of a live Ceph cluster. Before I go and screw up my production system, I have been testing on a smaller installation, and I keep running into issues when bringing the Ceph FS metadata server back online.

My approach has been to store all Ceph-critical files on non-root partitions, so the OS install can safely proceed without overwriting any of the Ceph configuration or data.

Here is how I proceed:

First, I bring down the Ceph FS via `ceph mds cluster_down`.
Second, to prevent the OSDs from trying to repair data, I run `ceph osd set noout`.
Finally, I stop the Ceph processes in the following order: ceph-mds, ceph-mon, ceph-osd.
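
As a minimal sketch, the shutdown sequence above could be written down as an ordered plan and dry-run before touching the cluster. The `ceph` commands are the ones quoted above. `stop_daemon_cmd` is a hypothetical placeholder, since the command that actually stops a daemon depends on the init system in use.

```python
import subprocess

# Cluster-wide steps, using the commands quoted in the steps above.
CLUSTER_STEPS = [
    ["ceph", "mds", "cluster_down"],
    ["ceph", "osd", "set", "noout"],
]

# Daemon stop order as described in the steps above.
STOP_ORDER = ["ceph-mds", "ceph-mon", "ceph-osd"]


def stop_daemon_cmd(daemon):
    # Hypothetical placeholder: substitute the real stop command for
    # your init system (sysvinit, upstart, or systemd).
    return ["echo", "stop", daemon]


def shutdown_plan():
    """Return the shutdown commands in the order described above."""
    return CLUSTER_STEPS + [stop_daemon_cmd(d) for d in STOP_ORDER]


def run(plan, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in plan:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run(shutdown_plan())  # dry run: prints the plan, changes nothing
```

Calling `run(shutdown_plan(), dry_run=False)` would execute the steps and abort on the first command that fails.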

Note that my cluster has 1 MDS, 1 mon, and 7 OSDs.

I then install the new OS and bring the cluster back up by walking through the steps in reverse:

First, I start the Ceph processes in the following order: ceph-osd, ceph-mon, ceph-mds.
Second, I restore OSD functionality with `ceph osd unset noout`.
Finally, I bring up the Ceph FS via `ceph mds cluster_up`.
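
"Walking the steps in reverse" can be checked mechanically. This is only a sketch: it pairs each shutdown step with the step that undoes it, then reverses the list. The `stop ...` and `start ...` strings are illustrative labels, not real commands for any particular init system.

```python
# Each shutdown step paired with the step that undoes it, in shutdown order.
SHUTDOWN_WITH_INVERSE = [
    ("ceph mds cluster_down", "ceph mds cluster_up"),
    ("ceph osd set noout", "ceph osd unset noout"),
    ("stop ceph-mds", "start ceph-mds"),  # label only, not a real command
    ("stop ceph-mon", "start ceph-mon"),  # label only, not a real command
    ("stop ceph-osd", "start ceph-osd"),  # label only, not a real command
]


def startup_plan(steps=SHUTDOWN_WITH_INVERSE):
    """Undo the shutdown steps in reverse order."""
    return [undo for _, undo in reversed(steps)]


if __name__ == "__main__":
    for step in startup_plan():
        print(step)
```

The resulting order is the one listed above: the daemons come up as ceph-osd, ceph-mon, ceph-mds, then `noout` is unset, then the Ceph FS is marked up.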

Everything works smoothly except the Ceph FS bring-up. The MDS starts in the active:replay state and eventually crashes with the following backtrace:

starting mds.cuba at :/0
2017-02-27 16:56:08.233680 7f31daa3b7c0 -1 mds.-1.0 log_to_monitors {default=true}
2017-02-27 16:56:08.537714 7f31d30df700 -1 mds.0.sessionmap _load_finish got (2) No such file or directory
mds/SessionMap.cc: In function 'void SessionMap::_load_finish(int, ceph::bufferlist&)' thread 7f31d30df700 time 2017-02-27 16:56:08.537739
mds/SessionMap.cc: 98: FAILED assert(0 == "failed to load sessionmap")
ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x98bb4b]
2: (SessionMap::_load_finish(int, ceph::buffer::list&)+0x2b4) [0x7df2a4]
3: (MDSIOContextBase::complete(int)+0x95) [0x7e34b5]
4: (Finisher::finisher_thread_entry()+0x190) [0x8bd6d0]
5: (()+0x8192) [0x7f31d9c8f192]
6: (clone()+0x6d) [0x7f31d919c51d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2017-02-27 16:56:08.538493 7f31d30df700 -1 mds/SessionMap.cc: In function 'void SessionMap::_load_finish(int, ceph::bufferlist&)' thread 7f31d30df700 time 2017-02-27 16:56:08.537739
mds/SessionMap.cc: 98: FAILED assert(0 == "failed to load sessionmap")

ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x98bb4b]
2: (SessionMap::_load_finish(int, ceph::buffer::list&)+0x2b4) [0x7df2a4]
3: (MDSIOContextBase::complete(int)+0x95) [0x7e34b5]
4: (Finisher::finisher_thread_entry()+0x190) [0x8bd6d0]
5: (()+0x8192) [0x7f31d9c8f192]
6: (clone()+0x6d) [0x7f31d919c51d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

 -106> 2017-02-27 16:56:08.233680 7f31daa3b7c0 -1 mds.-1.0 log_to_monitors {default=true}
   -1> 2017-02-27 16:56:08.537714 7f31d30df700 -1 mds.0.sessionmap _load_finish got (2) No such file or directory
    0> 2017-02-27 16:56:08.538493 7f31d30df700 -1 mds/SessionMap.cc: In function 'void SessionMap::_load_finish(int, ceph::bufferlist&)' thread 7f31d30df700 time 2017-02-27 16:56:08.537739
mds/SessionMap.cc: 98: FAILED assert(0 == "failed to load sessionmap")

ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x98bb4b]
2: (SessionMap::_load_finish(int, ceph::buffer::list&)+0x2b4) [0x7df2a4]
3: (MDSIOContextBase::complete(int)+0x95) [0x7e34b5]
4: (Finisher::finisher_thread_entry()+0x190) [0x8bd6d0]
5: (()+0x8192) [0x7f31d9c8f192]
6: (clone()+0x6d) [0x7f31d919c51d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

terminate called after throwing an instance of 'ceph::FailedAssertion'
*** Caught signal (Aborted) **
in thread 7f31d30df700
ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
1: ceph_mds() [0x89984a]
2: (()+0x10350) [0x7f31d9c97350]
3: (gsignal()+0x39) [0x7f31d90d8c49]
4: (abort()+0x148) [0x7f31d90dc058]
5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f31d99e3555]
6: (()+0x5e6f6) [0x7f31d99e16f6]
7: (()+0x5e723) [0x7f31d99e1723]
8: (()+0x5e942) [0x7f31d99e1942]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x278) [0x98bd38]
10: (SessionMap::_load_finish(int, ceph::buffer::list&)+0x2b4) [0x7df2a4]
11: (MDSIOContextBase::complete(int)+0x95) [0x7e34b5]
12: (Finisher::finisher_thread_entry()+0x190) [0x8bd6d0]
13: (()+0x8192) [0x7f31d9c8f192]
14: (clone()+0x6d) [0x7f31d919c51d]
2017-02-27 16:56:08.540155 7f31d30df700 -1 *** Caught signal (Aborted) **
in thread 7f31d30df700

ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
1: ceph_mds() [0x89984a]
2: (()+0x10350) [0x7f31d9c97350]
3: (gsignal()+0x39) [0x7f31d90d8c49]
4: (abort()+0x148) [0x7f31d90dc058]
5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f31d99e3555]
6: (()+0x5e6f6) [0x7f31d99e16f6]
7: (()+0x5e723) [0x7f31d99e1723]
8: (()+0x5e942) [0x7f31d99e1942]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x278) [0x98bd38]
10: (SessionMap::_load_finish(int, ceph::buffer::list&)+0x2b4) [0x7df2a4]
11: (MDSIOContextBase::complete(int)+0x95) [0x7e34b5]
12: (Finisher::finisher_thread_entry()+0x190) [0x8bd6d0]
13: (()+0x8192) [0x7f31d9c8f192]
14: (clone()+0x6d) [0x7f31d919c51d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

    0> 2017-02-27 16:56:08.540155 7f31d30df700 -1 *** Caught signal (Aborted) **
in thread 7f31d30df700

ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
1: ceph_mds() [0x89984a]
2: (()+0x10350) [0x7f31d9c97350]
3: (gsignal()+0x39) [0x7f31d90d8c49]
4: (abort()+0x148) [0x7f31d90dc058]
5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f31d99e3555]
6: (()+0x5e6f6) [0x7f31d99e16f6]
7: (()+0x5e723) [0x7f31d99e1723]
8: (()+0x5e942) [0x7f31d99e1942]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x278) [0x98bd38]
10: (SessionMap::_load_finish(int, ceph::buffer::list&)+0x2b4) [0x7df2a4]
11: (MDSIOContextBase::complete(int)+0x95) [0x7e34b5]
12: (Finisher::finisher_thread_entry()+0x190) [0x8bd6d0]
13: (()+0x8192) [0x7f31d9c8f192]
14: (clone()+0x6d) [0x7f31d919c51d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
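
The output above repeats the same failure several times. As a rough sketch, the distinct error lines can be pulled out with a couple of regular expressions. The patterns are assumptions based only on the lines shown here, not on any documented Ceph log format.

```python
import errno
import re

# Patterns derived from the log lines above (assumed, not a Ceph spec).
ASSERT_RE = re.compile(r"FAILED assert\((.+)\)\s*$")
ERRNO_RE = re.compile(r"got \((\d+)\) (.+)$")


def summarize(log_text):
    """Collect the distinct failed assertions and errno messages."""
    asserts, errors = set(), set()
    for line in log_text.splitlines():
        m = ASSERT_RE.search(line)
        if m:
            asserts.add(m.group(1))
        m = ERRNO_RE.search(line)
        if m:
            code = int(m.group(1))
            # Map the numeric code to its symbolic name where known.
            name = errno.errorcode.get(code, "unknown")
            errors.add((code, name, m.group(2).strip()))
    return {"asserts": sorted(asserts), "errors": sorted(errors)}


if __name__ == "__main__":
    import sys
    print(summarize(sys.stdin.read()))
```

Fed the log above, this reduces the output to one assertion ("failed to load sessionmap") and one underlying error, code 2 (ENOENT) from `_load_finish`. In other words, the MDS failed to find something it tried to load while reading its session map.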

How can I safely stop a Ceph cluster, so that it will cleanly start back up again?

-Chris


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
