I am attempting an operating system upgrade of a live Ceph cluster. Before I go and screw up my production system, I have been testing on a smaller installation, and I keep running into issues when bringing the Ceph FS metadata server back online.
My approach has been to store all Ceph-critical files on non-root partitions, so that the OS install can proceed without overwriting any of the Ceph configuration or data.
First I bring down the Ceph FS via `ceph mds cluster_down`.
Then I stop the Ceph processes in the following order: ceph-mds, ceph-mon, ceph-osd.
Note that my cluster has 1 MDS, 1 mon, and 7 OSDs.
I then install the new OS and bring the cluster back up by walking the steps in reverse:
First I start the Ceph processes in the following order: ceph-osd, ceph-mon, ceph-mds.
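To make the sequence unambiguous, here is a minimal sketch of it in Python. It only builds and prints the ordered steps; it does not run anything. The `stop <daemon>` and `start <daemon>` strings are placeholders I made up for illustration, since the real invocation depends on the init system, and the function names are hypothetical. Only `ceph mds cluster_down` is the actual command I run.

```python
# Daemon types in the order I stop them; startup walks the list in reverse.
STOP_ORDER = ["ceph-mds", "ceph-mon", "ceph-osd"]


def shutdown_plan():
    """Return the shutdown steps, in order, as human-readable strings."""
    # The Ceph FS is taken down first, before any daemon is stopped.
    steps = ["ceph mds cluster_down"]
    # "stop <daemon>" is a placeholder, not a real command: the actual
    # invocation depends on the init system of the host.
    steps += [f"stop {daemon}" for daemon in STOP_ORDER]
    return steps


def startup_plan():
    """Return the startup steps: the same daemons, in reverse order."""
    # "start <daemon>" is a placeholder as well.
    return [f"start {daemon}" for daemon in reversed(STOP_ORDER)]


if __name__ == "__main__":
    for step in shutdown_plan() + ["<install new OS>"] + startup_plan():
        print(step)
```

The startup plan lists only the daemon starts described above; nothing else is implied by the sketch.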
Everything works smoothly except bringing the Ceph FS back up. The MDS starts in the active:replay state and eventually crashes with the following backtrace:
starting mds.cuba at :/0
2017-02-27 16:56:08.233680 7f31daa3b7c0 -1 mds.-1.0 log_to_monitors {default=true}
2017-02-27 16:56:08.537714 7f31d30df700 -1 mds.0.sessionmap _load_finish got (2) No such file or directory
mds/SessionMap.cc: In function 'void SessionMap::_load_finish(int, ceph::bufferlist&)' thread 7f31d30df700 time 2017-02-27 16:56:08.537739
mds/SessionMap.cc: 98: FAILED assert(0 == "failed to load sessionmap")
ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x98bb4b]
2: (SessionMap::_load_finish(int, ceph::buffer::list&)+0x2b4) [0x7df2a4]
3: (MDSIOContextBase::complete(int)+0x95) [0x7e34b5]
4: (Finisher::finisher_thread_entry()+0x190) [0x8bd6d0]
5: (()+0x8192) [0x7f31d9c8f192]
6: (clone()+0x6d) [0x7f31d919c51d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2017-02-27 16:56:08.538493 7f31d30df700 -1 mds/SessionMap.cc: In function 'void SessionMap::_load_finish(int, ceph::bufferlist&)' thread 7f31d30df700 time 2017-02-27 16:56:08.537739
mds/SessionMap.cc: 98: FAILED assert(0 == "failed to load sessionmap")
ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x98bb4b]
2: (SessionMap::_load_finish(int, ceph::buffer::list&)+0x2b4) [0x7df2a4]
3: (MDSIOContextBase::complete(int)+0x95) [0x7e34b5]
4: (Finisher::finisher_thread_entry()+0x190) [0x8bd6d0]
5: (()+0x8192) [0x7f31d9c8f192]
6: (clone()+0x6d) [0x7f31d919c51d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
-106> 2017-02-27 16:56:08.233680 7f31daa3b7c0 -1 mds.-1.0 log_to_monitors {default=true}
-1> 2017-02-27 16:56:08.537714 7f31d30df700 -1 mds.0.sessionmap _load_finish got (2) No such file or directory
0> 2017-02-27 16:56:08.538493 7f31d30df700 -1 mds/SessionMap.cc: In function 'void SessionMap::_load_finish(int, ceph::bufferlist&)' thread 7f31d30df700 time 2017-02-27 16:56:08.537739
mds/SessionMap.cc: 98: FAILED assert(0 == "failed to load sessionmap")
ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x98bb4b]
2: (SessionMap::_load_finish(int, ceph::buffer::list&)+0x2b4) [0x7df2a4]
3: (MDSIOContextBase::complete(int)+0x95) [0x7e34b5]
4: (Finisher::finisher_thread_entry()+0x190) [0x8bd6d0]
5: (()+0x8192) [0x7f31d9c8f192]
6: (clone()+0x6d) [0x7f31d919c51d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
terminate called after throwing an instance of 'ceph::FailedAssertion'
*** Caught signal (Aborted) **
in thread 7f31d30df700
ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
1: ceph_mds() [0x89984a]
2: (()+0x10350) [0x7f31d9c97350]
3: (gsignal()+0x39) [0x7f31d90d8c49]
4: (abort()+0x148) [0x7f31d90dc058]
5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f31d99e3555]
6: (()+0x5e6f6) [0x7f31d99e16f6]
7: (()+0x5e723) [0x7f31d99e1723]
8: (()+0x5e942) [0x7f31d99e1942]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x278) [0x98bd38]
10: (SessionMap::_load_finish(int, ceph::buffer::list&)+0x2b4) [0x7df2a4]
11: (MDSIOContextBase::complete(int)+0x95) [0x7e34b5]
12: (Finisher::finisher_thread_entry()+0x190) [0x8bd6d0]
13: (()+0x8192) [0x7f31d9c8f192]
14: (clone()+0x6d) [0x7f31d919c51d]
2017-02-27 16:56:08.540155 7f31d30df700 -1 *** Caught signal (Aborted) **
in thread 7f31d30df700
ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
1: ceph_mds() [0x89984a]
2: (()+0x10350) [0x7f31d9c97350]
3: (gsignal()+0x39) [0x7f31d90d8c49]
4: (abort()+0x148) [0x7f31d90dc058]
5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f31d99e3555]
6: (()+0x5e6f6) [0x7f31d99e16f6]
7: (()+0x5e723) [0x7f31d99e1723]
8: (()+0x5e942) [0x7f31d99e1942]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x278) [0x98bd38]
10: (SessionMap::_load_finish(int, ceph::buffer::list&)+0x2b4) [0x7df2a4]
11: (MDSIOContextBase::complete(int)+0x95) [0x7e34b5]
12: (Finisher::finisher_thread_entry()+0x190) [0x8bd6d0]
13: (()+0x8192) [0x7f31d9c8f192]
14: (clone()+0x6d) [0x7f31d919c51d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
0> 2017-02-27 16:56:08.540155 7f31d30df700 -1 *** Caught signal (Aborted) **
in thread 7f31d30df700
ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
1: ceph_mds() [0x89984a]
2: (()+0x10350) [0x7f31d9c97350]
3: (gsignal()+0x39) [0x7f31d90d8c49]
4: (abort()+0x148) [0x7f31d90dc058]
5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f31d99e3555]
6: (()+0x5e6f6) [0x7f31d99e16f6]
7: (()+0x5e723) [0x7f31d99e1723]
8: (()+0x5e942) [0x7f31d99e1942]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x278) [0x98bd38]
10: (SessionMap::_load_finish(int, ceph::buffer::list&)+0x2b4) [0x7df2a4]
11: (MDSIOContextBase::complete(int)+0x95) [0x7e34b5]
12: (Finisher::finisher_thread_entry()+0x190) [0x8bd6d0]
13: (()+0x8192) [0x7f31d9c8f192]
14: (clone()+0x6d) [0x7f31d919c51d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
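Since the log repeats the same failure several times, here is a small Python sketch that boils it down to the distinct error reports and failed assertions. The regular expressions are my own guesses based only on the lines shown above, not an official Ceph log format, and `summarize` and `SAMPLE` are hypothetical names.

```python
import errno
import re

# Matches lines such as:
#   ... -1 mds.0.sessionmap _load_finish got (2) No such file or directory
ERRNO_RE = re.compile(r"(\S+) _load_finish got \((\d+)\) (.+)$")
# Matches lines such as:
#   mds/SessionMap.cc: 98: FAILED assert(0 == "failed to load sessionmap")
ASSERT_RE = re.compile(r"(\S+): (\d+): FAILED assert\((.+)\)$")

# Three lines copied from the log above (one of them appears twice,
# once with the "-1>" prefix of the recent-events dump).
SAMPLE = [
    "2017-02-27 16:56:08.537714 7f31d30df700 -1 mds.0.sessionmap "
    "_load_finish got (2) No such file or directory",
    'mds/SessionMap.cc: 98: FAILED assert(0 == "failed to load sessionmap")',
    "-1> 2017-02-27 16:56:08.537714 7f31d30df700 -1 mds.0.sessionmap "
    "_load_finish got (2) No such file or directory",
]


def summarize(log_lines):
    """Collect the distinct errno reports and failed assertions from a log."""
    errors, asserts = [], []
    for line in log_lines:
        line = line.rstrip()
        m = ERRNO_RE.search(line)
        if m:
            code = int(m.group(2))
            # Translate the numeric code to its symbolic name, e.g. ENOENT.
            item = (m.group(1), code, errno.errorcode.get(code, "?"), m.group(3))
            if item not in errors:
                errors.append(item)
        m = ASSERT_RE.search(line)
        if m:
            item = (m.group(1), int(m.group(2)), m.group(3))
            if item not in asserts:
                asserts.append(item)
    return errors, asserts


if __name__ == "__main__":
    found_errors, found_asserts = summarize(SAMPLE)
    for entry in found_errors + found_asserts:
        print(entry)
```

Run over the full log, this should leave just two distinct facts: the sessionmap load returned error 2 (ENOENT), and that tripped the assertion at line 98 of SessionMap.cc.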
How can I safely stop a Ceph cluster, so that it will cleanly start back up again?