Hello,

some time ago I upgraded our 6-node cluster (0.94.9), running on Ubuntu, from Trusty to Xenial.

The problem was that the OS upgrade also upgrades Ceph, which we did not want to happen in the same step, because we would then have had to upgrade all nodes at the same time.

We therefore went node by node. First we drained the OSDs on the node by setting their weight to 0. After the OS upgrade we configured the right Ceph version for our setup and tested a reboot to make sure all components started correctly. Then we set the OSD weights back to their normal values, so that the cluster rebalanced.

With this procedure the cluster stayed up the whole time.

Regards
Steffen

>>> "Heller, Chris" <cheller@xxxxxxxxxx> wrote on Monday, 27 February 2017 at 18:01:
> I am attempting an operating system upgrade of a live Ceph cluster. Before I
> go and screw up my production system, I have been testing on a smaller
> installation, and I keep running into issues when bringing the Ceph FS
> metadata server online.
>
> My approach here has been to store all Ceph critical files on non-root
> partitions, so the OS install can safely proceed without overwriting any of
> the Ceph configuration or data.
>
> Here is how I proceed:
>
> First I bring down the Ceph FS via `ceph mds cluster_down`.
> Second, to prevent OSDs from trying to repair data, I run `ceph osd set noout`
> Finally I stop the ceph processes in the following order: ceph-mds, ceph-mon,
> ceph-osd
>
> Note my cluster has 1 mds and 1 mon, and 7 osd.
>
> I then install the new OS and then bring the cluster back up by walking the
> steps in reverse:
>
> First I start the ceph processes in the following order: ceph-osd, ceph-mon,
> ceph-mds
> Second I restore OSD functionality with `ceph osd unset noout`
> Finally I bring up the Ceph FS via `ceph mds cluster_up`
>
> Everything works smoothly except the Ceph FS bring up.
> The MDS starts in the active:replay state and eventually crashes with the
> following backtrace:
>
> starting mds.cuba at :/0
> 2017-02-27 16:56:08.233680 7f31daa3b7c0 -1 mds.-1.0 log_to_monitors {default=true}
> 2017-02-27 16:56:08.537714 7f31d30df700 -1 mds.0.sessionmap _load_finish got (2) No such file or directory
> mds/SessionMap.cc: In function 'void SessionMap::_load_finish(int, ceph::bufferlist&)' thread 7f31d30df700 time 2017-02-27 16:56:08.537739
> mds/SessionMap.cc: 98: FAILED assert(0 == "failed to load sessionmap")
> ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x98bb4b]
> 2: (SessionMap::_load_finish(int, ceph::buffer::list&)+0x2b4) [0x7df2a4]
> 3: (MDSIOContextBase::complete(int)+0x95) [0x7e34b5]
> 4: (Finisher::finisher_thread_entry()+0x190) [0x8bd6d0]
> 5: (()+0x8192) [0x7f31d9c8f192]
> 6: (clone()+0x6d) [0x7f31d919c51d]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> 2017-02-27 16:56:08.538493 7f31d30df700 -1 mds/SessionMap.cc: In function 'void SessionMap::_load_finish(int, ceph::bufferlist&)' thread 7f31d30df700 time 2017-02-27 16:56:08.537739
> mds/SessionMap.cc: 98: FAILED assert(0 == "failed to load sessionmap")
>
> ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x98bb4b]
> 2: (SessionMap::_load_finish(int, ceph::buffer::list&)+0x2b4) [0x7df2a4]
> 3: (MDSIOContextBase::complete(int)+0x95) [0x7e34b5]
> 4: (Finisher::finisher_thread_entry()+0x190) [0x8bd6d0]
> 5: (()+0x8192) [0x7f31d9c8f192]
> 6: (clone()+0x6d) [0x7f31d919c51d]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> -106> 2017-02-27 16:56:08.233680 7f31daa3b7c0 -1 mds.-1.0 log_to_monitors {default=true}
> -1> 2017-02-27 16:56:08.537714 7f31d30df700 -1 mds.0.sessionmap _load_finish got (2) No such file or directory
> 0> 2017-02-27 16:56:08.538493 7f31d30df700 -1 mds/SessionMap.cc: In function 'void SessionMap::_load_finish(int, ceph::bufferlist&)' thread 7f31d30df700 time 2017-02-27 16:56:08.537739
> mds/SessionMap.cc: 98: FAILED assert(0 == "failed to load sessionmap")
>
> ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x98bb4b]
> 2: (SessionMap::_load_finish(int, ceph::buffer::list&)+0x2b4) [0x7df2a4]
> 3: (MDSIOContextBase::complete(int)+0x95) [0x7e34b5]
> 4: (Finisher::finisher_thread_entry()+0x190) [0x8bd6d0]
> 5: (()+0x8192) [0x7f31d9c8f192]
> 6: (clone()+0x6d) [0x7f31d919c51d]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> terminate called after throwing an instance of 'ceph::FailedAssertion'
> *** Caught signal (Aborted) **
> in thread 7f31d30df700
> ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
> 1: ceph_mds() [0x89984a]
> 2: (()+0x10350) [0x7f31d9c97350]
> 3: (gsignal()+0x39) [0x7f31d90d8c49]
> 4: (abort()+0x148) [0x7f31d90dc058]
> 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f31d99e3555]
> 6: (()+0x5e6f6) [0x7f31d99e16f6]
> 7: (()+0x5e723) [0x7f31d99e1723]
> 8: (()+0x5e942) [0x7f31d99e1942]
> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x278) [0x98bd38]
> 10: (SessionMap::_load_finish(int, ceph::buffer::list&)+0x2b4) [0x7df2a4]
> 11: (MDSIOContextBase::complete(int)+0x95) [0x7e34b5]
> 12: (Finisher::finisher_thread_entry()+0x190) [0x8bd6d0]
> 13: (()+0x8192) [0x7f31d9c8f192]
> 14: (clone()+0x6d) [0x7f31d919c51d]
> 2017-02-27 16:56:08.540155 7f31d30df700 -1 *** Caught signal (Aborted) **
> in thread 7f31d30df700
>
> ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
> 1: ceph_mds() [0x89984a]
> 2: (()+0x10350) [0x7f31d9c97350]
> 3: (gsignal()+0x39) [0x7f31d90d8c49]
> 4: (abort()+0x148) [0x7f31d90dc058]
> 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f31d99e3555]
> 6: (()+0x5e6f6) [0x7f31d99e16f6]
> 7: (()+0x5e723) [0x7f31d99e1723]
> 8: (()+0x5e942) [0x7f31d99e1942]
> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x278) [0x98bd38]
> 10: (SessionMap::_load_finish(int, ceph::buffer::list&)+0x2b4) [0x7df2a4]
> 11: (MDSIOContextBase::complete(int)+0x95) [0x7e34b5]
> 12: (Finisher::finisher_thread_entry()+0x190) [0x8bd6d0]
> 13: (()+0x8192) [0x7f31d9c8f192]
> 14: (clone()+0x6d) [0x7f31d919c51d]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> 0> 2017-02-27 16:56:08.540155 7f31d30df700 -1 *** Caught signal (Aborted) **
> in thread 7f31d30df700
>
> ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
> 1: ceph_mds() [0x89984a]
> 2: (()+0x10350) [0x7f31d9c97350]
> 3: (gsignal()+0x39) [0x7f31d90d8c49]
> 4: (abort()+0x148) [0x7f31d90dc058]
> 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f31d99e3555]
> 6: (()+0x5e6f6) [0x7f31d99e16f6]
> 7: (()+0x5e723) [0x7f31d99e1723]
> 8: (()+0x5e942) [0x7f31d99e1942]
> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x278) [0x98bd38]
> 10: (SessionMap::_load_finish(int, ceph::buffer::list&)+0x2b4) [0x7df2a4]
> 11: (MDSIOContextBase::complete(int)+0x95) [0x7e34b5]
> 12: (Finisher::finisher_thread_entry()+0x190) [0x8bd6d0]
> 13: (()+0x8192) [0x7f31d9c8f192]
> 14: (clone()+0x6d) [0x7f31d919c51d]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> How can I safely stop a Ceph cluster, so that it will cleanly start back up
> again?
>
> -Chris

--
Klinik-Service Neubrandenburg GmbH
Allendestr. 30, 17036 Neubrandenburg
Amtsgericht Neubrandenburg, HRB 2457
Managing director: Gudrun Kappich
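
For reference, the node-by-node approach described in the reply at the top of this message might look roughly like the sketch below. It is an illustration under assumptions, not the exact commands that were used: the reply only says the OSD weight was set to 0, so the use of `ceph osd crush reweight` (rather than `ceph osd reweight`), the OSD ids 0 and 1, and the restored weight 1.0 are all hypothetical. The helper names `run`, `drain_osds` and `restore_osds` are made up for this example. By default the script only prints the commands; set RUN=1 to execute them.

```shell
#!/bin/sh
# Dry-run sketch: prints the ceph commands instead of running them.
# Set RUN=1 to execute them (assumption: the ceph CLI is on PATH and
# an admin keyring is available on this host).

run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

# Drain the given OSDs by setting their CRUSH weight to 0.
# Arguments: ids of the OSDs on the node about to be upgraded.
drain_osds() {
    for id in "$@"; do
        run ceph osd crush reweight "osd.$id" 0
    done
}

# Set the OSDs back to their previous CRUSH weight.
# First argument: weight to restore; remaining arguments: OSD ids.
restore_osds() {
    weight=$1
    shift
    for id in "$@"; do
        run ceph osd crush reweight "osd.$id" "$weight"
    done
}

# Hypothetical node hosting osd.0 and osd.1, each with weight 1.0.
drain_osds 0 1
# Here: wait until rebalancing has finished (for example by watching
# `ceph -s`), upgrade the OS, pin the Ceph version, reboot, and check
# that all daemons come back up.
restore_osds 1.0 0 1
```

Note the weights of each OSD before draining (for example from `ceph osd tree`) so that the same values can be restored afterwards.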