Re: Antw: Safely Upgrading OS on a live Ceph Cluster

In my case the Ceph version will stay identical. But I may have to fall back to this node-by-node approach if I can't stabilize the more general shutdown/bring-up approach. There are 192 OSDs in my cluster, so going node by node will unfortunately take a while.
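
For reference, the per-node drain step Steffen describes below boils down to something like this (only a sketch; osd.10 and osd.11 are placeholder IDs for whichever OSDs live on the node being upgraded, the restored weights are whatever values those OSDs had before, and Steffen may have used `ceph osd reweight` rather than the CRUSH weight):

    # drain the node's OSDs ahead of the OS upgrade
    ceph osd crush reweight osd.10 0
    ceph osd crush reweight osd.11 0
    # wait for backfill to finish before touching the node
    ceph -s
    # ... upgrade the OS, pin the desired ceph packages, reboot, verify ...
    # then restore the original weights so data flows back
    ceph osd crush reweight osd.10 1.82
    ceph osd crush reweight osd.11 1.82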

-Chris

> On Mar 1, 2017, at 2:50 AM, Steffen Weißgerber <WeissgerberS@xxxxxxx> wrote:
> 
> Hello,
> 
> some time ago I upgraded our 6 node cluster (0.94.9) running on Ubuntu from Trusty
> to Xenial.
> 
> The problem here was that the OS update also upgrades Ceph, which we did not want to do
> in the same step, because it would have forced us to upgrade all nodes at the same time.
> 
> Therefore we did it node by node, first draining the OSDs on each node by setting their weight to 0.
> 
> After the OS update, after pinning the right Ceph version for our setup, and after testing the reboot
> so that all components started up correctly, we set the OSD weights back to their normal values so
> that the cluster rebalanced.
> 
> With this procedure the cluster stayed up the whole time.
> 
> Regards
> 
> Steffen
> 
> 
>>>> "Heller, Chris" <cheller@xxxxxxxxxx> schrieb am Montag, 27. Februar 2017 um
> 18:01:
>> I am attempting an operating system upgrade of a live Ceph cluster. Before I 
>> go and screw up my production system, I have been testing on a smaller 
>> installation, and I keep running into issues when bringing the Ceph FS 
>> metadata server online.
>> 
>> My approach here has been to store all Ceph critical files on non-root 
>> partitions, so the OS install can safely proceed without overwriting any of 
>> the Ceph configuration or data.
>> 
>> Here is how I proceed:
>> 
>> First I bring down the Ceph FS via `ceph mds cluster_down`.
>> Second, to keep the cluster from marking OSDs out and starting recovery, I run
>> `ceph osd set noout`
>> Finally I stop the ceph processes in the following order: ceph-mds, ceph-mon, 
>> ceph-osd
>> 
>> Note my cluster has 1 MDS, 1 mon, and 7 OSDs.
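
Spelled out as commands, the shutdown sequence above would look roughly like this (a sketch only; the two `ceph` commands are the ones quoted above, while the daemon stop lines depend on the init system and here assume the classic sysvinit ceph service script):

    ceph mds cluster_down          # take the Ceph FS offline
    ceph osd set noout             # don't mark OSDs out while they are down
    # on each node, stop the daemons in order mds -> mon -> osd, e.g.:
    sudo service ceph stop mds
    sudo service ceph stop mon
    sudo service ceph stop osd
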
>> 
>> I then install the new OS and bring the cluster back up by walking the
>> steps in reverse:
>> 
>> First I start the ceph processes in the following order: ceph-osd, ceph-mon, 
>> ceph-mds
>> Second I clear the noout flag with `ceph osd unset noout`
>> Finally I bring up the Ceph FS via `ceph mds cluster_up`
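
And the corresponding bring-up, again only a sketch under the same assumptions:

    # on each node, start the daemons in order osd -> mon -> mds, e.g.:
    sudo service ceph start osd
    sudo service ceph start mon
    sudo service ceph start mds
    ceph osd unset noout           # allow OSDs to be marked out again
    ceph mds cluster_up            # bring the Ceph FS back online
    ceph -s                        # check health before clients reconnect
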
>> 
>> Everything works smoothly except the Ceph FS bring-up. The MDS starts in the 
>> active:replay state and eventually crashes with the following backtrace:
>> 
>> starting mds.cuba at :/0
>> 2017-02-27 16:56:08.233680 7f31daa3b7c0 -1 mds.-1.0 log_to_monitors 
>> {default=true}
>> 2017-02-27 16:56:08.537714 7f31d30df700 -1 mds.0.sessionmap _load_finish got 
>> (2) No such file or directory
>> mds/SessionMap.cc: In function 'void 
>> SessionMap::_load_finish(int, ceph::bufferlist&)' thread 7f31d30df700 time 
>> 2017-02-27 16:56:08.537739
>> mds/SessionMap.cc: 98: FAILED assert(0 == "failed to 
>> load sessionmap")
>> ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
>> const*)+0x8b) [0x98bb4b]
>> 2: (SessionMap::_load_finish(int, ceph::buffer::list&)+0x2b4) [0x7df2a4]
>> 3: (MDSIOContextBase::complete(int)+0x95) [0x7e34b5]
>> 4: (Finisher::finisher_thread_entry()+0x190) [0x8bd6d0]
>> 5: (()+0x8192) [0x7f31d9c8f192]
>> 6: (clone()+0x6d) [0x7f31d919c51d]
>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to 
>> interpret this.
>> 2017-02-27 16:56:08.538493 7f31d30df700 -1 mds/SessionMap.cc: In function 'void SessionMap::_load_finish(int, 
>> ceph::bufferlist&)' thread 7f31d30df700 time 2017-02-27 16:56:08.537739
>> mds/SessionMap.cc: 98: FAILED assert(0 == "failed to 
>> load sessionmap")
>> 
>> ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
>> const*)+0x8b) [0x98bb4b]
>> 2: (SessionMap::_load_finish(int, ceph::buffer::list&)+0x2b4) [0x7df2a4]
>> 3: (MDSIOContextBase::complete(int)+0x95) [0x7e34b5]
>> 4: (Finisher::finisher_thread_entry()+0x190) [0x8bd6d0]
>> 5: (()+0x8192) [0x7f31d9c8f192]
>> 6: (clone()+0x6d) [0x7f31d919c51d]
>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to 
>> interpret this.
>> 
>> -106> 2017-02-27 16:56:08.233680 7f31daa3b7c0 -1 mds.-1.0 log_to_monitors 
>> {default=true}
>>   -1> 2017-02-27 16:56:08.537714 7f31d30df700 -1 mds.0.sessionmap _load_finish 
>> got (2) No such file or directory
>>    0> 2017-02-27 16:56:08.538493 7f31d30df700 -1 mds/SessionMap.cc: In function 'void SessionMap::_load_finish(int, 
>> ceph::bufferlist&)' thread 7f31d30df700 time 2017-02-27 16:56:08.537739
>> mds/SessionMap.cc: 98: FAILED assert(0 == "failed to 
>> load sessionmap")
>> 
>> ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
>> const*)+0x8b) [0x98bb4b]
>> 2: (SessionMap::_load_finish(int, ceph::buffer::list&)+0x2b4) [0x7df2a4]
>> 3: (MDSIOContextBase::complete(int)+0x95) [0x7e34b5]
>> 4: (Finisher::finisher_thread_entry()+0x190) [0x8bd6d0]
>> 5: (()+0x8192) [0x7f31d9c8f192]
>> 6: (clone()+0x6d) [0x7f31d919c51d]
>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to 
>> interpret this.
>> 
>> terminate called after throwing an instance of 'ceph::FailedAssertion'
>> *** Caught signal (Aborted) **
>> in thread 7f31d30df700
>> ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
>> 1: ceph_mds() [0x89984a]
>> 2: (()+0x10350) [0x7f31d9c97350]
>> 3: (gsignal()+0x39) [0x7f31d90d8c49]
>> 4: (abort()+0x148) [0x7f31d90dc058]
>> 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f31d99e3555]
>> 6: (()+0x5e6f6) [0x7f31d99e16f6]
>> 7: (()+0x5e723) [0x7f31d99e1723]
>> 8: (()+0x5e942) [0x7f31d99e1942]
>> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
>> const*)+0x278) [0x98bd38]
>> 10: (SessionMap::_load_finish(int, ceph::buffer::list&)+0x2b4) [0x7df2a4]
>> 11: (MDSIOContextBase::complete(int)+0x95) [0x7e34b5]
>> 12: (Finisher::finisher_thread_entry()+0x190) [0x8bd6d0]
>> 13: (()+0x8192) [0x7f31d9c8f192]
>> 14: (clone()+0x6d) [0x7f31d919c51d]
>> 2017-02-27 16:56:08.540155 7f31d30df700 -1 *** Caught signal (Aborted) **
>> in thread 7f31d30df700
>> 
>> ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
>> 1: ceph_mds() [0x89984a]
>> 2: (()+0x10350) [0x7f31d9c97350]
>> 3: (gsignal()+0x39) [0x7f31d90d8c49]
>> 4: (abort()+0x148) [0x7f31d90dc058]
>> 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f31d99e3555]
>> 6: (()+0x5e6f6) [0x7f31d99e16f6]
>> 7: (()+0x5e723) [0x7f31d99e1723]
>> 8: (()+0x5e942) [0x7f31d99e1942]
>> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
>> const*)+0x278) [0x98bd38]
>> 10: (SessionMap::_load_finish(int, ceph::buffer::list&)+0x2b4) [0x7df2a4]
>> 11: (MDSIOContextBase::complete(int)+0x95) [0x7e34b5]
>> 12: (Finisher::finisher_thread_entry()+0x190) [0x8bd6d0]
>> 13: (()+0x8192) [0x7f31d9c8f192]
>> 14: (clone()+0x6d) [0x7f31d919c51d]
>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to 
>> interpret this.
>> 
>>    0> 2017-02-27 16:56:08.540155 7f31d30df700 -1 *** Caught signal (Aborted) **
>> in thread 7f31d30df700
>> 
>> ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
>> 1: ceph_mds() [0x89984a]
>> 2: (()+0x10350) [0x7f31d9c97350]
>> 3: (gsignal()+0x39) [0x7f31d90d8c49]
>> 4: (abort()+0x148) [0x7f31d90dc058]
>> 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f31d99e3555]
>> 6: (()+0x5e6f6) [0x7f31d99e16f6]
>> 7: (()+0x5e723) [0x7f31d99e1723]
>> 8: (()+0x5e942) [0x7f31d99e1942]
>> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
>> const*)+0x278) [0x98bd38]
>> 10: (SessionMap::_load_finish(int, ceph::buffer::list&)+0x2b4) [0x7df2a4]
>> 11: (MDSIOContextBase::complete(int)+0x95) [0x7e34b5]
>> 12: (Finisher::finisher_thread_entry()+0x190) [0x8bd6d0]
>> 13: (()+0x8192) [0x7f31d9c8f192]
>> 14: (clone()+0x6d) [0x7f31d919c51d]
>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to 
>> interpret this.
>> 
>> How can I safely stop a Ceph cluster, so that it will cleanly start back up 
>> again?
>> 
>> -Chris
> 
> -- 
> Klinik-Service Neubrandenburg GmbH
> Allendestr. 30, 17036 Neubrandenburg
> Amtsgericht Neubrandenburg, HRB 2457
> Geschaeftsfuehrerin: Gudrun Kappich
> 

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
