Hello,

On Sun, 12 Mar 2017 19:52:12 +1000 Brad Hubbard wrote:

> On Sun, Mar 12, 2017 at 6:36 AM, Christian Theune <ct@xxxxxxxxxxxxxxx> wrote:
> > Hi,
> >
> > thanks for that report! Glad to hear a mostly happy report. I’m still
> > on the fence … ;)
> >
> > I have had reports that Qemu (librbd connections) will require
> > updates/restarts before upgrading. What was your experience on that
> > side? Did you upgrade the clients? Did you start using any of the new
> > RBD features, like fast diff?
>
> You don't need to restart qemu-kvm instances *before* upgrading but
> you do need to restart or migrate them *after* upgrading. The updated
> binaries are only loaded into the qemu process address space at
> start-up, so to load the newly installed binaries (libraries) you need
> to restart or do a migration to an upgraded host.
>
Well, the OP wrote about live migration problems, but those were not in
the qemu part of things but libvirt/openstack related.

To wit, I did upgrade a test cluster from Hammer to Jewel and live
migration under ganeti worked fine.
I've also not seen any problems on other instances that have not been
restarted since, nor would I hope that an upgrade from one stable
version to the next would EVER require such a step (at least not
immediately).

Christian

> >
> > What’s your experience with load/performance after the upgrade? Found
> > any new issues that indicate shifted hotspots?
> >
> > Cheers and thanks again,
> > Christian
> >
> > On Mar 11, 2017, at 12:21 PM, cephmailinglist@xxxxxxxxx wrote:
> >
> > Hello list,
> >
> > A week ago we upgraded our Ceph clusters from Hammer to Jewel and with
> > this email we want to share our experiences.
> >
> > We have four clusters:
> >
> > 1) Test cluster for all the fun things, completely virtual.
> > 2) Test cluster for Openstack: 3 monitors and 9 OSDs, all baremetal.
> > 3) Cluster where we store backups: 3 monitors and 153 OSDs, 554 TB storage.
> > 4) Main cluster (used for our custom software stack and openstack):
> >    5 monitors and 1917 OSDs, 8 PB storage.
> >
> > All the clusters are running on Ubuntu 14.04 LTS and we use the Ceph
> > packages from ceph.com. On every cluster we upgraded the monitors first
> > and after that the OSDs. Our backup cluster is the only cluster that
> > also serves S3 via the RadosGW, and that service was upgraded at the
> > same time as the OSDs in that cluster. The upgrade of clusters 1, 2 and
> > 3 went without any problem, just an apt-get upgrade on every component.
> > We did see the message "failed to encode map e<version> with expected
> > crc", but that message disappeared when all the OSDs were upgraded.
> >
> > The upgrade of our biggest cluster, nr 4, did not go without problems.
> > Since we were expecting a lot of "failed to encode map e<version> with
> > expected crc" messages, we disabled clog to monitors with
> > 'ceph tell osd.* injectargs -- --clog_to_monitors=false' so our
> > monitors would not choke on those messages. The upgrade of the monitors
> > went as expected, without any problem; the problems started when we
> > began the upgrade of the OSDs. In the upgrade procedure we had to
> > change the ownership of the files from root to the user ceph, and that
> > process was taking so long on our cluster that completing the upgrade
> > would take more than a week.
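A small aside on the clog change above (my own sketch, not part of the
original report): if you inject that setting, it is worth verifying it
actually took effect and remembering to turn it back on once everything
is upgraded; injected values do not survive a daemon restart, so freshly
restarted OSDs revert to whatever is in ceph.conf. 'osd.0' is just an
example, and the 'ceph daemon' call has to run on the host where that
OSD lives (default admin socket setup assumed):

    # Confirm the runtime value on one OSD via its admin socket
    # (run this on the host that carries osd.0):
    ceph daemon osd.0 config show | grep clog_to_monitors
    # After all daemons are upgraded, re-enable logging to the monitors:
    ceph tell osd.* injectargs -- --clog_to_monitors=true
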
> > We decided to keep the permissions as they were for now: in the
> > upstart init script /etc/init/ceph-osd.conf we changed
> > '--setuser ceph --setgroup ceph' to '--setuser root --setgroup root',
> > planning to fix that OSD by OSD after the upgrade was completely done.
> >
> > On cluster 3 (backup) we could change the permissions in a shorter
> > time with the following procedure:
> >
> > a) apt-get -y install ceph-common
> > b) mount|egrep 'on \/var.*ceph.*osd'|awk '{print $3}'|while read P; do
> >    echo chown -R ceph:ceph $P \&;done > t ; bash t ; rm t
> > c) (wait for all the chowns to complete)
> > d) stop ceph-all
> > e) find /var/lib/ceph/ ! -uid 64045 -print0|xargs -0 chown ceph:ceph
> > f) start ceph-all
> >
> > This procedure did not work on our main cluster (4) because the load
> > on the OSDs went to 100% in step b, which resulted in blocked I/O on
> > some virtual instances in the Openstack cluster. Also, at that time
> > one of our pools received a lot of extra data; those files were stored
> > with root permissions since we had not restarted the Ceph daemons yet,
> > and the 'find' in step e found so many files that xargs (the shell)
> > could not handle them (too many arguments). At that point we decided
> > to keep root permissions during the upgrade phase.
> >
> > The next and biggest problem we encountered had to do with the CRC
> > errors on the OSD map. On every map update, the OSDs that were not yet
> > upgraded got that CRC error and asked the monitor for a full OSD map
> > instead of just a delta update. At first we did not understand what
> > exactly happened. We ran the upgrade per node using a script; in that
> > script we watch the state of the cluster and, when the cluster is
> > healthy again, we upgrade the next host. Every time we started the
> > script (skipping the already upgraded hosts), the first host(s)
> > upgraded without issues and then we got blocked I/O on the cluster.
> > The blocked I/O went away within a minute or two (not measured). After
> > investigation we found out that the blocked I/O happened when nodes
> > were asking the monitor for a (full) OSD map, which briefly saturated
> > the network link on our monitor.
> >
> > The next graph shows the statistics for one of our Ceph monitors. Our
> > hosts are equipped with 10 Gbit/s NICs, and every time at the highest
> > peaks the problems occurred. We could work around this problem by
> > waiting four minutes between hosts, and after that time (14:20) we did
> > not have any issues any more. Of course the number of not yet upgraded
> > OSDs decreased, so the number of full OSD map requests also got
> > smaller over time.
> >
> > <mon0_network_hammer_to_jewel_upgrade.png>
> >
> > The day after the upgrade we had issues with live migrations of
> > Openstack instances. We got this message: "OSError:
> > /usr/lib/librbd.so.1: undefined symbol:
> > _ZN8librados5Rados15aio_watch_flushEPNS_13AioCompletionE". This was
> > resolved by restarting libvirt-bin and nova-compute on every compute
> > node.
> >
> > Please note that the upgrade of our biggest cluster was not a 100%
> > success, but the problems were relatively small, the cluster stayed
> > on-line, and there were only a few virtual Openstack instances that
> > did not like the blocked I/O and had to be restarted.
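Since others will probably hit the same full-osdmap storm: here is a
rough sketch (mine, not the OP's actual script) of the throttled
per-host loop described above. Upgrade one host, wait for HEALTH_OK,
then pause a few minutes so the full OSD map requests from the
not-yet-upgraded OSDs can drain before the next batch of OSD restarts.
The host list file, the package command and the upstart job name are
placeholders for whatever your environment uses:

    #!/bin/bash
    # Throttled per-host Hammer -> Jewel OSD upgrade (sketch).
    for HOST in $(cat upgrade-hosts.txt); do   # placeholder host list
        # Placeholder upgrade/restart commands for one OSD host:
        ssh "$HOST" 'apt-get -y install ceph && restart ceph-osd-all'
        # Wait until recovery from the OSD restarts has finished ...
        until ceph health | grep -q HEALTH_OK; do
            sleep 10
        done
        # ... then give the monitor link time to drain the full osdmap
        # requests (the ~4 minute gap that worked for the OP).
        sleep 240
    done
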
> > --
> >
> > With regards,
> >
> > Richard Arends.
> > Snow BV / http://snow.nl
> >
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@xxxxxxxxxxxxxx
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
> > --
> > Christian Theune · ct@xxxxxxxxxxxxxxx · +49 345 219401 0
> > Flying Circus Internet Operations GmbH · http://flyingcircus.io
> > Forsterstraße 29 · 06112 Halle (Saale) · Deutschland
> > HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick
> >
> >
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@xxxxxxxxxxxxxx
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >


--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com