Re: Upgrading 2K OSDs from Hammer to Jewel. Our experience

Brad Hubbard <bhubbard@xxxxxxxxxx> · Sun, 12 Mar 2017 19:52:12 +1000

On Sun, Mar 12, 2017 at 6:36 AM, Christian Theune <ct@xxxxxxxxxxxxxxx> wrote:
> Hi,
>
> thanks for that report! Glad to hear a mostly happy report. I’m still on the
> fence … ;)
>
> I have had reports that Qemu (librbd connections) will require
> updates/restarts before upgrading. What was your experience on that side?
> Did you upgrade the clients? Did you start using any of the new RBD
> features, like fast diff?

You don't need to restart qemu-kvm instances *before* upgrading, but
you do need to restart or migrate them *after* updating. Shared
libraries such as librbd are only loaded into the qemu process address
space at start-up, so to pick up the newly installed versions you need
to restart the instance or migrate it to an upgraded host.
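
To see which running instances still have the pre-upgrade library
mapped, something like the sketch below may help. It relies on the
kernel marking a mapped file whose on-disk copy has been replaced as
"(deleted)" in /proc/<pid>/maps. The function name stale_rbd_pids and
its optional proc-root argument are made up for illustration, and it
has to run as root to read other users' maps files.

```shell
#!/bin/sh
# Print the PIDs of processes that still map a replaced (deleted) copy of
# librbd or librados. The optional first argument is a hypothetical
# override of /proc, used only to try the function on a fake tree.
stale_rbd_pids() {
    proc_root="${1:-/proc}"
    for maps in "$proc_root"/[0-9]*/maps; do
        [ -r "$maps" ] || continue
        # The kernel appends " (deleted)" to the path of a mapped file
        # that has been unlinked, which is what a package upgrade does
        # to the old shared object.
        if grep -Eq 'lib(rbd|rados)\.so[^ ]* \(deleted\)' "$maps" 2>/dev/null; then
            pid="${maps%/maps}"
            echo "${pid##*/}"
        fi
    done
}
```

Running stale_rbd_pids with no argument on a compute node should list
the processes that still need a restart or a migration; an empty result
suggests nothing is holding the old library.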

>
> What’s your experience with load/performance after the upgrade? Found any
> new issues that indicate shifted hotspots?
>
> Cheers and thanks again,
> Christian
>
> On Mar 11, 2017, at 12:21 PM, cephmailinglist@xxxxxxxxx wrote:
>
> Hello list,
>
> A week ago we upgraded our Ceph clusters from Hammer to Jewel and with this
> email we want to share our experiences.
>
>
> We have four clusters:
>
> 1) Test cluster for all the fun things, completely virtual.
>
> 2) Test cluster for Openstack: 3 monitors and 9 OSDs, all baremetal
>
> 3) Cluster where we store backups: 3 monitors and 153 OSDs. 554 TB storage
>
> 4) Main cluster (used for our custom software stack and openstack): 5
> monitors and 1917 OSDs. 8 PB storage
>
>
> All the clusters are running on Ubuntu 14.04 LTS and we use the Ceph
> packages from ceph.com. On every cluster we upgraded the monitors first and
> after that, the OSDs. Our backup cluster is the only cluster that also
> serves S3 via the RadosGW and that service was upgraded at the same time as
> the OSDs in that cluster. The upgrade of clusters 1, 2 and 3 went without
> any problem, just an apt-get upgrade on every component. We did see the
> message "failed to encode map e<version> with expected crc", but that
> message disappeared once all the OSDs were upgraded.
>
> The upgrade of our biggest cluster, nr 4, did not go without problems. Since
> we were expecting a lot of "failed to encode map e<version> with expected
> crc" messages, we disabled clog to monitors with 'ceph tell osd.* injectargs
> -- --clog_to_monitors=false' so our monitors would not choke on those
> messages. The upgrade of the monitors went as expected, without any
> problem. The problems started when we began upgrading the OSDs. In
> the upgrade procedure, we had to change the ownership of the files from root
> to the user ceph, and that process was taking so long on our cluster that
> completing the upgrade would have taken more than a week. We decided to keep
> the permissions as they were for now, so in the upstart init script
> /etc/init/ceph-osd.conf, we changed '--setuser ceph --setgroup ceph' to
> '--setuser root --setgroup root', planning to fix the ownership OSD by OSD
> after the upgrade was completely done.
>
> On cluster 3 (backup) we could change the permissions in a shorter time with
> the following procedure:
>
>     a) apt-get -y install ceph-common
>     b) mount|egrep 'on \/var.*ceph.*osd'|awk '{print $3}'|while read P; do echo chown -R ceph:ceph $P \&;done > t ; bash t ; rm t
>     c) (wait for all the chown's to complete)
>     d) stop ceph-all
>     e) find /var/lib/ceph/ ! -uid 64045 -print0|xargs -0  chown ceph:ceph
>     f) start ceph-all
>
> This procedure did not work on our main (4) cluster because the load on the
> OSDs became 100% in step b and that resulted in blocked I/O on some virtual
> instances in the Openstack cluster. Also, at that time one of our pools got a
> lot of extra data. Those files were stored with root permissions since we
> had not restarted the Ceph daemons yet, and the 'find' in step e found so
> many files that xargs (the shell) could not handle it (too many arguments).
> At that point we decided to keep root ownership during the upgrade phase.
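
For a cluster where the parallel chown in step b is too heavy, a
serial, low-priority variant might be worth trying. The following is
only a sketch and has not been run on a cluster of this size. It
assumes the OSDs are mounted under one base directory (by default
/var/lib/ceph/osd); the function names and the IO_PREFIX and DRY_RUN
variables are invented for illustration. Letting find batch the file
names with "-exec ... {} +" also avoids passing a huge list through
xargs.

```shell
#!/bin/sh
# Sketch only: chown OSD data directories one at a time instead of all in
# parallel. IO_PREFIX, DRY_RUN and both function names are hypothetical.

# Change ownership of everything under $1 that is not yet owned by the
# user part of $2 ("user:group"). find batches the file names itself, so
# no oversized argument list is built.
chown_tree() {
    dir="$1"
    owner="$2"
    # Run at idle I/O priority and lowest CPU priority unless overridden.
    prefix="${IO_PREFIX-ionice -c3 nice -n19}"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        # Only list what would be changed.
        $prefix find "$dir" ! -user "${owner%%:*}" -print
    else
        $prefix find "$dir" ! -user "${owner%%:*}" -exec chown -h "$owner" {} +
    fi
}

# Walk the OSD directories below $1 serially, sleeping $3 seconds between
# them so client I/O gets some room to recover.
chown_osds_serially() {
    base="$1"
    owner="$2"
    pause="${3:-0}"
    for osd in "$base"/*/; do
        [ -d "$osd" ] || continue
        chown_tree "$osd" "$owner"
        sleep "$pause"
    done
}
```

A call such as "chown_osds_serially /var/lib/ceph/osd ceph:ceph 30"
would replace step b; setting DRY_RUN=1 first only prints what would be
changed. The trade-off is a much longer total run time, and the ionice
idle class only has an effect with an I/O scheduler that honours
priorities, such as CFQ.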
>
> The next and biggest problem we encountered had to do with the CRC errors on
> the OSD map. On every map update, the OSDs that were not upgraded yet got
> that CRC error and asked the monitor for a full OSD map instead of just a
> delta update. At first we did not understand what exactly was happening. We
> ran the upgrade per node using a script; that script watches the state of
> the cluster and, when the cluster is healthy again, upgrades the next host.
> Every time we started the script (skipping the already upgraded hosts) the
> first host(s) upgraded without issues and then we got blocked I/O on the
> cluster. The blocked I/O went away within a minute or two (not measured).
> After investigation we found out that the blocked I/O happened when nodes
> were asking the monitor for a (full) OSD map, which briefly resulted in a
> fully saturated network link on our monitor.
>
> The next graph shows the network statistics for one of our Ceph monitors.
> Our hosts are equipped with 10 Gbit/s NICs, and the problems occurred at
> each of the highest peaks. We could work around this problem by waiting four
> minutes between every host, and after that time (14:20) we did not have any
> issues any more. Of course the number of not yet upgraded OSDs decreased, so
> the number of full OSD map requests also got smaller over time.
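
The per-host loop described above, with the extra pause, could look
roughly like this. It is a sketch under assumptions, not the script
that was actually used: upgrade_host is a hypothetical hook the caller
has to define (for example an ssh call that upgrades the packages and
restarts the daemons on that host), and the function names are
invented. It polls "ceph health" until it reports HEALTH_OK and then
sleeps for a fixed time before moving on.

```shell
#!/bin/sh
# Block until `ceph health` reports HEALTH_OK, polling every $1 seconds
# (default 10).
wait_for_health_ok() {
    poll="${1:-10}"
    until ceph health 2>/dev/null | grep -q '^HEALTH_OK'; do
        sleep "$poll"
    done
}

# Upgrade hosts one at a time: $1 is the pause in seconds after each
# host, the remaining arguments are host names. upgrade_host is a
# hypothetical hook that must be defined by the caller; the loop stops
# at the first host for which it fails.
rolling_upgrade() {
    pause="$1"
    shift
    for host in "$@"; do
        upgrade_host "$host" || return 1
        wait_for_health_ok
        # Extra breathing room so the monitors are not hit by full OSD
        # map requests from the next batch right away.
        sleep "$pause"
    done
}
```

With the four minutes mentioned above this would be called as
"rolling_upgrade 240 host1 host2 ...". Note that the loop blocks
forever if the cluster sits in HEALTH_WARN for an unrelated reason, so
a real script would probably want a timeout or a looser health check.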
>
>
> <mon0_network_hammer_to_jewel_upgrade.png>
>
>
> The day after the upgrade we had issues with live migrations of Openstack
> instances. We got this message: "OSError: /usr/lib/librbd.so.1: undefined
> symbol: _ZN8librados5Rados15aio_watch_flushEPNS_13AioCompletionE". This was
> resolved by restarting libvirt-bin and nova-compute on every compute node.
>
> Please note that the upgrade of our biggest cluster was not a 100%
> success, but the problems were relatively small and the cluster stayed
> on-line. Only a few virtual Openstack instances did not like the blocked
> I/O and had to be restarted.
>
>
> --
>
> With regards,
>
> Richard Arends.
> Snow BV / http://snow.nl
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> --
> Christian Theune · ct@xxxxxxxxxxxxxxx · +49 345 219401 0
> Flying Circus Internet Operations GmbH · http://flyingcircus.io
> Forsterstraße 29 · 06112 Halle (Saale) · Deutschland
> HR Stendal HRB 21169 · Geschäftsführer: Christian. Theune, Christian.
> Zagrodnick
>
>
>

-- 
Cheers,
Brad