Hello list,
A week ago we upgraded our Ceph clusters from Hammer to Jewel and
with this email we want to share our experiences.
We have four clusters:
1) Test cluster for all the fun things, completely virtual.
2) Test cluster for OpenStack: 3 monitors and 9 OSDs, all
   bare metal
3) Cluster where we store backups: 3 monitors and 153 OSDs,
   554 TB of storage
4) Main cluster (used for our custom software stack and
   OpenStack): 5 monitors and 1917 OSDs, 8 PB of storage
All the clusters are running on Ubuntu 14.04 LTS and we use the
Ceph packages from ceph.com. On every cluster we upgraded the
monitors first and after that the OSDs. Our backup cluster is the
only cluster that also serves S3 via the RadosGW; that service
was upgraded at the same time as the OSDs in that cluster. The
upgrade of clusters 1, 2 and 3 went without any problem: just an
apt-get upgrade on every component. We did see the message
"failed to encode map e<version> with expected crc", but
that message disappeared once all the OSDs were upgraded.
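For reference, on Ubuntu 14.04 with upstart the per-host part of
such an upgrade boils down to something like this (a sketch; the
exact restart commands depend on how the daemons were started):

  apt-get update && apt-get -y upgrade   # pulls Jewel from the ceph.com repo
  restart ceph-mon-all                   # on the monitor hosts first
  restart ceph-osd-all                   # on OSD hosts, after all mons run Jewel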
The upgrade of our biggest cluster, nr 4, did not go without
problems. Since we were expecting a lot of "failed to encode map
e<version> with expected crc" messages, we disabled clog to
monitors with 'ceph tell osd.* injectargs --
--clog_to_monitors=false' so our monitors would not choke on those
messages. The upgrade of the monitors went as expected, without
any problem; the problems started when we began upgrading the
OSDs. In the upgrade procedure we had to change the ownership
of the files from root to the user ceph, and that process was
taking so long on our cluster that completing the upgrade would
take more than a week. We decided to keep the permissions as they
were for now, so in the upstart init script
/etc/init/ceph-osd.conf we changed '--setuser ceph --setgroup
ceph' to '--setuser root --setgroup root', to fix the ownership
OSD by OSD after the upgrade was completely done.
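For the record, that change in the upstart job is a one-liner
along these lines (a sketch; check the exact arguments in your
/etc/init/ceph-osd.conf before running it):

  sed -i 's/--setuser ceph --setgroup ceph/--setuser root --setgroup root/' /etc/init/ceph-osd.conf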
On cluster 3 (backup) we could change the permissions in a
shorter time with the following procedure:
a) apt-get -y install ceph-common
b) mount | egrep 'on \/var.*ceph.*osd' | awk '{print $3}' | while read P; do echo chown -R ceph:ceph $P \&; done > t; bash t; rm t
c) (wait for all the chowns to complete)
d) stop ceph-all
e) find /var/lib/ceph/ ! -uid 64045 -print0 | xargs -0 chown ceph:ceph
f) start ceph-all
This procedure did not work on our main (4) cluster, because the
load on the OSDs went to 100% during step b and that resulted in
blocked I/O on some virtual instances in the OpenStack cluster.
Also, at that time one of our pools received a lot of extra data;
those files were stored with root ownership since we had not
restarted the Ceph daemons yet, and the 'find' in step e found so
many files that xargs (the shell) could not handle them (too many
arguments). At that point we decided to keep the permissions on
root during the upgrade phase.
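For completeness: a variant of step e that lets 'find' batch the
arguments itself, and so avoids the argument-list limit, would be
something along these lines (64045 being the uid the packages
create for the ceph user):

  find /var/lib/ceph/ ! -uid 64045 -exec chown ceph:ceph {} +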
The next and biggest problem we encountered had to do with the
CRC errors on the OSD map. On every map update, the OSDs that were
not upgraded yet got that CRC error and asked the monitor for a
full OSD map instead of just a delta update. At first we did not
understand what exactly happened. We ran the upgrade per node
using a script; in that script we watch the state of the
cluster and when the cluster is healthy again, we upgrade the next
host. Every time we started the script (skipping the already
upgraded hosts), the first host(s) upgraded without issues and
then we got blocked I/O on the cluster. The blocked I/O went away
within a minute or two (not measured). After investigation we
found out that the blocked I/O happened when nodes were asking the
monitor for a (full) OSD map, which briefly resulted in a fully
saturated network link on our monitor.
The next graph shows the statistics for one of our Ceph monitors.
Our hosts are equipped with 10 Gbit/s NICs, and every time at the
highest peaks the problems occurred. We could work around this
problem by waiting four minutes between every host, and after
that time (14:20) we did not have any issues anymore. Of course
the number of not-yet-upgraded OSDs decreased, so the number of
full OSD map requests also got smaller over time.
<mon0_network_hammer_to_jewel_upgrade.png>
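A simplified sketch of such a per-host loop, with the four minute
pause added, could look like this (host list, ssh invocation and
health check are illustrative):

  for host in $(cat osd-hosts.txt); do
      ssh "$host" 'apt-get -y upgrade && restart ceph-osd-all'
      # wait until the cluster reports HEALTH_OK again
      until ceph health | grep -q HEALTH_OK; do sleep 10; done
      # extra pause so the monitor is not flooded with full
      # OSD map requests
      sleep 240
  done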
The day after the upgrade we had issues with live migrations of
OpenStack instances. We got this message: "OSError:
/usr/lib/librbd.so.1: undefined symbol:
_ZN8librados5Rados15aio_watch_flushEPNS_13AioCompletionE". This
was resolved by restarting libvirt-bin and nova-compute on every
compute node.
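On Ubuntu 14.04 that comes down to something like the following
on every compute node:

  service libvirt-bin restart
  service nova-compute restart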
Please note that the upgrade of our biggest cluster was not a
100% success, but the problems were relatively small, the
cluster stayed online, and only a few virtual OpenStack
instances did not like the blocked I/O and had to be restarted.
--
With regards,
Richard Arends.
Snow BV / http://snow.nl