Re: Lessons learned upgrading Hammer -> Jewel

> On 15 July 2016 at 10:48, Mart van Santen <mart@xxxxxxxxxxxx> wrote:
> 
> 
> 
> Hi Wido,
> 
> Thank you, we are currently in the same process, so this information is
> very useful. Can you share why you upgraded from Hammer directly to
> Jewel? Is there a reason to skip Infernalis? I wonder why you didn't
> do a Hammer -> Infernalis -> Jewel upgrade, as that seems the logical
> path to me.
> 

LTS to LTS upgrades, that's why. We tested it on a small scale on a few VMs first and afterwards did the production cluster.

We needed to go to Jewel for some fixes for large clusters and for RGW fixes and features (AWS4 signature support).

Wido

> (We did indeed see the same "Failed to encode map eXXX with
> expected crc" errors when upgrading to the latest Hammer.)
> 
> 
> Regards,
> 
> Mart
> 
> 
> 
> 
> 
> 
> 
> On 07/15/2016 03:08 AM, 席智勇 wrote:
> > good job, thank you for sharing, Wido~
> > it's very useful~
> >
> > 2016-07-14 14:33 GMT+08:00 Wido den Hollander <wido@xxxxxxxx>:
> >
> >     To add, the RGWs upgraded just fine as well.
> >
> >     No regions in use here (yet!), so that upgraded as it should.
> >
> >     Wido
> >
> >     > On 13 July 2016 at 16:56, Wido den Hollander <wido@xxxxxxxx> wrote:
> >     >
> >     >
> >     > Hello,
> >     >
> >     > For the last 3 days I worked at a customer with an 1800-OSD
> >     > cluster which had to be upgraded from Hammer 0.94.5 to Jewel 10.2.2.
> >     >
> >     > The cluster in this case is 99% RGW, but also some RBD.
> >     >
> >     > I wanted to share some of the things we encountered during this
> >     > upgrade.
> >     >
> >     > All 180 nodes are running CentOS 7.1 on an IPv6-only network.
> >     >
> >     > ** Hammer Upgrade **
> >     > At first we upgraded from 0.94.5 to 0.94.7. This went well,
> >     > except that the monitors got spammed with messages like:
> >     >
> >     >   "Failed to encode map eXXX with expected crc"
> >     >
> >     > Some searching on the list brought me to:
> >     >
> >     >   ceph tell osd.* injectargs -- --clog_to_monitors=false
> >     >
> >     > This reduced the load on the 5 monitors and made recovery
> >     > succeed smoothly.
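> >     >
> >     > To turn clogging back on once the cluster is healthy again
> >     > (clog_to_monitors defaults to true), the inverse should do:
> >     >
> >     >   ceph tell osd.* injectargs -- --clog_to_monitors=true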
> >     >
> >     >  ** Monitors to Jewel **
> >     >  The next step was to upgrade the monitors from Hammer to Jewel.
> >     >
> >     >  Using Salt we upgraded the packages and afterwards it was simple:
> >     >
> >     >    killall ceph-mon
> >     >    chown -R ceph:ceph /var/lib/ceph
> >     >    chown -R ceph:ceph /var/log/ceph
> >     >
> >     > Now, a systemd quirk: 'systemctl start ceph.target' does not
> >     > work; I had to manually enable the monitor and start it:
> >     >
> >     >   systemctl enable ceph-mon@srv-zmb04-05.service
> >     >   systemctl start ceph-mon@srv-zmb04-05.service
> >     >
> >     > Afterwards the monitors were running just fine.
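> >     >
> >     > A quick way to check that all 5 are back in quorum and report
> >     > the new version (standard commands, nothing upgrade-specific):
> >     >
> >     >   ceph quorum_status
> >     >   ceph daemon mon.$(hostname -s) version   # on the monitor itself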
> >     >
> >     > ** OSDs to Jewel **
> >     > To upgrade the OSDs to Jewel we first used Salt to update the
> >     > packages on all systems to 10.2.2; we then used a shell script
> >     > which we ran on one node at a time.
> >     >
> >     > The failure domain here is 'rack', so we executed this in one
> >     > rack, then the next one, and so on.
> >     >
> >     > Script can be found on GitHub:
> >     > https://gist.github.com/wido/06eac901bd42f01ca2f4f1a1d76c49a6
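> >     >
> >     > In rough strokes the per-node steps mirror the monitor upgrade
> >     > (the gist is authoritative, this is just a sketch; <id> is each
> >     > OSD id on the node):
> >     >
> >     >   killall ceph-osd
> >     >   chown -R ceph:ceph /var/lib/ceph
> >     >   chown -R ceph:ceph /var/log/ceph
> >     >   systemctl enable ceph-osd@<id>
> >     >   systemctl start ceph-osd@<id>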
> >     >
> >     > Be aware that the chown can take a long, long, very long time!
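> >     >
> >     > One way to take the sting out of it, assuming the usual one data
> >     > directory per OSD, is to run the chowns in parallel (a sketch):
> >     >
> >     >   cd /var/lib/ceph/osd
> >     >   ls | xargs -P 8 -I{} chown -R ceph:ceph {}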
> >     >
> >     > We ran into an issue where some OSDs crashed after starting, but
> >     > after trying again they would come up. The crash was in:
> >     >
> >     >   "void FileStore::init_temp_collections()"
> >     >
> >     > I reported this in the tracker as I'm not sure what is happening
> >     > here: http://tracker.ceph.com/issues/16672
> >     >
> >     > ** New OSDs with Jewel **
> >     > We also had some new nodes which we wanted to add to the Jewel
> >     > cluster.
> >     >
> >     > Using Salt and ceph-disk we ran into a partprobe issue when
> >     > preparing the disks. There was already a pull request with the
> >     > fix, but it was not included in Jewel 10.2.2.
> >     >
> >     > We manually applied the PR and it fixed our issues:
> >     > https://github.com/ceph/ceph/pull/9330
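> >     >
> >     > For reference, one way to hand-apply such a fix is to patch the
> >     > packaged ceph-disk Python module on each node. A sketch, assuming
> >     > CentOS 7 paths; the patch strip level depends on your install:
> >     >
> >     >   curl -sL https://github.com/ceph/ceph/pull/9330.diff -o /tmp/9330.diff
> >     >   cd /usr/lib/python2.7/site-packages
> >     >   patch -p3 < /tmp/9330.diff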
> >     >
> >     > Hope this helps other people with their upgrades to Jewel!
> >     >
> >     > Wido
> >
> >
> >
> >
> 
> -- 
> Mart van Santen
> Greenhost
> E: mart@xxxxxxxxxxxx
> T: +31 20 4890444
> W: https://greenhost.nl
> 
> A PGP signature may be attached to this e-mail;
> you need PGP software to verify it.
> My public key is available on keyservers;
> see: http://tinyurl.com/openpgp-manual
> 
> PGP Fingerprint: CA85 EB11 2B70 042D AF66  B29A 6437 01A1 10A3 D3A5
> 
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



