Hello,

On Fri, 15 Jul 2016 10:48:40 +0200 Mart van Santen wrote:

>
> Hi Wido,
>
> Thank you, we are currently in the same process so this information is
> very useful. Can you share why you upgraded from hammer directly to
> jewel, is there a reason to skip infernalis? So, I wonder why you didn't
> do a hammer->infernalis->jewel upgrade, as that seems the logical path
> for me.
>

Hammer and Jewel are long term (for various definitions of long term)
stable releases. So an upgrade from Hammer to Jewel is the logical thing
and the path most people who don't care about bleeding edge on their
production clusters will take.

Infernalis stopped receiving any updates/bugfixes the moment Jewel was
released, so when going via Infernalis you might theoretically be
upgrading into something that has known and unfixed bugs.
And then there's the whole thing of restarting all your MONs and OSDs
yet again, with all the potential fun that can entail (as well as likely
being forced to do this during late night/weekend maintenance windows).

From where I'm standing, upgrading to the latest Hammer and then to Jewel
is already one step too many, no need to add another one.

Lastly, given all the outstanding issues with 0.94.7 AND the latest Jewel,
I'm going to sit on the sidelines some more, especially since my staging
cluster HW just arrived.

Christian

> (we did indeed see the same errors "Failed to encode map eXXX with
> expected crc" when upgrading to the latest hammer)
>
> Regards,
>
> Mart
>
> On 07/15/2016 03:08 AM, 席智勇 wrote:
> > good job, thank you for sharing, Wido~
> > it's very useful~
> >
> > 2016-07-14 14:33 GMT+08:00 Wido den Hollander <wido@xxxxxxxx>:
> >
> > To add, the RGWs upgraded just fine as well.
> >
> > No regions in use here (yet!), so that upgraded as it should.
> >
> > Wido
> >
> > > Op 13 juli 2016 om 16:56 schreef Wido den Hollander <wido@xxxxxxxx>:
> > >
> > >
> > > Hello,
> > >
> > > The last 3 days I worked at a customer with an 1800 OSD cluster
> > > which had to be upgraded from Hammer 0.94.5 to Jewel 10.2.2.
> > >
> > > The cluster in this case is 99% RGW, but also some RBD.
> > >
> > > I wanted to share some of the things we encountered during this
> > > upgrade.
> > >
> > > All 180 nodes are running CentOS 7.1 on an IPv6-only network.
> > >
> > > ** Hammer Upgrade **
> > > At first we upgraded from 0.94.5 to 0.94.7. This went well except
> > > for the fact that the monitors got spammed with these kinds of
> > > messages:
> > >
> > > "Failed to encode map eXXX with expected crc"
> > >
> > > Some searching on the list brought me to:
> > >
> > > ceph tell osd.* injectargs -- --clog_to_monitors=false
> > >
> > > This reduced the load on the 5 monitors and made recovery succeed
> > > smoothly.
> > >
> > > ** Monitors to Jewel **
> > > The next step was to upgrade the monitors from Hammer to Jewel.
> > >
> > > Using Salt we upgraded the packages, and afterwards it was simple:
> > >
> > > killall ceph-mon
> > > chown -R ceph:ceph /var/lib/ceph
> > > chown -R ceph:ceph /var/log/ceph
> > >
> > > Now, a systemd quirk: 'systemctl start ceph.target' does not work,
> > > so I had to manually enable the monitor and start it:
> > >
> > > systemctl enable ceph-mon@srv-zmb04-05.service
> > > systemctl start ceph-mon@srv-zmb04-05.service
> > >
> > > Afterwards the monitors were running just fine.
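
(For anyone scripting the monitor upgrade above, it boils down to
something like this rough per-host sketch. Deriving the mon ID from the
short hostname is an assumption on my part; use whatever ID your monitor
was created with.)

    #!/bin/sh
    # Sketch of the per-host Hammer -> Jewel monitor steps described above.
    # Assumes the mon ID equals the short hostname; adjust if yours differs.
    set -e
    MON_ID="$(hostname -s)"

    # Stop the old (Hammer) monitor, which still runs as root.
    killall ceph-mon

    # Jewel daemons run as the 'ceph' user, so fix ownership first.
    chown -R ceph:ceph /var/lib/ceph
    chown -R ceph:ceph /var/log/ceph

    # 'systemctl start ceph.target' did not bring the mon back up,
    # so enable and start the per-mon unit explicitly.
    systemctl enable "ceph-mon@${MON_ID}.service"
    systemctl start "ceph-mon@${MON_ID}.service"
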
> > >
> > > ** OSDs to Jewel **
> > > To upgrade the OSDs to Jewel we initially used Salt to update the
> > > packages on all systems to 10.2.2; we then used a shell script
> > > which we ran on one node at a time.
> > >
> > > The failure domain here is 'rack', so we executed this in one
> > > rack, then the next one, and so on.
> > >
> > > The script can be found on GitHub:
> > > https://gist.github.com/wido/06eac901bd42f01ca2f4f1a1d76c49a6
> > >
> > > Be aware that the chown can take a long, long, very long time!
> > >
> > > We ran into the issue that some OSDs crashed after starting, but
> > > after trying again they would start.
> > >
> > > "void FileStore::init_temp_collections()"
> > >
> > > I reported this in the tracker as I'm not sure what is happening
> > > here: http://tracker.ceph.com/issues/16672
> > >
> > > ** New OSDs with Jewel **
> > > We also had some new nodes which we wanted to add to the Jewel
> > > cluster.
> > >
> > > Using Salt and ceph-disk we ran into a partprobe issue in
> > > ceph-disk. There was already a Pull Request for the fix, but that
> > > was not included in Jewel 10.2.2.
> > >
> > > We manually applied the PR and it fixed our issues:
> > > https://github.com/ceph/ceph/pull/9330
> > >
> > > Hope this helps other people with their upgrades to Jewel!
> > >
> > > Wido

--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com