Re: Lessons learned upgrading Hammer -> Jewel

Thanks for sharing Wido.

In your write-up you only talk about the MON and OSD nodes. What about the
RGW nodes? You stated at the beginning that the cluster is 99% RGW...

On Wed, Jul 13, 2016 at 3:56 PM, Wido den Hollander <wido@xxxxxxxx> wrote:
> Hello,
>
> For the last 3 days I worked at a customer with an 1800-OSD cluster which had to be upgraded from Hammer 0.94.5 to Jewel 10.2.2.
>
> The cluster in this case is 99% RGW, with some RBD as well.
>
> I wanted to share some of the things we encountered during this upgrade.
>
> All 180 nodes are running CentOS 7.1 on an IPv6-only network.
>
> ** Hammer Upgrade **
> At first we upgraded from 0.94.5 to 0.94.7. This went well, except that the monitors got spammed with messages like this:
>
>   "Failed to encode map eXXX with expected crc"
>
> Some searching on the list brought me to:
>
>   ceph tell osd.* injectargs -- --clog_to_monitors=false
>
>  This reduced the load on the 5 monitors and recovery proceeded smoothly.
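>
> To keep this across OSD restarts during the rest of the upgrade, the same setting can also go into ceph.conf (just a sketch; drop it again once you no longer need it):
>
>   [osd]
>       clog_to_monitors = false
>
> And to turn the logging back on afterwards:
>
>   ceph tell osd.* injectargs -- --clog_to_monitors=true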
>
>  ** Monitors to Jewel **
>  The next step was to upgrade the monitors from Hammer to Jewel.
>
>  Using Salt we upgraded the packages and afterwards it was simple:
>
>    killall ceph-mon
>    chown -R ceph:ceph /var/lib/ceph
>    chown -R ceph:ceph /var/log/ceph
>
> Now, a systemd quirk: 'systemctl start ceph.target' does not work; I had to manually enable the monitor unit and start it:
>
>   systemctl enable ceph-mon@srv-zmb04-05.service
>   systemctl start ceph-mon@srv-zmb04-05.service
>
> Afterwards the monitors were running just fine.
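>
> So the whole per-monitor sequence boiled down to roughly this (a sketch; the mon id here is the short hostname, adjust for your setup, and the package upgrade itself was done via Salt):
>
>   killall ceph-mon
>   chown -R ceph:ceph /var/lib/ceph
>   chown -R ceph:ceph /var/log/ceph
>   systemctl enable ceph-mon@$(hostname -s).service
>   systemctl start ceph-mon@$(hostname -s).service
>   ceph -s    # wait for the mon to rejoin quorum before doing the next one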
>
> ** OSDs to Jewel **
> To upgrade the OSDs to Jewel we first used Salt to update the packages on all systems to 10.2.2, and then ran a shell script on one node at a time.
>
> The failure domain here is 'rack', so we executed this in one rack, then the next, and so on.
>
> The script can be found on GitHub: https://gist.github.com/wido/06eac901bd42f01ca2f4f1a1d76c49a6
>
> Be aware that the chown can take a long, long, very long time!
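>
> The general approach per node is roughly this (a simplified sketch with the usual noout dance; see the Gist above for the actual script):
>
>   ceph osd set noout                 # keep CRUSH from rebalancing while the node is down
>   killall ceph-osd
>   chown -R ceph:ceph /var/lib/ceph   # this is the part that can take very long
>   chown -R ceph:ceph /var/log/ceph
>   systemctl start ceph-osd.target    # or enable/start the individual ceph-osd@<id> units
>   # wait until all PGs are active+clean again, then:
>   ceph osd unset noout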
>
> We ran into an issue where some OSDs crashed right after starting, but they would come up after trying again. The crash pointed at:
>
>   "void FileStore::init_temp_collections()"
>
> I reported this in the tracker as I'm not sure what is happening here: http://tracker.ceph.com/issues/16672
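>
> Since a second attempt was always enough, a dumb retry loop like this gets them all up (just a sketch, assuming the default /var/lib/ceph/osd/ceph-<id> layout):
>
>   for id in $(ls /var/lib/ceph/osd | sed 's/^ceph-//'); do
>       systemctl start ceph-osd@$id
>       sleep 10
>       systemctl is-active --quiet ceph-osd@$id || systemctl start ceph-osd@$id
>   done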
>
> ** New OSDs with Jewel **
> We also had some new nodes which we wanted to add to the Jewel cluster.
>
> Using Salt and ceph-disk to deploy them, we ran into a partprobe issue with ceph-disk. There was already a pull request with the fix, but it was not included in Jewel 10.2.2.
>
> We manually applied the PR and it fixed our issues: https://github.com/ceph/ceph/pull/9330
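>
> With the patched ceph-disk, preparing a new OSD was then just the usual (device names are only examples):
>
>   ceph-disk prepare /dev/sdb /dev/sdc    # data disk and journal device
>   ceph-disk activate /dev/sdb1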
>
> Hope this helps other people with their upgrades to Jewel!
>
> Wido
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com