Re: [EXTERNAL] Big problems encountered during upgrade from hammer 0.94.5 to jewel 10.2.3


 



Hi Vincent,

When I did a similar upgrade I found that having mixed-version OSDs caused issues much like yours. My advice is to power through the upgrade as fast as possible. I'm pretty sure this is related to an issue/bug discussed on this list previously about excessive load on the monitors in mixed-version environments.
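For reference, a rough per-node sequence along these lines keeps the mixed-version window short (just a sketch, assuming CentOS 7 with yum and the systemd units from the jewel packaging; adjust package and unit names for your distro):

    ceph osd set noout                    # once, before touching the first node
    # then on each OSD node, one node at a time:
    yum update -y ceph                    # pull in the jewel packages
    systemctl restart ceph-osd.target     # restart every OSD on that node
    ceph -s                               # wait for all PGs active+clean before moving on
    ceph osd unset noout                  # only after the last node is done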

I found that restarting a monitor after taking an OSD node down helped the cluster identify down OSDs faster, reducing the amount of blocked I/O reported by clients.
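In command form that's roughly the following (a sketch only; the mon unit name assumes the jewel systemd packaging, and osd.12 is just a placeholder id):

    # on one monitor host, after stopping the node's OSDs:
    systemctl restart ceph-mon@$(hostname -s)
    # then watch whether the stopped OSDs get marked down:
    ceph osd tree | grep down
    # a stopped OSD that is still shown as up can also be marked down by hand:
    ceph osd down 12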


> On Nov 13, 2016, at 12:29 PM, Vincent Godin <vince.mlist@xxxxxxxxx> wrote:
> 
> After a test in a non-production environment, we decided to upgrade our running cluster to jewel 10.2.3. Our cluster has 3 monitors and 8 nodes of 20 disks each. The cluster is on hammer 0.94.5 with tunables set to "bobtail".
> As the cluster is in production and it wasn't possible to upgrade the Ceph clients at the same time, we decided to keep the tunables at bobtail.
> First step was to upgrade the three monitors: no problem.
> Second step: set noout on the cluster and then upgrade the first node. As soon as we stopped the OSDs on the first node, the cluster went into error with a lot of PGs peering. We lost a lot of disks on the VMs hosted by the Ceph clients. A lot of OSDs went flapping (down then up) for hours.
> So we decided to stop all the VMs, and with them all I/O on the cluster, to let it stabilize; that took about 3 hours. With no I/O on the cluster, we managed to upgrade 4 of the 8 nodes.
> 
> At this point we have pools that are spread only over these 4 nodes, which are now on jewel. But even now, if we stop an OSD on one of these 4 nodes, PGs go peering, the cluster goes into error status, and it is then not fit to serve production.
> Is this behaviour caused by the mix of jewel and hammer nodes?
> 
> We will upgrade the last 4 nodes next weekend, so all the OSD nodes will be on jewel. Do we have to wait for the Ceph clients to be upgraded to jewel before we get a stable cluster back? Do we have to wait until the tunables are set to optimal?
> 
> I saw in the release notes that an upgrade from hammer to jewel could be done without downtime... I know there is no guarantee, but for now we still have an unstable cluster and are praying not to lose an OSD before the last step of the upgrade.
> 
> If you have some advice, I'll take it :)
> 
> Vincent
> 
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


