[re-adding ML, so others may benefit]

On Tue, 7 Mar 2017 13:14:14 -0700 Mike Lovell wrote:

> On Mon, Mar 6, 2017 at 8:18 PM, Christian Balzer <chibi at gol.com> wrote:
>
> > On Mon, 6 Mar 2017 19:57:11 -0700 Mike Lovell wrote:
> >
> > > has anyone on the list done an upgrade from hammer (something later
> > > than 0.94.6) to jewel with a cache tier configured? i tried doing one
> > > last week and had a hiccup with it. i'm curious if others have been
> > > able to successfully do the upgrade and, if so, did they take any
> > > extra steps related to the cache tier?
> > >
> > It would be extremely helpful for everybody involved if you could be a
> > bit more specific than "hiccup".
> >
> the problem we had was that osds in the cache tier were crashing and it
> made the cluster unusable for a while.
> http://tracker.ceph.com/issues/19185 is a tracker issue i made for it.
> i'm guessing not many others have seen the same issue. i'm just wondering
> if others have successfully done an upgrade with an active cache tier and
> how things went.
>
Yeah, I saw that a bit later; it looks like you found/hit a genuine bug.

> > I've upgraded one crappy test cluster from hammer to jewel w/o issues
> > and am about to do that on a more realistic, busier test cluster as
> > well.
> >
I did upgrade that other test cluster, the one that had actual traffic
(to/through the cache) going on during the upgrade, without any issues.

Maybe Kefu Chai can comment on why this is not something seen by everyone.
One thing I can think of is that I didn't change any defaults, in
particular "hit_set_period".

> > OTOH, I have no plans to upgrade my production Hammer cluster with a
> > cache tier at this point.
> >
> interesting. do you not have plans just because you are still testing? or
> is there just no desire or need to upgrade?
>
All of the above.

That (small) cluster is serving 9 compute nodes, and that whole
installation has reached its maximum build-out; it will NOT grow any
further.

Hammer is working fine, and nobody involved is interested in upgrading
things willy-nilly (which would involve the compute nodes at some point as
well) for a service that needs to be as close to 24/7 as possible.

While I would like to eventually replace the old HW one batch at a time
about 3-4 years down the line, and thus would require "current" SW,
migrating everything off that installation and starting fresh is also an
option.

If you do an upgrade of a compute node, you can live-migrate things away
from it first, and if it doesn't pan out, no harm done.

If you run into a "hiccup" with a Ceph upgrade (especially one that
doesn't manifest itself immediately on the first MON/OSD being upgraded),
your whole installation, with (in my case) hundreds of VMs, is dead in the
water, depending on the exact circumstances for a prolonged period.

Not a particularly sunny or career-enhancing prospect.

Christian
-- 
Christian Balzer        Network/Systems Engineer
chibi at gol.com         Global OnLine Japan/Rakuten Communications
http://www.gol.com/
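
P.S.: For anyone wanting to compare notes before attempting such an
upgrade, the hit_set settings on the cache pool can be inspected with
something along these lines (the pool name "cache" here is just a
placeholder for your actual cache pool):

    # check whether the hit_set defaults were changed on the cache pool
    ceph osd pool get cache hit_set_type
    ceph osd pool get cache hit_set_period
    ceph osd pool get cache hit_set_count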
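
P.P.S.: By live-migrating things away I mean the usual libvirt/qemu dance,
roughly something like the following (domain and destination host names
are of course just examples):

    # move the guest off the node about to be upgraded, no downtime
    virsh migrate --live vm01 qemu+ssh://node02/system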