inconsistent pgs

> 
> What was the exact sequence of events?
> 

The exact sequence of events was (the commands involved are roughly sketched after the list):
. set the retiring node's OSDs out
. noticed that the MDSs were now stuck in the 'rejoining' state
. messed around with restarting the MDSs but couldn't fix it
. a google search turned up reports that upgrading ceph had resolved the same problem for others
. upgraded all binaries (apt-get install ...)
. restarted all mons
. (noticed that apt-get had grabbed firefly from Jessie instead of dumpling from ceph.com - I thought I might just be grabbing a bugfix release of dumpling)
. restarted all osds
. restarted all mds's
. the MDSs came good and the cluster was healthy again (and still moving PGs off the retiring node), but I was now getting a warning about legacy tunables
. read the release notes for instructions on what the tunables message meant. I am running kernel 3.14 but not using the kernel rbd driver, so I assumed that would be okay (is that correct?). Set tunables to optimal
. got alerts that the cluster was degraded with OSDs down
. messed around restarting OSDs until I found that the cluster remained stable with the OSDs on the retiring node stopped - starting either of the 2 OSDs on that node resulted in the cascade of crashing OSDs
. on a whim, set the tunables back to legacy and the cluster became stable again. The PGs all migrated off the retiring node and I removed it from the cluster
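
Roughly, the commands involved were along these lines - from memory, so the osd ids here are placeholders rather than the exact ones on my cluster, and the restarts used the stock sysvinit script:

  # mark the retiring node's OSDs out
  ceph osd out 6
  ceph osd out 7

  # upgrade binaries (this is the step where firefly came from Jessie
  # instead of dumpling from ceph.com)
  apt-get update && apt-get install ceph ceph-mds

  # restart daemons: mons first, then osds, then mds's
  service ceph restart mon
  service ceph restart osd
  service ceph restart mds

  # the tunables change that preceded the OSD crashes
  ceph osd crush tunables optimal

  # and later, to get things stable again
  ceph osd crush tunables legacy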

It was getting late by then so things got a bit hazy towards the end, but I'm pretty sure that's how it all went down. The fact that my MDSs got stuck after setting just one node out makes me think something else is at work here, and that it is only indirectly that legacy=stable and optimal=crashy. I can't see what that something would be though - everything had been working great up until that point. I haven't touched the tunables since then, so I still get the constant warning.
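
For reference, the tunables currently in effect can be checked with something like the following (and, as I understand it, firefly also has a 'mon warn on legacy crush tunables' option to silence the warning if staying on legacy for a while):

  # show the crush tunables currently in effect
  ceph osd crush show-tunables

  # the same values are visible in the full crush dump
  ceph osd crush dump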

I'd kind of prefer to stick with the debs from ceph.com - I hadn't noticed that ceph was included in Jessie until it was too late, and qemu now depends on the Jessie packages, so I guess I'm stuck with the Debian repo versions anyway...
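
If I do end up wanting to prefer the ceph.com debs again, I believe an apt pin along these lines would do it (untested on my setup, and the 'origin' value may need to match the exact repo hostname):

  cat > /etc/apt/preferences.d/ceph.pref <<'EOF'
  Package: *
  Pin: origin ceph.com
  Pin-Priority: 1001
  EOF
  apt-get update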

thanks

James

> Were you rebalancing when you did the upgrade? Did the marked out OSDs get upgraded?
> Did you restart all the monitors prior to changing the tunables? (Are
> you *sure*?)
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
> 
> 
> On Sat, Jul 5, 2014 at 10:31 PM, James Harper <james at ejbdigital.com.au>
> wrote:
> >>
> >> I have 4 physical boxes each running 2 OSD's. I needed to retire one so I
> >> set the 2 OSD's on it to 'out' and everything went as expected. Then I
> >> noticed that 'ceph health' was reporting that my crush map had legacy
> >> tunables. The release notes told me I needed to do 'ceph osd crush
> >> tunables optimal' to fix this, and I wasn't running any old kernel
> >> clients, so I made it so. Shortly after that, my OSD's started dying
> >> until only one remained. I eventually figured out that they would stay
> >> up until I started the OSD's on the 'out' node. I hadn't made the
> >> connection to the tunables until I turned up an old mailing list post,
> >> but sure enough setting the tunables back to legacy got everything
> >> stable again. I assume that the churn introduced by 'optimal' resulted
> >> in the situation where the 'out' node stored the only copy of some data,
> >> because there were down pgs until I got all the OSD's running again
> >>
> >
> > Forgot to add, on the 'out' node, the following would be logged in the
> > osd logfile:
> >
> > 7f5688e59700 -1 osd/PG.cc: In function 'void PG::fulfill_info(pg_shard_t, const pg_query_t&, std::pair<pg_shard_t, pg_info_t>&)' thread 7f5688e59700 time 2014-07-05 21:47:51.595687
> > osd/PG.cc: 4424: FAILED assert(from == primary)
> >
> > and in the others when they crashed:
> >
> > 7fdcb9600700 -1 osd/PG.cc: In function 'PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine>::my_context)' thread 7fdcb9600700 time 2014-07-05 21:14:57.260547
> > osd/PG.cc: 5307: FAILED assert(0 == "we got a bad state machine event")
> > (sometimes that would appear in the 'out' node too).
> >
> > Even after the rebalance is complete and the old node is completely
> > retired, with one node down and 2 still running (as a test), I get a very
> > small number (0.006%) of "unfound" pg's. This is a bit of a worry...
> >
> > James
> >
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users at lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

