Fixed all active+remapped PGs stuck forever (but I have no clue why)

Hello,

On Thu, 14 Aug 2014 01:38:05 -0500 John Morris wrote:

> 
> On 08/13/2014 11:36 PM, Christian Balzer wrote:
> >
> > Hello,
> >
> > On Thu, 14 Aug 2014 03:38:11 +0000 David Moreau Simard wrote:
> >
> >> Hi,
> >>
> >> Trying to update my continuous integration environment.. same
> >> deployment method with the following specs:
> >> - Ubuntu Precise, Kernel 3.2, Emperor (0.72.2) - Yields a successful,
> >> healthy cluster.
> >> - Ubuntu Trusty, Kernel 3.13, Firefly (0.80.5) - I have stuck
> >> placement groups.
> >>
> >> Here's some relevant bits from the Trusty/Firefly setup before I move
> >> on to what I've done/tried: http://pastebin.com/eqQTHcxU <- This was
> >> about halfway through PG healing.
> >>
> >> So, the setup is three monitors, two other hosts on which there are 9
> >> OSDs each. At the beginning, all my placement groups were stuck
> >> unclean.
> >>
> > And there's your reason why the firefly install "failed".
> > The default replication is 3 and you have just 2 storage nodes,
> > combined with the default CRUSH rules that's exactly what will happen.
> > To avoid this from the start either use 3 nodes or set
> > ---
> > osd_pool_default_size = 2
> > osd_pool_default_min_size = 1
> > ---
> > in your ceph.conf very early on, before creating anything, especially
> > OSDs.
> >
> > Setting the replication for all your pools to 2 with "ceph osd pool
> > <name> set size 2" as the first step after your install should have
> > worked, too.
> 
> Did something change between Emperor and Firefly such that the OP would 
> experience this problem only after upgrading, with no other configuration 
> changes?
> 
No, not really.
Well, aside from the happy warning that you're running legacy tunables,
which prompts people (myself included, though that was on a non-production
cluster) to set tunables to optimal and give their hardware a nice workout.
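
For reference, switching profiles is simply:
---
ceph osd crush tunables optimal
ceph osd crush tunables legacy
---
and on a cluster of any real size expect a fair amount of data movement
when you do.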

I took the OP's statement to mean not an upgrade per se, but a fresh install
with either Emperor or Firefly on his test cluster.

> Your explanation updates my understanding of how the CRUSH algorithm 
> works.  Take this osd tree for example:
> 
> rack rack0
> 	host host0
> 		osd.0
> 		osd.1
> 	host host1
> 		osd.2
> 		osd.3
> 
> I had thought that with size=3, CRUSH would do its best at any 
> particular bucket level to distribute replicas across failure 
> domains, and otherwise try to keep things balanced.
> 
> Instead, you seem to say that at the 'host' bucket level of the CRUSH 
> map, distribution MUST be across size=3 failure domains.  

The default (firefly, but previous ones are functionally identical) crush
map has:
---
# rules
rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}
---

The "type host" in that rule means there will be no more than one replica
per host (node), so with size=3 you need at least 3 hosts to choose from.
If you were to change this to "type osd", all 3 replicas could wind up on
the same host, which is not really a good idea.
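
Just to illustrate (a sketch only, not a recommendation), such a rule would
look like this:
---
rule replicated_osd {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type osd
        step emit
}
---
With that, CRUSH only guarantees distinct OSDs, nothing about which hosts
they sit on.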

> In the above osd 
> tree, why does the 'rack' level with the single 'rack0' failure domain 
> not cause the OP's stuck PG problem, even with size=2?  Is that level 
> treated specially for some reason?
> 
Aside from the little detail that the OP has only hosts and osds defined
in his crush map (the default, after all), the rack and all other bucket
types are ONLY taken into consideration when an actual rule calls on them,
or when they are part of the subtree of a bucket such a rule does use (as
with osds sitting below hosts).

> What if the osd tree looked like this:
> 
> rack rack0
> 	host host0
> 		osd.0
> 		osd.1
> 	host host1
> 		osd.2
> 		osd.3
> rack rack1
> 	host host2
> 		osd.4
> 		osd.5
> 
> Here, I would expect size=2 to always put one replica on each rack. 
Nope, see above. 
CRUSH is not clairvoyant; it needs to be told what you want done with each
bucket type, see the sketch below.
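
A rule that actually splits across racks would look something along these
lines (a sketch, assuming the racks sit under the default root in the map):
---
rule replicated_rack {
        ruleset 2
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type rack
        step emit
}
---
With such a rule and size=2 you would indeed get one replica in each rack.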

> With size=3 in my previous understanding, I would have hoped for one 
> replica on each host.  
Yes, with default rules.

> With the changes in firefly (or the difference in 
> my understanding vs. reality), would size=3 instead result in stuck PGs, 
> since at the rack level there are only two failure domains, mirroring 
> the OP's problem but at the next higher level?
> 

Only if you had the racks included in the rules.

> If not, would it be a solution for the OP be to artificially split the 
> OSDs on each node into another level of buckets, such as this 
> (disgusting) scheme:
> 
> rack rack0
> 	host host0
> 		bogus 0
> 			osd.0
> 		bogus 1
> 			osd.1
> 	host host1
> 		bogus 2
> 			osd.2
> 		bogus 3
> 			osd.3
> 

You might finagle something like that (again, the default rule splits on
hosts) by declaring multiple "hosts" on one physical machine, but therein
lies madness.
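
If you really wanted to, that would mean extra "host" buckets in the CRUSH
map, roughly like this (a sketch only, ids and weights made up):
---
host host0-a {
        id -10          # made-up id, must not clash with existing buckets
        alg straw
        hash 0          # rjenkins1
        item osd.0 weight 1.000
}
host host0-b {
        id -11
        alg straw
        hash 0
        item osd.1 weight 1.000
}
---
The default rule would then happily split replicas across host0-a and
host0-b, blissfully unaware that they are the same physical machine.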

> Thanks in advance for comments.  I'm about to reorganize my CRUSH map 
> (see 'CRUSH map advice' thread), and need this reality check.
>
I was going to comment on that thread, probably a bit later.
 
Christian

> 	John
> 
> 
> >
> > But with all the things you tried, I can't really tell you why things
> > behaved they way they did for you.
> >
> > Christian
> >
> >> I tried the easy things first:
> >> - set crush tunables to optimal
> >> - run repairs/scrub on OSDs
> >> - restart OSDs
> >>
> >> Nothing happened. All ~12000 PGs remained stuck unclean since forever
> >> active+remapped. Next, I played with the crush map. I deleted the
> >> default replicated_ruleset rule and created a (basic) rule for each
> >> pool for the time being. I set the pools to use their respective rule
> >> and also reduced their size to 2 and min_size to 1.
> >>
> >> Still nothing, all PGs stuck.
> >> I'm not sure why but I tried setting the crush tunables to legacy - I
> >> guess in a trial and error attempt.
> >>
> >> Half my PGs healed almost immediately. 6082 PGs remained in
> >> active+remapped. I try running scrubs/repairs - it won't heal the
> >> other half. I set the tunables back to optimal, still nothing.
> >>
> >> I set tunables to legacy again and most of them end up healing with
> >> only 1335 left in active+remapped.
> >>
> >> The remainder of the PGs healed when I restarted the OSDs.
> >>
> >> Does anyone have a clue why this happened ?
> >> It looks like switching back and forth between tunables fixed the
> >> stuck PGs ?
> >>
> >> I can easily reproduce this if anyone wants more info.
> >>
> >> Let me know !
> >> --
> >> David Moreau Simard
> >>
> >> _______________________________________________
> >> ceph-users mailing list
> >> ceph-users at lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> >
> >
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


-- 
Christian Balzer        Network/Systems Engineer                
chibi at gol.com   	Global OnLine Japan/Fusion Communications
http://www.gol.com/

