Fixed all active+remapped PGs stuck forever (but I have no clue why)

On 08/13/2014 11:36 PM, Christian Balzer wrote:
>
> Hello,
>
> On Thu, 14 Aug 2014 03:38:11 +0000 David Moreau Simard wrote:
>
>> Hi,
>>
>> Trying to update my continuous integration environment... same deployment
>> method with the following specs:
>> - Ubuntu Precise, Kernel 3.2, Emperor (0.72.2) - Yields a successful,
>> healthy cluster.
>> - Ubuntu Trusty, Kernel 3.13, Firefly (0.80.5) - I have stuck placement
>> groups.
>>
>> Here's some relevant bits from the Trusty/Firefly setup before I move on
>> to what I've done/tried: http://pastebin.com/eqQTHcxU <-- This was about
>> halfway through PG healing.
>>
>> So, the setup is three monitors, plus two other hosts with 9 OSDs
>> each. At the beginning, all my placement groups were stuck unclean.
>>
> And there's your reason why the Firefly install "failed".
> The default replication is 3 and you have just 2 storage nodes; combined
> with the default CRUSH rules, that's exactly what will happen.
> To avoid this from the start either use 3 nodes or set
> ---
> osd_pool_default_size = 2
> osd_pool_default_min_size = 1
> ---
> in your ceph.conf very early on, before creating anything, especially
> OSDs.
>
> Setting the replication for all your pools to 2 with "ceph osd pool set
> <name> size 2" as the first step after your install should have worked, too.
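
(For reference, the per-pool commands would look something like this, 
with an illustrative pool name; the same pair is repeated for each pool:

	ceph osd pool set rbd size 2
	ceph osd pool set rbd min_size 1

Both changes should take effect immediately, without restarting 
anything.)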

Did something change between Emperor and Firefly such that the OP would 
experience this problem only after upgrading, with no other 
configuration changes?

Your explanation updates my understanding of how the CRUSH algorithm 
works.  Take this osd tree for example:

rack rack0
	host host0
		osd.0
		osd.1
	host host1
		osd.2
		osd.3

I had thought that with size=3, CRUSH would do its best at any 
particular bucket level to spread replicas across distinct failure 
domains, and otherwise try to keep things balanced.

Instead, you seem to be saying that at the 'host' bucket level of the 
CRUSH map, replicas MUST be distributed across size=3 distinct failure 
domains.  In the above osd tree, why does the 'rack' level, with its 
single 'rack0' failure domain, not cause the OP's stuck-PG problem even 
with size=2?  Is that level treated specially for some reason?
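
For what it's worth, my mental model of the stock replicated rule (a 
sketch from memory, not the OP's actual map) is that only the bucket 
type named in the chooseleaf step is constrained at all:

	rule replicated_ruleset {
		ruleset 0
		type replicated
		min_size 1
		max_size 10
		step take default
		step chooseleaf firstn 0 type host
		step emit
	}

If that's right, the rack level isn't special; it's simply never 
mentioned in the rule.  And with only two hosts, 'chooseleaf firstn 0 
type host' can never return more than two OSDs for a size=3 pool, which 
would explain the OP's stuck PGs.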

What if the osd tree looked like this:

rack rack0
	host host0
		osd.0
		osd.1
	host host1
		osd.2
		osd.3
rack rack1
	host host2
		osd.4
		osd.5

Here, I would expect size=2 to always put one replica on each rack. 
With size=3, under my previous understanding, I would have hoped for 
one replica on each host.  Given the changes in Firefly (or the gap 
between my understanding and reality), would size=3 instead result in 
stuck PGs, since at the rack level there are only two failure domains, 
mirroring the OP's problem one level higher?
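
Rather than reasoning it out, I suppose one could simulate the 
placements directly; something along these lines (rule number 
illustrative) should show whether CRUSH can satisfy num-rep for every 
input:

	ceph osd getcrushmap -o /tmp/crush.bin
	crushtool -i /tmp/crush.bin --test --rule 0 --num-rep 3 \
	    --show-bad-mappings

Each line of output, if any, would be an input for which CRUSH returned 
fewer than 3 OSDs.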

If not, would it be a solution for the OP to artificially split the 
OSDs on each node into another level of buckets, such as this 
(disgusting) scheme (a rough map-source sketch follows the tree):

rack rack0
	host host0
		bogus 0
			osd.0
		bogus 1
			osd.1
	host host1
		bogus 2
			osd.2
		bogus 3
			osd.3
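
In CRUSH map source, that would mean declaring a new bucket type and 
wrapping each OSD in it; very roughly (ids and weights hypothetical):

	type 0 osd
	type 1 bogus
	type 2 host
	type 3 rack

	bogus bogus0 {
		id -10
		alg straw
		hash 0	# rjenkins1
		item osd.0 weight 1.000
	}

	host host0 {
		id -2
		alg straw
		hash 0	# rjenkins1
		item bogus0 weight 1.000
		item bogus1 weight 1.000
	}

together with a rule step like 'step chooseleaf firstn 0 type bogus', 
so that replicas land in distinct bogus buckets even when they share a 
host.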

Thanks in advance for comments.  I'm about to reorganize my CRUSH map 
(see 'CRUSH map advice' thread), and need this reality check.

	John


>
> But with all the things you tried, I can't really tell you why things
> behaved the way they did for you.
>
> Christian
>
>> I tried the easy things first:
>> - set crush tunables to optimal
>> - run repairs/scrub on OSDs
>> - restart OSDs
>>
>> Nothing happened. All ~12000 PGs remained stuck unclean since forever,
>> active+remapped. Next, I played with the CRUSH map. I deleted the
>> default replicated_ruleset rule and created a (basic) rule for each pool
>> for the time being. I set the pools to use their respective rule and
>> also reduced their size to 2 and min_size to 1.
>>
>> Still nothing, all PGs stuck.
>> I'm not sure why but I tried setting the crush tunables to legacy - I
>> guess in a trial and error attempt.
>>
>> Half my PGs healed almost immediately. 6082 PGs remained in
>> active+remapped. I tried running scrubs/repairs; they wouldn't heal the
>> other half. I set the tunables back to optimal; still nothing.
>>
>> I set tunables to legacy again and most of them end up healing with only
>> 1335 left in active+remapped.
>>
>> The remainder of the PGs healed when I restarted the OSDs.
>>
>> Does anyone have a clue why this happened?
>> It looks like switching back and forth between tunables fixed the stuck
>> PGs?
>>
>> I can easily reproduce this if anyone wants more info.
>>
>> Let me know !
>> --
>> David Moreau Simard
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users at lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
>
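
P.S. In case anyone wants to reproduce David's tunables dance, the 
switches he describes map to:

	ceph osd crush tunables legacy
	ceph osd crush tunables optimal

and something like 'ceph pg dump_stuck unclean' is a quick way to watch 
the stuck set shrink between flips.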

