On Tue, 13 Jan 2015, Christopher Kunz wrote:
> Hi,
>
> > Okay, it sounds like something is not quite right then. Can you attach
> > the OSDMap once it is in the not-quite-repaired state? And/or try
> > setting 'ceph osd crush tunables optimal' and see if that has any
> > effect?
> >
> Indeed it did - I set ceph osd crush tunables optimal (80% degradation)
> and unplugged one sled. After manually setting the OSDs down and out,
> the cluster degraded to over 80% again and recovered within a couple of
> minutes (I only have 14K objects there).
>
> So I probably set something to a very wrong value, or the constant
> switching between replica size 2 and 3 confused the cluster?
>
> > Cute! That kind of looks like 3 sleds of 7 in one chassis, though? Or am
> > I looking at the wrong thing?
> >
> Yeah, but the "sled" failure domain does not exist in default CRUSH
> maps. It seemed OK-ish to use "chassis" for the PoC. I might write a more
> heavily customized CRUSH map after I figure out what I can productively
> do with the cluster. :)

The types are just names; we put the default ones in there that seemed
like they would be the most common, but we could easily add sled (between
host and chassis?) if that is something that is reasonably common...

> I have one more issue that I'm trying to reproduce right now, but so far
> the "tunables optimal" trick helped tremendously, thanks!

Great!

sage
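
For anyone trying to reproduce the test described above, the sequence
boils down to roughly the following; the OSD IDs are only placeholders
for whatever disks sit in the sled that gets pulled:

    # switch to the optimal CRUSH tunables profile
    # (this triggered roughly 80% degradation on the PoC cluster above)
    ceph osd crush tunables optimal

    # after pulling a sled, mark its OSDs down and out by hand so
    # recovery starts right away instead of waiting for the down-out
    # interval to expire (osd ids 3 and 4 are just examples)
    ceph osd down 3 4
    ceph osd out 3 4

    # watch the cluster re-replicate and settle back to HEALTH_OK
    ceph -w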
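
And if someone wants a "sled" level locally before it shows up in the
defaults, the usual route is to decompile the CRUSH map, add the type,
and recompile; a rough sketch, with made-up bucket names, IDs, and
weights:

    # pull the current map out of the cluster and decompile it
    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt

    # in crushmap.txt, add the new type between host and chassis and
    # renumber everything above it, e.g.:
    #
    #   type 0 osd
    #   type 1 host
    #   type 2 sled        <- new
    #   type 3 chassis
    #   type 4 rack
    #   ...
    #
    # then declare sled buckets and hang the hosts off them, e.g.:
    #
    #   sled sled-a1 {
    #           id -20                  # any unused negative id
    #           alg straw
    #           hash 0  # rjenkins1
    #           item node1 weight 2.000
    #           item node2 weight 2.000
    #   }
    #
    # (the chassis buckets above them then list the sleds as items
    # instead of listing the hosts directly), and if replicas should
    # be spread across sleds, point the rule at the new type:
    #
    #   step chooseleaf firstn 0 type sled

    # recompile and inject the edited map
    crushtool -c crushmap.txt -o crushmap.new
    ceph osd setcrushmap -i crushmap.new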