Hi,

> Okay, it sounds like something is not quite right then. Can you attach
> the OSDMap once it is in the not-quite-repaired state? And/or try
> setting 'ceph osd crush tunables optimal' and see if that has any
> effect?

Indeed it did - I set ceph osd crush tunables optimal (which itself
caused about 80% degradation) and unplugged one sled. After manually
setting the OSDs down and out, the cluster degraded to over 80% again
and recovered within a couple of minutes (I only have 14K objects
there). So I had probably set something to a very wrong value, or the
constant switching between replica size 2 and 3 confused the cluster?

> Cute! That kind of looks like 3 sleds of 7 in one chassis though? Or am
> I looking at the wrong thing?

Yeah, but a "sled" failure domain does not exist in the default CRUSH
map, and it seemed OK-ish to use "chassis" for the PoC. I might write a
more heavily customized CRUSH map after I figure out what I can
productively do with the cluster. :)

I have one more issue that I'm trying to reproduce right now, but so
far the "tunables optimal" trick has helped tremendously, thanks!

Regards,
--ck
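
P.S. For anyone who wants to repeat the experiment, this is roughly the
sequence I ran; the OSD IDs below are just placeholders, substitute
whatever lives on the sled you pull:

    # switch to the optimal tunables profile (expect heavy rebalancing)
    ceph osd crush tunables optimal

    # after pulling the sled, mark its OSDs down and out by hand
    ceph osd down 3 4 5
    ceph osd out 3 4 5

    # watch the degradation peak and then recover
    ceph -w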
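
P.P.S. If I do end up adding a real "sled" failure domain instead of
reusing "chassis", the usual route seems to be decompiling the CRUSH
map, adding the type, and compiling it back in. A minimal sketch; the
type number, file names, and rule change are only illustrative, not
something I have tested on this cluster yet:

    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt

    # in crushmap.txt: add "type 2 sled" between host and chassis
    # (renumbering the types above it), declare sled buckets containing
    # the hosts, and point the placement rule at the new type:
    #   step chooseleaf firstn 0 type sled

    crushtool -c crushmap.txt -o crushmap.new
    ceph osd setcrushmap -i crushmap.new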