On 08/13/2014 11:36 PM, Christian Balzer wrote:
>
> Hello,
>
> On Thu, 14 Aug 2014 03:38:11 +0000 David Moreau Simard wrote:
>
>> Hi,
>>
>> Trying to update my continuous integration environment.. same deployment
>> method with the following specs:
>> - Ubuntu Precise, Kernel 3.2, Emperor (0.72.2) - Yields a successful,
>>   healthy cluster.
>> - Ubuntu Trusty, Kernel 3.13, Firefly (0.80.5) - I have stuck placement
>>   groups.
>>
>> Here's some relevant bits from the Trusty/Firefly setup before I move on
>> to what I've done/tried: http://pastebin.com/eqQTHcxU <- This was about
>> halfway through PG healing.
>>
>> So, the setup is three monitors, and two other hosts on which there are
>> 9 OSDs each. At the beginning, all my placement groups were stuck
>> unclean.
>>
> And there's your reason why the firefly install "failed".
> The default replication is 3 and you have just 2 storage nodes; combined
> with the default CRUSH rules, that's exactly what will happen.
> To avoid this from the start, either use 3 nodes or set
> ---
> osd_pool_default_size = 2
> osd_pool_default_min_size = 1
> ---
> in your ceph.conf very early on, before creating anything, especially
> OSDs.
>
> Setting the replication for all your pools to 2 with "ceph osd pool <name>
> set size 2" as the first step after your install should have worked, too.

Did something change between Emperor and Firefly such that the OP would
experience this problem only after upgrading, with no other configuration
changes?

Your explanation updates my understanding of how the CRUSH algorithm
works. Take this osd tree for example:

  rack rack0
    host host0
      osd.0
      osd.1
    host host1
      osd.2
      osd.3

I had thought that with size=3, CRUSH would do its best at any particular
level of buckets to distribute replicas across failure domains, and
otherwise try to keep things balanced. Instead, you seem to be saying that
at the 'host' bucket level of the CRUSH map, distribution MUST be across
size=3 distinct failure domains.

In the above osd tree, why does the 'rack' level, with the single 'rack0'
failure domain, not cause the OP's stuck PG problem even with size=2? Is
that level treated specially for some reason?

What if the osd tree looked like this:

  rack rack0
    host host0
      osd.0
      osd.1
    host host1
      osd.2
      osd.3
  rack rack1
    host host2
      osd.4
      osd.5

Here, I would expect size=2 to always put one replica on each rack. With
size=3, under my previous understanding, I would have hoped for one
replica on each host. With the changes in firefly (or the difference
between my understanding and reality), would size=3 instead result in
stuck PGs, since at the rack level there are only two failure domains,
mirroring the OP's problem but at the next higher level?

If not, would a solution for the OP be to artificially split the OSDs on
each node into another level of buckets, such as this (disgusting) scheme:

  rack rack0
    host host0
      bogus 0
        osd.0
      bogus 1
        osd.1
    host host1
      bogus 2
        osd.2
      bogus 3
        osd.3
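For context, here is roughly what I believe the default replicated rule
looks like when decompiled on a Firefly install (reconstructed from
memory, so treat the exact syntax as approximate); if I'm reading it
right, the only bucket type it names as a failure domain is 'host', in
the chooseleaf step:

---
rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}
---

If that is the whole story, the rack level in my first tree is never
consulted as a failure domain at all, which would explain why a single
rack doesn't cause stuck PGs - but I'd rather have confirmation than
keep guessing.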
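Also, before I commit any reorganized map, my plan is to dry-run it
offline with crushtool rather than experiment on the live cluster.
Assuming I have the flags right (file names are just placeholders),
something along these lines should report whether a rule can find enough
distinct failure domains for a given replica count:

---
# dump and decompile the current CRUSH map (read-only)
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt

# edit crushmap.txt, then recompile and test the result offline;
# --show-bad-mappings lists inputs where the rule could not place
# the requested number of replicas
crushtool -c crushmap.txt -o crushmap.new
crushtool -i crushmap.new --test --rule 0 --num-rep 3 --show-bad-mappings
---

If that works the way I expect, a size=3 rule against only two hosts
should show up as bad mappings before the map ever goes live.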
Thanks in advance for comments. I'm about to reorganize my CRUSH map (see
the 'CRUSH map advice' thread), and need this reality check.

John

>
> But with all the things you tried, I can't really tell you why things
> behaved the way they did for you.
>
> Christian
>
>> I tried the easy things first:
>> - set crush tunables to optimal
>> - run repairs/scrubs on the OSDs
>> - restart the OSDs
>>
>> Nothing happened. All ~12000 PGs remained stuck unclean since forever,
>> active+remapped. Next, I played with the crush map. I deleted the
>> default replicated_ruleset rule and created a (basic) rule for each pool
>> for the time being. I set the pools to use their respective rule and
>> also reduced their size to 2 and min_size to 1.
>>
>> Still nothing, all PGs stuck.
>> I'm not sure why, but I tried setting the crush tunables to legacy - I
>> guess in a trial and error attempt.
>>
>> Half my PGs healed almost immediately. 6082 PGs remained in
>> active+remapped. I tried running scrubs/repairs - it wouldn't heal the
>> other half. I set the tunables back to optimal, still nothing.
>>
>> I set tunables to legacy again and most of them ended up healing, with
>> only 1335 left in active+remapped.
>>
>> The remainder of the PGs healed when I restarted the OSDs.
>>
>> Does anyone have a clue why this happened?
>> It looks like switching back and forth between tunables fixed the stuck
>> PGs?
>>
>> I can easily reproduce this if anyone wants more info.
>>
>> Let me know!
>> --
>> David Moreau Simard
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users at lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>