David Moreau Simard <dmsimard at ...> writes:

> Hi,
>
> Trying to update my continuous integration environment... same deployment method with the following specs:
> - Ubuntu Precise, Kernel 3.2, Emperor (0.72.2) - Yields a successful, healthy cluster.
> - Ubuntu Trusty, Kernel 3.13, Firefly (0.80.5) - I have stuck placement groups.
>
> Here's some relevant bits from the Trusty/Firefly setup before I move on to what I've done/tried:
> http://pastebin.com/eqQTHcxU <-- This was about halfway through PG healing.
>
> So, the setup is three monitors plus two other hosts with 9 OSDs each.
> At the beginning, all my placement groups were stuck unclean.
>
> I tried the easy things first:
> - set crush tunables to optimal
> - run repairs/scrubs on the OSDs
> - restart the OSDs
>
> Nothing happened. All ~12000 PGs remained stuck unclean since forever, in active+remapped.
> Next, I played with the crush map. I deleted the default replicated_ruleset rule and created a (basic) rule
> for each pool for the time being.
> I set the pools to use their respective rule and also reduced their size to 2 and min_size to 1.
>
> Still nothing, all PGs stuck.
> I'm not sure why, but I tried setting the crush tunables to legacy - I guess in a trial and error attempt.
>
> Half my PGs healed almost immediately. 6082 PGs remained in active+remapped.
> I tried running scrubs/repairs - they wouldn't heal the other half. I set the tunables back to optimal; still nothing.
>
> I set the tunables to legacy again and most of them healed, with only 1335 left in active+remapped.
>
> The remainder of the PGs healed when I restarted the OSDs.
>
> Does anyone have a clue why this happened?
> It looks like switching back and forth between tunables fixed the stuck PGs?
>
> I can easily reproduce this if anyone wants more info.
>
> Let me know!
> --
> David Moreau Simard

I recently encountered the exact same problem. I have been working on a new
cloud deployment, using Vagrant to simulate the physical hosts. I have 4 hosts,
each acting as both a mon and an OSD for testing purposes.

System details:
Ubuntu Trusty (14.04)
Kernel 3.13
Firefly 0.80.5

On deployment of a new cluster, all of my PGs were stuck (HEALTH_WARN 320 pgs
incomplete; 320 pgs stuck inactive; 320 pgs stuck unclean). I tried a ton of
recommended procedures for getting them working and nothing could get them to
budge.

I ran `ceph osd crush tunables legacy` and all 320 PGs went from stuck to
active. This is definitely repeatable, as I can deploy a new cluster with
vagrant/puppet and this happens every time.

So, thank you for posting this work-around.

Peter
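
For anyone hitting the same symptom, something along these lines should show
the stuck PGs and which tunables profile is currently in effect (just a rough
sketch; pool names and PG counts will of course differ per cluster):

    ceph -s                        # overall cluster status
    ceph health detail             # lists the individual stuck/incomplete PGs
    ceph pg dump_stuck unclean     # only the PGs stuck unclean
    ceph osd crush show-tunables   # the CRUSH tunables currently in effect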
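
The work-around itself is just the tunables toggle, roughly like this (again
only a sketch, not a fix for the underlying issue):

    ceph osd crush tunables legacy   # switch to the legacy CRUSH tunables profile
    ceph -w                          # watch the PG states until they settle
    # In David's case the last PGs only cleared after restarting the OSDs,
    # e.g. with upstart on Trusty (adjust the id / init system as needed):
    sudo restart ceph-osd id=0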