Ah, I was afraid it would be related to the number of replicas versus the number of host buckets. Makes sense. I was unable to reproduce the issue with three hosts and one OSD on each host. Thanks.
--
David Moreau Simard

On Aug 14, 2014, at 12:36 AM, Christian Balzer <chibi at gol.com> wrote:

> Hello,
>
> On Thu, 14 Aug 2014 03:38:11 +0000 David Moreau Simard wrote:
>
>> Hi,
>>
>> Trying to update my continuous integration environment... same
>> deployment method with the following specs:
>>
>> - Ubuntu Precise, Kernel 3.2, Emperor (0.72.2) - Yields a successful,
>>   healthy cluster.
>> - Ubuntu Trusty, Kernel 3.13, Firefly (0.80.5) - I have stuck placement
>>   groups.
>>
>> Here are some relevant bits from the Trusty/Firefly setup before I move
>> on to what I've done/tried:
>> http://pastebin.com/eqQTHcxU <- This was about halfway through PG healing.
>>
>> So, the setup is three monitors and two other hosts with 9 OSDs each.
>> At the beginning, all my placement groups were stuck unclean.
>>
> And there's your reason why the firefly install "failed".
> The default replication is 3 and you have just 2 storage nodes; combined
> with the default CRUSH rules, that's exactly what will happen.
>
> To avoid this from the start, either use 3 nodes or set
> ---
> osd_pool_default_size = 2
> osd_pool_default_min_size = 1
> ---
> in your ceph.conf very early on, before creating anything, especially OSDs.
>
> Setting the replication for all your pools to 2 with
> "ceph osd pool set <name> size 2" as the first step after your install
> should have worked, too.
>
> But with all the things you tried, I can't really tell you why things
> behaved the way they did for you.
>
> Christian
>
>> I tried the easy things first:
>> - set crush tunables to optimal
>> - run repairs/scrub on OSDs
>> - restart OSDs
>>
>> Nothing happened. All ~12000 PGs remained stuck unclean since forever,
>> active+remapped.
>>
>> Next, I played with the crush map. I deleted the default
>> replicated_ruleset rule and created a (basic) rule for each pool for the
>> time being. I set the pools to use their respective rule and also
>> reduced their size to 2 and min_size to 1. Still nothing, all PGs stuck.
>>
>> I'm not sure why, but I tried setting the crush tunables to legacy - I
>> guess in a trial-and-error attempt. Half my PGs healed almost
>> immediately; 6082 PGs remained in active+remapped. I tried running
>> scrubs/repairs - they wouldn't heal the other half. I set the tunables
>> back to optimal, still nothing. I set tunables to legacy again and most
>> of them healed, with only 1335 left in active+remapped. The remainder
>> of the PGs healed when I restarted the OSDs.
>>
>> Does anyone have a clue why this happened? It looks like switching back
>> and forth between tunables fixed the stuck PGs? I can easily reproduce
>> this if anyone wants more info. Let me know!
>> --
>> David Moreau Simard
>
> --
> Christian Balzer        Network/Systems Engineer
> chibi at gol.com         Global OnLine Japan/Fusion Communications
> http://www.gol.com/
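
For anyone else hitting this on a small test cluster, here is a rough sketch of the configuration and commands discussed in this thread. The pool name "rbd" below is only an example - substitute your own pool names and repeat the per-pool commands for each pool.

---
# In ceph.conf on a fresh install, before any OSDs or pools are created,
# so that new pools default to 2 replicas instead of 3:
[global]
osd pool default size = 2
osd pool default min size = 1

# On an already-deployed cluster, per pool ("rbd" is just an example):
ceph osd pool set rbd size 2
ceph osd pool set rbd min_size 1

# The tunables switch mentioned above:
ceph osd crush tunables legacy      # or: ceph osd crush tunables optimal

# Checking whether PGs are recovering:
ceph -s
ceph pg dump_stuck unclean
---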