Cool, glad that worked. You'll have to read backwards in the ML to find this
discussed -- though it is rarely needed, and therefore rarely discussed. For
the code, it's used in src/crush/mapper.c.

Most clusters, irrespective of size, work with 50 tries. Clusters that need
more than 50 tries usually have some irregularity in their CRUSH tree -- in
your case it's 2 big hosts and 2 small hosts. For a 3-replica PG, the CRUSH
algorithm makes random tries to find 3 unique OSDs that satisfy the CRUSH
rule. But sometimes 50 tries isn't enough... it just needs a few more to find
that elusive 3rd replica. The signature of this issue is exactly what you
saw -- a 3-replica pool, but a PG stuck with only 2 up/acting OSDs.

Regards,
Dan

P.S. Looking again at your osd tree -- I wonder if you've already realized
that a 3-replica (host-wise) pool is going to be limited to < 0.8TB usable
space. With only four hosts, every PG has to place at least one of its three
replicas on one of the two 0.39999 hosts, so those two hosts will fill up
well before the two larger hosts are full.

On Tue, Jul 26, 2016 at 1:55 PM, Kostis Fardelas <dante1234@xxxxxxxxx> wrote:
> Hello Dan,
> I increased choose_total_tries to 75 and the misplaced objects dropped to
> 286. One more increase to 100 brought that down to 141 misplaced objects,
> and one more to 125 let the cluster fully recover! I also verified that I
> can now down + out an OSD and the cluster still fully recovers.
>
> My problem is that this setting would never have crossed my mind. Even in
> the docs, all that is written for choose_total_tries is that "For extremely
> large clusters, a larger value might be necessary.", but my cluster, with
> 16 OSDs and 40T at 13% utilization, can hardly be considered such a cluster
> (an extremely large one). I also wonder what the value should be when I
> apply the tunables to my largest clusters, with over 150 OSDs and hundreds
> of TB...
>
> I would be grateful if you could point me to some code or documentation
> (for this tunable and the others too) that would have made me "see" the
> problem earlier and make a plan for the future.
>
> Kostis
>
>
> On 26 July 2016 at 12:42, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>> Hi,
>>
>> Starting from the beginning...
>>
>> If a 3-replica PG gets stuck with only 2 replicas after changing
>> tunables, it's probably a case where choose_total_tries is too low for
>> your cluster configuration.
>> Try increasing choose_total_tries from 50 to 75.
>>
>> -- Dan
>>
>>
>>
>> On Fri, Jul 22, 2016 at 4:17 PM, Kostis Fardelas <dante1234@xxxxxxxxx> wrote:
>>> Hello,
>>> running the latest Hammer, I think I have hit a bug with tunables more
>>> recent than the legacy ones.
>>>
>>> Having been on legacy tunables for a while, I decided to experiment with
>>> "better" tunables. So first I went from the argonaut profile to bobtail
>>> and then to firefly. However, I decided to make the changes to
>>> chooseleaf_vary_r incrementally (because the remapping from 0 to 5 was
>>> huge), from 5 down to the best value (1). So when I reached
>>> chooseleaf_vary_r = 2, I decided to run a simple test before going to
>>> chooseleaf_vary_r = 1: stop an OSD (osd.14) and let the cluster recover.
>>> But the recovery never completes and a PG remains stuck, reported as
>>> undersized+degraded. No OSD is near full and all pools have min_size=1.
>>>
>>> ceph osd crush show-tunables -f json-pretty
>>>
>>> {
>>>     "choose_local_tries": 0,
>>>     "choose_local_fallback_tries": 0,
>>>     "choose_total_tries": 50,
>>>     "chooseleaf_descend_once": 1,
>>>     "chooseleaf_vary_r": 2,
>>>     "straw_calc_version": 1,
>>>     "allowed_bucket_algs": 22,
>>>     "profile": "unknown",
>>>     "optimal_tunables": 0,
>>>     "legacy_tunables": 0,
>>>     "require_feature_tunables": 1,
>>>     "require_feature_tunables2": 1,
>>>     "require_feature_tunables3": 1,
>>>     "has_v2_rules": 0,
>>>     "has_v3_rules": 0,
>>>     "has_v4_buckets": 0
>>> }
>>>
>>> The really strange thing is that the OSDs of the stuck PG belong to
>>> nodes other than the one I decided to stop (osd.14).
>>>
>>> # ceph pg dump_stuck
>>> ok
>>> pg_stat  state                       up     up_primary  acting  acting_primary
>>> 179.38   active+undersized+degraded  [2,8]  2           [2,8]   2
>>>
>>> ID WEIGHT   TYPE NAME                   UP/DOWN REWEIGHT PRIMARY-AFFINITY
>>> -1 11.19995 root default
>>> -3 11.19995     rack unknownrack
>>> -2  0.39999         host staging-rd0-03
>>> 14  0.20000             osd.14               up  1.00000          1.00000
>>> 15  0.20000             osd.15               up  1.00000          1.00000
>>> -8  5.19998         host staging-rd0-01
>>>  6  0.59999             osd.6                up  1.00000          1.00000
>>>  7  0.59999             osd.7                up  1.00000          1.00000
>>>  8  1.00000             osd.8                up  1.00000          1.00000
>>>  9  1.00000             osd.9                up  1.00000          1.00000
>>> 10  1.00000             osd.10               up  1.00000          1.00000
>>> 11  1.00000             osd.11               up  1.00000          1.00000
>>> -7  5.19998         host staging-rd0-00
>>>  0  0.59999             osd.0                up  1.00000          1.00000
>>>  1  0.59999             osd.1                up  1.00000          1.00000
>>>  2  1.00000             osd.2                up  1.00000          1.00000
>>>  3  1.00000             osd.3                up  1.00000          1.00000
>>>  4  1.00000             osd.4                up  1.00000          1.00000
>>>  5  1.00000             osd.5                up  1.00000          1.00000
>>> -4  0.39999         host staging-rd0-02
>>> 12  0.20000             osd.12               up  1.00000          1.00000
>>> 13  0.20000             osd.13               up  1.00000          1.00000
>>>
>>> Have you experienced something similar?
>>>
>>> Regards,
>>> Kostis
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
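
For reference, the fix discussed in this thread boils down to editing a single
tunable in the CRUSH map. A minimal sketch, assuming Hammer-era tooling; the
file names crushmap.bin / crushmap.txt / crushmap.new are just placeholders:

# ceph osd getcrushmap -o crushmap.bin
# crushtool -d crushmap.bin -o crushmap.txt
    (edit crushmap.txt: change "tunable choose_total_tries 50" to, e.g., 75)
# crushtool -c crushmap.txt -o crushmap.new
# ceph osd setcrushmap -i crushmap.new

crushtool also has a --set-choose-total-tries N option that can apply the
same change directly to a compiled map, if the installed version supports it.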
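
As for "seeing" the problem before applying a map: crushtool can replay a
rule against a map offline and report the inputs it fails to place. A sketch,
under the assumption that the affected pool uses rule 0 (adjust the rule id
and replica count to match the pool):

# crushtool -i crushmap.new --test --rule 0 --num-rep 3 --show-bad-mappings

Empty output means every simulated input was mapped to 3 OSDs; any "bad
mapping" lines point at exactly the situation in this thread, and re-running
the test after raising choose_total_tries shows whether the new value is
sufficient.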