I will catch up with the issues in the ML and hopefully with the code.
Yes, the two nodes are very different from the other two; we are in the
middle of restructuring this cluster, hence the irregularity.

Thanks a lot, Dan

On 26 July 2016 at 15:25, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> Cool, glad that worked. You'll have to read backwards in the ML to
> find this discussed -- though it is rarely needed, therefore rarely
> discussed. For the code, it's used in src/crush/mapper.c.
>
> Most clusters, irrespective of size, work with 50 tries. Clusters that
> need more than 50 tries usually have some irregularity in their CRUSH
> tree -- in your case it's 2 big hosts, 2 small hosts. For a 3-replica
> PG, the CRUSH algorithm makes random tries to find 3 unique OSDs to
> satisfy the CRUSH rule. But sometimes 50 tries isn't enough... it
> just needs a few more to find that elusive 3rd replica.
>
> The signature of this issue is as you saw -- a 3-replica pool, but a PG
> is stuck with only 2 up/acting OSDs.
>
> Regards,
>
> Dan
>
> P.S. Looking again at your osd tree -- I wonder if you've already
> realized that a 3-replica (host-wise) pool is going to be limited to
> < 0.8TB usable space. (The two 0.39999 hosts will fill up well before
> the two larger hosts are full.)
>
>
> On Tue, Jul 26, 2016 at 1:55 PM, Kostis Fardelas <dante1234@xxxxxxxxx> wrote:
>> Hello Dan,
>> I increased choose_total_tries to 75 and the misplaced objects dropped
>> to 286. One more increase to 100 got me down to 141 misplaced objects,
>> and one more to 125 let the cluster fully recover! I also verified that
>> I can now down + out an OSD and the cluster will still fully recover.
>>
>> My problem is that this setting would never have crossed my mind. Even
>> in the docs, it is written for choose_total_tries that "For extremely
>> large clusters, a larger value might be necessary.", but my cluster
>> with 16 OSDs and 40T at 13% utilization could hardly be considered
>> such an (extremely large) cluster. I also wonder what the value should
>> be when I apply the tunables to my largest clusters with over 150 OSDs
>> and hundreds of TB...
>>
>> I would be grateful if you could point me to some code or
>> documentation (for this tunable and the others too) that would have
>> made me "see" the problem earlier and plan for the future.
>>
>> Kostis
>>
>>
>> On 26 July 2016 at 12:42, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>>> Hi,
>>>
>>> Starting from the beginning...
>>>
>>> If a 3-replica PG gets stuck with only 2 replicas after changing
>>> tunables, it's probably a case where choose_total_tries is too low
>>> for your cluster configuration.
>>> Try increasing choose_total_tries from 50 to 75.
>>>
>>> -- Dan
>>>
>>>
>>>
>>> On Fri, Jul 22, 2016 at 4:17 PM, Kostis Fardelas <dante1234@xxxxxxxxx> wrote:
>>>> Hello,
>>>> being on the latest Hammer, I think I hit a bug with tunables more
>>>> recent than legacy.
>>>>
>>>> Having been on legacy tunables for a while, I decided to experiment
>>>> with "better" tunables. So first I went from the argonaut profile to
>>>> bobtail and then to firefly. However, I decided to change
>>>> chooseleaf_vary_r incrementally (because the remapping from 0 to 5
>>>> was already huge), from 5 down to the best value (1). So when I
>>>> reached chooseleaf_vary_r = 2, I decided to run a simple test before
>>>> going to chooseleaf_vary_r = 1: stop an OSD (osd.14) and let the
>>>> cluster recover. But the recovery never completes and a PG remains
>>>> stuck, reported as undersized+degraded.
>>>> No OSD is near full and all pools have min_size=1.
>>>>
>>>> ceph osd crush show-tunables -f json-pretty
>>>>
>>>> {
>>>>     "choose_local_tries": 0,
>>>>     "choose_local_fallback_tries": 0,
>>>>     "choose_total_tries": 50,
>>>>     "chooseleaf_descend_once": 1,
>>>>     "chooseleaf_vary_r": 2,
>>>>     "straw_calc_version": 1,
>>>>     "allowed_bucket_algs": 22,
>>>>     "profile": "unknown",
>>>>     "optimal_tunables": 0,
>>>>     "legacy_tunables": 0,
>>>>     "require_feature_tunables": 1,
>>>>     "require_feature_tunables2": 1,
>>>>     "require_feature_tunables3": 1,
>>>>     "has_v2_rules": 0,
>>>>     "has_v3_rules": 0,
>>>>     "has_v4_buckets": 0
>>>> }
>>>>
>>>> The really strange thing is that the OSDs of the stuck PG belong to
>>>> other nodes than the one I decided to stop (osd.14).
>>>>
>>>> # ceph pg dump_stuck
>>>> ok
>>>> pg_stat  state                       up     up_primary  acting  acting_primary
>>>> 179.38   active+undersized+degraded  [2,8]  2           [2,8]   2
>>>>
>>>>
>>>> ID WEIGHT   TYPE NAME                   UP/DOWN  REWEIGHT  PRIMARY-AFFINITY
>>>> -1 11.19995 root default
>>>> -3 11.19995     rack unknownrack
>>>> -2  0.39999         host staging-rd0-03
>>>> 14  0.20000             osd.14          up       1.00000   1.00000
>>>> 15  0.20000             osd.15          up       1.00000   1.00000
>>>> -8  5.19998         host staging-rd0-01
>>>>  6  0.59999             osd.6           up       1.00000   1.00000
>>>>  7  0.59999             osd.7           up       1.00000   1.00000
>>>>  8  1.00000             osd.8           up       1.00000   1.00000
>>>>  9  1.00000             osd.9           up       1.00000   1.00000
>>>> 10  1.00000             osd.10          up       1.00000   1.00000
>>>> 11  1.00000             osd.11          up       1.00000   1.00000
>>>> -7  5.19998         host staging-rd0-00
>>>>  0  0.59999             osd.0           up       1.00000   1.00000
>>>>  1  0.59999             osd.1           up       1.00000   1.00000
>>>>  2  1.00000             osd.2           up       1.00000   1.00000
>>>>  3  1.00000             osd.3           up       1.00000   1.00000
>>>>  4  1.00000             osd.4           up       1.00000   1.00000
>>>>  5  1.00000             osd.5           up       1.00000   1.00000
>>>> -4  0.39999         host staging-rd0-02
>>>> 12  0.20000             osd.12          up       1.00000   1.00000
>>>> 13  0.20000             osd.13          up       1.00000   1.00000
>>>>
>>>>
>>>> Have you experienced something similar?
>>>>
>>>> Regards,
>>>> Kostis
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-users@xxxxxxxxxxxxxx
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
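
A minimal sketch of how choose_total_tries is typically raised, by editing
the CRUSH map directly -- the rule number (0), replica count (3) and file
names below are placeholders, not taken from the thread; adjust them to
your cluster:

    # grab the current CRUSH map and decompile it to text
    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt

    # in crushmap.txt, edit the tunables section, e.g.
    #   tunable choose_total_tries 75

    # recompile and test the mappings offline before injecting;
    # --show-bad-mappings prints nothing once the value is high enough
    crushtool -c crushmap.txt -o crushmap.new
    crushtool -i crushmap.new --test --rule 0 --num-rep 3 --show-bad-mappings

    # inject the new map (expect some data movement)
    ceph osd setcrushmap -i crushmap.new

The crushtool --test run is a cheap offline check for exactly the symptom
discussed above: with too few tries some PGs map to fewer than num-rep
OSDs and show up as bad mappings; once choose_total_tries is large enough
for the tree, the bad mappings disappear.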