I will catch up with the issues in the ML and hopefully with the code.
Yes, the two nodes are very different from the other two; we are in the
middle of restructuring this cluster, hence the irregularity.

Thanks a lot, Dan

On 26 July 2016 at 15:25, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> Cool, glad that worked. You'll have to read backwards in the ML to
> find this discussed -- though it is rarely needed, therefore rarely
> discussed. For the code, it's used in src/crush/mapper.c.
>
> Most clusters, irrespective of size, work with 50 tries. Clusters that
> need more than 50 tries usually have some irregularity in their CRUSH
> tree -- in your case it's 2 big hosts, 2 small hosts. For a 3-replica
> PG, the CRUSH algorithm makes random tries to find 3 unique OSDs to
> satisfy the CRUSH rule. But sometimes 50 tries isn't enough... it
> just needs a few more to find that elusive 3rd replica.
>
> The signature of this issue is as you saw -- a 3-replica pool, but a PG
> is stuck with only 2 up/acting OSDs.
>
> Regards,
>
> Dan
>
> P.S. Looking again at your osd tree -- I wonder if you've already
> realized that a 3-replica (host-wise) pool is going to be limited to
> < 0.8TB usable space. (The two 0.39999 hosts will fill up well before
> the two larger hosts are full.)
>
>
> On Tue, Jul 26, 2016 at 1:55 PM, Kostis Fardelas <dante1234@xxxxxxxxx> wrote:
>> Hello Dan,
>> I increased choose_total_tries to 75 and the misplaced objects dropped
>> to 286. One more increase to 100 got me down to 141 misplaced objects,
>> and one more to 125 let the cluster fully recover! I also verified that
>> I can now down + out an OSD and the cluster will still fully recover.
>>
>> My problem is that this setting would never have crossed my mind. Even
>> in the docs, it is written for choose_total_tries that "For extremely
>> large clusters, a larger value might be necessary.", but my cluster
>> with 16 OSDs and 40T at 13% utilization could hardly be considered
>> such an (extremely large) cluster. I also wonder what the value should
>> be when I apply the tunables to my largest clusters with over 150 OSDs
>> and hundreds of TB...
>>
>> I would be grateful if you could point me to some code or
>> documentation (for this tunable and the others too) that would have
>> made me "see" the problem earlier and plan for the future.
>>
>> Kostis
>>
>>
>> On 26 July 2016 at 12:42, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>>> Hi,
>>>
>>> Starting from the beginning...
>>>
>>> If a 3-replica PG gets stuck with only 2 replicas after changing
>>> tunables, it's probably a case where choose_total_tries is too low
>>> for your cluster configuration.
>>> Try increasing choose_total_tries from 50 to 75.
>>>
>>> -- Dan
>>>
>>>
>>>
>>> On Fri, Jul 22, 2016 at 4:17 PM, Kostis Fardelas <dante1234@xxxxxxxxx> wrote:
>>>> Hello,
>>>> being on the latest Hammer, I think I hit a bug with tunables more
>>>> recent than legacy.
>>>>
>>>> Having been on legacy tunables for a while, I decided to experiment
>>>> with "better" tunables. So first I went from the argonaut profile to
>>>> bobtail and then to firefly. However, I decided to change
>>>> chooseleaf_vary_r incrementally (because the remapping from 0 to 5
>>>> was already huge), from 5 down to the best value (1). So when I
>>>> reached chooseleaf_vary_r = 2, I decided to run a simple test before
>>>> going to chooseleaf_vary_r = 1: stop an OSD (osd.14) and let the
>>>> cluster recover. But the recovery never completes and a PG remains
>>>> stuck, reported as undersized+degraded.
>>>> No OSD is near full and all pools have min_size=1.
>>>>
>>>> ceph osd crush show-tunables -f json-pretty
>>>>
>>>> {
>>>>     "choose_local_tries": 0,
>>>>     "choose_local_fallback_tries": 0,
>>>>     "choose_total_tries": 50,
>>>>     "chooseleaf_descend_once": 1,
>>>>     "chooseleaf_vary_r": 2,
>>>>     "straw_calc_version": 1,
>>>>     "allowed_bucket_algs": 22,
>>>>     "profile": "unknown",
>>>>     "optimal_tunables": 0,
>>>>     "legacy_tunables": 0,
>>>>     "require_feature_tunables": 1,
>>>>     "require_feature_tunables2": 1,
>>>>     "require_feature_tunables3": 1,
>>>>     "has_v2_rules": 0,
>>>>     "has_v3_rules": 0,
>>>>     "has_v4_buckets": 0
>>>> }
>>>>
>>>> The really strange thing is that the OSDs of the stuck PG belong to
>>>> other nodes than the one I decided to stop (osd.14).
>>>>
>>>> # ceph pg dump_stuck
>>>> ok
>>>> pg_stat  state                       up     up_primary  acting  acting_primary
>>>> 179.38   active+undersized+degraded  [2,8]  2           [2,8]   2
>>>>
>>>>
>>>> ID WEIGHT   TYPE NAME                   UP/DOWN  REWEIGHT  PRIMARY-AFFINITY
>>>> -1 11.19995 root default
>>>> -3 11.19995     rack unknownrack
>>>> -2  0.39999         host staging-rd0-03
>>>> 14  0.20000             osd.14          up       1.00000   1.00000
>>>> 15  0.20000             osd.15          up       1.00000   1.00000
>>>> -8  5.19998         host staging-rd0-01
>>>>  6  0.59999             osd.6           up       1.00000   1.00000
>>>>  7  0.59999             osd.7           up       1.00000   1.00000
>>>>  8  1.00000             osd.8           up       1.00000   1.00000
>>>>  9  1.00000             osd.9           up       1.00000   1.00000
>>>> 10  1.00000             osd.10          up       1.00000   1.00000
>>>> 11  1.00000             osd.11          up       1.00000   1.00000
>>>> -7  5.19998         host staging-rd0-00
>>>>  0  0.59999             osd.0           up       1.00000   1.00000
>>>>  1  0.59999             osd.1           up       1.00000   1.00000
>>>>  2  1.00000             osd.2           up       1.00000   1.00000
>>>>  3  1.00000             osd.3           up       1.00000   1.00000
>>>>  4  1.00000             osd.4           up       1.00000   1.00000
>>>>  5  1.00000             osd.5           up       1.00000   1.00000
>>>> -4  0.39999         host staging-rd0-02
>>>> 12  0.20000             osd.12          up       1.00000   1.00000
>>>> 13  0.20000             osd.13          up       1.00000   1.00000
>>>>
>>>>
>>>> Have you experienced something similar?
>>>>
>>>> Regards,
>>>> Kostis
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-users@xxxxxxxxxxxxxx
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
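
A minimal sketch of how choose_total_tries is typically raised, by editing
the CRUSH map directly -- the rule number (0), replica count (3) and file
names below are placeholders, not taken from the thread; adjust them to
your cluster:

    # grab the current CRUSH map and decompile it to text
    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt

    # in crushmap.txt, edit the tunables section, e.g.
    #   tunable choose_total_tries 75

    # recompile and test the mappings offline before injecting;
    # --show-bad-mappings prints nothing once the value is high enough
    crushtool -c crushmap.txt -o crushmap.new
    crushtool -i crushmap.new --test --rule 0 --num-rep 3 --show-bad-mappings

    # inject the new map (expect some data movement)
    ceph osd setcrushmap -i crushmap.new

The crushtool --test run is a cheap offline check for exactly the symptom
discussed above: with too few tries some PGs map to fewer than num-rep
OSDs and show up as bad mappings; once choose_total_tries is large enough
for the tree, the bad mappings disappear.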