Hello Dan,

I increased choose_total_tries to 75 and the number of misplaced objects dropped to 286. A further increase to 100 left 141 misplaced objects, and one more to 125 let the cluster fully recover! I also verified that I can now mark an OSD down + out and the cluster still recovers completely.

My problem is that this setting would never have crossed my mind. Even the docs only say of choose_total_tries that "For extremely large clusters, a larger value might be necessary.", but my cluster, with 16 OSDs and 40T at 13% utilization, can hardly be called extremely large. I also wonder what the right value will be when I apply the tunables to my largest clusters, with over 150 OSDs and hundreds of TB...

I would be grateful if you could point me to some code or documentation (for this tunable and the others as well) that would have let me "see" the problem earlier and plan for the future. I have appended a rough crushtool sketch after the quoted thread below, showing how I now intend to dry-run such changes; corrections welcome.

Kostis

On 26 July 2016 at 12:42, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> Hi,
>
> Starting from the beginning...
>
> If a 3-replica PG gets stuck with only 2 replicas after changing
> tunables, it's probably a case where choose_total_tries is too low for
> your cluster configuration.
> Try increasing choose_total_tries from 50 to 75.
>
> -- Dan
>
>
> On Fri, Jul 22, 2016 at 4:17 PM, Kostis Fardelas <dante1234@xxxxxxxxx> wrote:
>> Hello,
>> running the latest Hammer, I think I have hit a bug with tunables more
>> recent than the legacy ones.
>>
>> Having been on legacy tunables for a while, I decided to experiment
>> with "better" tunables. First I went from the argonaut profile to
>> bobtail and then towards firefly. However, I decided to change
>> chooseleaf_vary_r incrementally (because the remapping from 0 to 5 was
>> huge), stepping from 5 down to the best value (1). When I reached
>> chooseleaf_vary_r = 2, I ran a simple test before going to
>> chooseleaf_vary_r = 1: stop an OSD (osd.14) and let the cluster
>> recover. But the recovery never completes and one PG remains stuck,
>> reported as undersized+degraded. No OSD is near full and all pools
>> have min_size=1.
>>
>> ceph osd crush show-tunables -f json-pretty
>>
>> {
>>     "choose_local_tries": 0,
>>     "choose_local_fallback_tries": 0,
>>     "choose_total_tries": 50,
>>     "chooseleaf_descend_once": 1,
>>     "chooseleaf_vary_r": 2,
>>     "straw_calc_version": 1,
>>     "allowed_bucket_algs": 22,
>>     "profile": "unknown",
>>     "optimal_tunables": 0,
>>     "legacy_tunables": 0,
>>     "require_feature_tunables": 1,
>>     "require_feature_tunables2": 1,
>>     "require_feature_tunables3": 1,
>>     "has_v2_rules": 0,
>>     "has_v3_rules": 0,
>>     "has_v4_buckets": 0
>> }
>>
>> The really strange thing is that the OSDs of the stuck PG are on
>> nodes other than the one I decided to stop (osd.14).
>>
>> # ceph pg dump_stuck
>> ok
>> pg_stat  state                       up     up_primary  acting  acting_primary
>> 179.38   active+undersized+degraded  [2,8]  2           [2,8]   2
>>
>>
>> ID WEIGHT   TYPE NAME                    UP/DOWN  REWEIGHT  PRIMARY-AFFINITY
>> -1 11.19995 root default
>> -3 11.19995     rack unknownrack
>> -2  0.39999         host staging-rd0-03
>> 14  0.20000             osd.14                up   1.00000           1.00000
>> 15  0.20000             osd.15                up   1.00000           1.00000
>> -8  5.19998         host staging-rd0-01
>>  6  0.59999             osd.6                 up   1.00000           1.00000
>>  7  0.59999             osd.7                 up   1.00000           1.00000
>>  8  1.00000             osd.8                 up   1.00000           1.00000
>>  9  1.00000             osd.9                 up   1.00000           1.00000
>> 10  1.00000             osd.10                up   1.00000           1.00000
>> 11  1.00000             osd.11                up   1.00000           1.00000
>> -7  5.19998         host staging-rd0-00
>>  0  0.59999             osd.0                 up   1.00000           1.00000
>>  1  0.59999             osd.1                 up   1.00000           1.00000
>>  2  1.00000             osd.2                 up   1.00000           1.00000
>>  3  1.00000             osd.3                 up   1.00000           1.00000
>>  4  1.00000             osd.4                 up   1.00000           1.00000
>>  5  1.00000             osd.5                 up   1.00000           1.00000
>> -4  0.39999         host staging-rd0-02
>> 12  0.20000             osd.12                up   1.00000           1.00000
>> 13  0.20000             osd.13                up   1.00000           1.00000
>>
>>
>> Have you experienced something similar?
>>
>> Regards,
>> Kostis
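
P.S. For the record, this is roughly how I now plan to dry-run tunable changes offline with crushtool before touching the bigger clusters. I have only pieced it together from the crushtool man page, so treat it as a sketch rather than a tested recipe -- the rule id (0), replica count (3) and input range below are placeholders for my own pools:

    # Grab the compiled CRUSH map from the cluster:
    ceph osd getcrushmap -o crushmap.bin

    # Map a range of sample inputs through a rule and print only the inputs
    # for which CRUSH could not find the requested number of OSDs:
    crushtool -i crushmap.bin --test --rule 0 --num-rep 3 \
        --min-x 0 --max-x 10000 --show-bad-mappings

    # Write a copy of the map with a higher retry budget (the same idea should
    # work for stepping chooseleaf_vary_r, e.g. --set-chooseleaf-vary-r 1)
    # and repeat the test against the adjusted map:
    crushtool -i crushmap.bin --set-choose-total-tries 100 -o crushmap.new
    crushtool -i crushmap.new --test --rule 0 --num-rep 3 \
        --min-x 0 --max-x 10000 --show-bad-mappings

    # Only if the candidate map tests clean, inject it into the cluster:
    ceph osd setcrushmap -i crushmap.new

My understanding is that a run with no bad mappings means CRUSH can always fill the full acting set with those tunables, which should translate to no stuck undersized PGs after the change -- please correct me if that reasoning is off.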