Re: Recovery stuck after adjusting to recent tunables

On Tue, Jul 26, 2016 at 6:08 AM, Kostis Fardelas <dante1234@xxxxxxxxx> wrote:
> Following up, I increased pg_num/pgp_num for my 3-replica pool to 128

These pg numbers seem low.

Can you take a look at http://ceph.com/pgcalc/ and verify these values
are appropriate for your environment and use case?

I'd also take a good look at your crush rules to determine if they are
contributing to the problem.
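
For example, to see what the pool is actually using (scbench, judging
by your osd dump further down), something like:

    ceph osd pool get scbench pg_num
    ceph osd pool get scbench crush_ruleset
    ceph osd crush rule dump

would show whether the rule in question can actually satisfy size=3
across your current tree.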

> (while still on argonaut tunables) and, after the small recovery that
> followed, I switched to bobtail tunables. Remapping started and got
> stuck (!) again, this time without any OSD down, with 1 PG
> active+remapped. I tried restarting the PG's OSDs, with no luck.
>
> One thing to note is that the stuck PGs are always on this 3-replica pool.
>
> Finally, I decided to take the hit and switch to the firefly tunables
> (with chooseleaf_vary_r=1) just for the sake of it. Misplaced objects
> stand at 51% of the cluster right now, so I am going to wait and
> update this thread with the outcome when the dust settles.
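>
> For anyone following along, the profile switch itself is a single
> command (and it kicks off the remapping immediately):
>
>     ceph osd crush tunables firefly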
>
> All in all, even if the firefly tunables lead to a healthy PG
> distribution, I am afraid I am going to stick with the argonaut
> tunables from now on; the experience was far from encouraging, and
> there is little documentation regarding the pros and cons of tunables
> profile changes and their impact on a production cluster.
>
> Kostis
>
> On 24 July 2016 at 14:29, Kostis Fardelas <dante1234@xxxxxxxxx> wrote:
>> Nice to hear from you, Goncalo. What you propose sounds like an
>> interesting theory; I will test it tomorrow and let you know. In the
>> meantime, I ran the same test with bobtail and argonaut tunables:
>> - with argonaut tunables, the recovery runs to completion
>> - with bobtail tunables, the situation is worse than with firefly: I
>> got even more degraded and misplaced objects, and recovery got stuck
>> across 6 PGs
>>
>> I also came across a thread with an almost identical case [1], where
>> Sage recommends switching to hammer tunables and the straw2
>> algorithm, but this is not an option for a lot of people due to
>> kernel requirements.
>>
>> [1] https://www.spinics.net/lists/ceph-devel/msg30381.html
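>>
>> For the record, Sage's suggestion amounts to "ceph osd crush tunables
>> hammer" plus switching each bucket from straw to straw2, which can be
>> done by editing the decompiled crushmap, roughly:
>>
>>     ceph osd getcrushmap -o crush.map
>>     crushtool -d crush.map -o crush.txt
>>     # change "alg straw" to "alg straw2" for each bucket
>>     crushtool -c crush.txt -o crush.new
>>     ceph osd setcrushmap -i crush.new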
>>
>>
>> On 24 July 2016 at 03:44, Goncalo Borges <goncalo.borges@xxxxxxxxxxxxx> wrote:
>>> Hi Kostis
>>> This is a wild guess, but one thing I note is that your pool 179 has a very low PG number (100).
>>>
>>> Maybe the algorithm behind the new tunables needs a higher PG number to actually proceed with the recovery?
>>>
>>> You could try increasing the PGs to 128 (it is always better to use powers of 2) and see if the recovery completes.
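>>>
>>> Something like the following should do it (pgp_num has to follow
>>> pg_num for the data to actually move):
>>>
>>>     ceph osd pool set scbench pg_num 128
>>>     ceph osd pool set scbench pgp_num 128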
>>>
>>> Cheers
>>> G.
>>> ________________________________________
>>> From: ceph-users [ceph-users-bounces@xxxxxxxxxxxxxx] on behalf of Kostis Fardelas [dante1234@xxxxxxxxx]
>>> Sent: 23 July 2016 16:32
>>> To: Brad Hubbard
>>> Cc: ceph-users
>>> Subject: Re:  Recovery stuck after adjusting to recent tunables
>>>
>>> Hi Brad,
>>>
>>> pool 0 'data' replicated size 2 min_size 1 crush_ruleset 3 object_hash
>>> rjenkins pg_num 2048 pgp_num 2048 last_change 119047
>>> crash_replay_interval 45 stripe_width 0
>>> pool 1 'metadata' replicated size 2 min_size 1 crush_ruleset 3
>>> object_hash rjenkins pg_num 2048 pgp_num 2048 last_change 119048
>>> stripe_width 0
>>> pool 2 'rbd' replicated size 2 min_size 1 crush_ruleset 3 object_hash
>>> rjenkins pg_num 2048 pgp_num 2048 last_change 119049 stripe_width 0
>>> pool 3 'blocks' replicated size 2 min_size 1 crush_ruleset 4
>>> object_hash rjenkins pg_num 2048 pgp_num 2048 last_change 119050
>>> stripe_width 0
>>> pool 4 'maps' replicated size 2 min_size 1 crush_ruleset 3 object_hash
>>> rjenkins pg_num 2048 pgp_num 2048 last_change 119051 stripe_width 0
>>> pool 179 'scbench' replicated size 3 min_size 1 crush_ruleset 0
>>> object_hash rjenkins pg_num 100 pgp_num 100 last_change 154034 flags
>>> hashpspool stripe_width 0
>>>
>>> This is the status of 179.38 when the cluster is healthy:
>>> http://pastebin.ca/3663600
>>>
>>> and this is when recovery is stuck:
>>> http://pastebin.ca/3663601
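>>>
>>> Both captures are the pg query you asked for, i.e.:
>>>
>>>     ceph pg 179.38 query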
>>>
>>>
>>> It seems that the PG is replicated with size 3, but the cluster
>>> cannot create the third replica for some objects whose third OSD
>>> (osd.14) is down. That was not the case with the argonaut tunables,
>>> as far as I remember.
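>>>
>>> The up/acting sets can also be checked directly with:
>>>
>>>     ceph pg map 179.38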
>>>
>>> Regards
>>>
>>>
>>> On 23 July 2016 at 06:16, Brad Hubbard <bhubbard@xxxxxxxxxx> wrote:
>>>> On Sat, Jul 23, 2016 at 12:17 AM, Kostis Fardelas <dante1234@xxxxxxxxx> wrote:
>>>>> Hello,
>>>>> being on latest Hammer, I think I have hit a bug with tunables more
>>>>> recent than legacy.
>>>>>
>>>>> Having been on legacy tunables for a while, I decided to experiment
>>>>> with "better" tunables. So first I went from the argonaut profile to
>>>>> bobtail, and then to firefly. However, I decided to make the changes
>>>>> to chooseleaf_vary_r incrementally (because the remapping from 0 to 5
>>>>> was huge), stepping from 5 down to the best value (1). When I reached
>>>>> chooseleaf_vary_r = 2, I decided to run a simple test before going to
>>>>> chooseleaf_vary_r = 1: stop an OSD (osd.14) and let the cluster
>>>>> recover. But the recovery never completes, and a PG remains stuck,
>>>>> reported as undersized+degraded. No OSD is near full, and all pools
>>>>> have min_size=1.
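>>>>>
>>>>> For reference, the intermediate chooseleaf_vary_r values are not
>>>>> part of any profile, so I set them through the crushmap, roughly:
>>>>>
>>>>>     ceph osd getcrushmap -o crush.map
>>>>>     crushtool -i crush.map --set-chooseleaf-vary-r 2 -o crush.new
>>>>>     ceph osd setcrushmap -i crush.new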
>>>>>
>>>>> ceph osd crush show-tunables -f json-pretty
>>>>>
>>>>> {
>>>>>     "choose_local_tries": 0,
>>>>>     "choose_local_fallback_tries": 0,
>>>>>     "choose_total_tries": 50,
>>>>>     "chooseleaf_descend_once": 1,
>>>>>     "chooseleaf_vary_r": 2,
>>>>>     "straw_calc_version": 1,
>>>>>     "allowed_bucket_algs": 22,
>>>>>     "profile": "unknown",
>>>>>     "optimal_tunables": 0,
>>>>>     "legacy_tunables": 0,
>>>>>     "require_feature_tunables": 1,
>>>>>     "require_feature_tunables2": 1,
>>>>>     "require_feature_tunables3": 1,
>>>>>     "has_v2_rules": 0,
>>>>>     "has_v3_rules": 0,
>>>>>     "has_v4_buckets": 0
>>>>> }
>>>>>
>>>>> The really strange thing is that the OSDs of the stuck PG belong to
>>>>> nodes other than the one hosting the OSD I stopped (osd.14).
>>>>>
>>>>> # ceph pg dump_stuck
>>>>> ok
>>>>> pg_stat state up up_primary acting acting_primary
>>>>> 179.38 active+undersized+degraded [2,8] 2 [2,8] 2
>>>>
>>>> Can you share a query of this pg?
>>>>
>>>> What size (not min size) is this pool (assuming it's 2)?
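>>>>
>>>> You can check it with:
>>>>
>>>>     ceph osd pool get <poolname> size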
>>>>
>>>>>
>>>>>
>>>>> ID WEIGHT   TYPE NAME                   UP/DOWN REWEIGHT PRIMARY-AFFINITY
>>>>> -1 11.19995 root default
>>>>> -3 11.19995     rack unknownrack
>>>>> -2  0.39999         host staging-rd0-03
>>>>> 14  0.20000             osd.14               up  1.00000          1.00000
>>>>> 15  0.20000             osd.15               up  1.00000          1.00000
>>>>> -8  5.19998         host staging-rd0-01
>>>>>  6  0.59999             osd.6                up  1.00000          1.00000
>>>>>  7  0.59999             osd.7                up  1.00000          1.00000
>>>>>  8  1.00000             osd.8                up  1.00000          1.00000
>>>>>  9  1.00000             osd.9                up  1.00000          1.00000
>>>>> 10  1.00000             osd.10               up  1.00000          1.00000
>>>>> 11  1.00000             osd.11               up  1.00000          1.00000
>>>>> -7  5.19998         host staging-rd0-00
>>>>>  0  0.59999             osd.0                up  1.00000          1.00000
>>>>>  1  0.59999             osd.1                up  1.00000          1.00000
>>>>>  2  1.00000             osd.2                up  1.00000          1.00000
>>>>>  3  1.00000             osd.3                up  1.00000          1.00000
>>>>>  4  1.00000             osd.4                up  1.00000          1.00000
>>>>>  5  1.00000             osd.5                up  1.00000          1.00000
>>>>> -4  0.39999         host staging-rd0-02
>>>>> 12  0.20000             osd.12               up  1.00000          1.00000
>>>>> 13  0.20000             osd.13               up  1.00000          1.00000
>>>>>
>>>>>
>>>>> Have you experienced something similar?
>>>>>
>>>>> Regards,
>>>>> Kostis
>>>>
>>>>
>>>>
>>>> --
>>>> Cheers,
>>>> Brad



-- 
Cheers,
Brad