Re: Recovery stuck after adjusting to recent tunables

Goncalo Borges <goncalo.borges@xxxxxxxxxxxxx> · Sun, 24 Jul 2016 00:44:35 +0000

Hi Kostis,
This is a wild guess, but one thing I note is that your pool 179 has a very low PG count (100).

Maybe the algorithm behind the new tunables needs a higher PG count to actually proceed with the recovery?

You could try increasing the PGs to 128 (it is always better to use powers of 2) and see if the recovery completes.
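
As a rough illustration of the suggestion above, the sketch below rounds a PG count up to the next power of two and builds the two `ceph osd pool set` commands that would apply it. The helper names `next_power_of_two` and `pg_resize_commands` are hypothetical, and the pool name `scbench` is taken from the pool listing later in this thread. The script only prints the commands and does not run them against a cluster.

```python
def next_power_of_two(n):
    """Return the smallest power of two that is >= n (n must be >= 1)."""
    if n < 1:
        raise ValueError("PG count must be a positive integer")
    return 1 << (n - 1).bit_length()


def pg_resize_commands(pool, current_pg_num):
    """Build the CLI commands to raise a pool to the next power-of-two PG count.

    pg_num is set first and pgp_num second, because pgp_num cannot exceed
    pg_num. Returns an empty list if the count is already a power of two.
    """
    target = next_power_of_two(current_pg_num)
    if target == current_pg_num:
        return []
    return [
        "ceph osd pool set {} pg_num {}".format(pool, target),
        "ceph osd pool set {} pgp_num {}".format(pool, target),
    ]


if __name__ == "__main__":
    # Pool 179 'scbench' currently has pg_num 100 / pgp_num 100.
    for cmd in pg_resize_commands("scbench", 100):
        print(cmd)
```

Raising pg_num triggers PG splitting and data movement, so on a production cluster it is usually done in small steps rather than in one jump.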

Cheers
G.
________________________________________
From: ceph-users [ceph-users-bounces@xxxxxxxxxxxxxx] on behalf of Kostis Fardelas [dante1234@xxxxxxxxx]
Sent: 23 July 2016 16:32
To: Brad Hubbard
Cc: ceph-users
Subject: Re: Recovery stuck after adjusting to recent tunables

Hi Brad,

pool 0 'data' replicated size 2 min_size 1 crush_ruleset 3 object_hash
rjenkins pg_num 2048 pgp_num 2048 last_change 119047
crash_replay_interval 45 stripe_width 0
pool 1 'metadata' replicated size 2 min_size 1 crush_ruleset 3
object_hash rjenkins pg_num 2048 pgp_num 2048 last_change 119048
stripe_width 0
pool 2 'rbd' replicated size 2 min_size 1 crush_ruleset 3 object_hash
rjenkins pg_num 2048 pgp_num 2048 last_change 119049 stripe_width 0
pool 3 'blocks' replicated size 2 min_size 1 crush_ruleset 4
object_hash rjenkins pg_num 2048 pgp_num 2048 last_change 119050
stripe_width 0
pool 4 'maps' replicated size 2 min_size 1 crush_ruleset 3 object_hash
rjenkins pg_num 2048 pgp_num 2048 last_change 119051 stripe_width 0
pool 179 'scbench' replicated size 3 min_size 1 crush_ruleset 0
object_hash rjenkins pg_num 100 pgp_num 100 last_change 154034 flags
hashpspool stripe_width 0

This is the status of 179.38 when the cluster is healthy:
http://pastebin.ca/3663600

and this is when recovery is stuck:
http://pastebin.ca/3663601

It seems that the PG is replicated with size 3 but the cluster cannot
create the third replica for some objects whose third OSD (OSD.14) is
down. As far as I remember, that was not the case with argonaut tunables.
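
To make the situation concrete, here is a small sketch that counts how many replicas PG 179.38 is missing and lists the hosts that could still take one. The OSD-to-host layout, the acting set [2,8] and the pool size of 3 come from this thread. The one-replica-per-host placement is an assumption, since the CRUSH rule for ruleset 0 is not shown, and the function names are hypothetical.

```python
# OSD-to-host layout copied from the `ceph osd tree` output in this thread.
HOSTS = {
    "staging-rd0-00": [0, 1, 2, 3, 4, 5],
    "staging-rd0-01": [6, 7, 8, 9, 10, 11],
    "staging-rd0-02": [12, 13],
    "staging-rd0-03": [14, 15],
}


def missing_replicas(pool_size, acting):
    """Number of replicas the PG lacks relative to the pool's size."""
    return max(pool_size - len(set(acting)), 0)


def candidate_hosts(hosts, acting, down):
    """Hosts holding no replica yet that still have at least one up OSD.

    Assumes one replica per host (an assumption: the CRUSH rule used by
    pool 179 is not shown in the thread).
    """
    used = {h for h, osds in hosts.items() if set(osds) & set(acting)}
    result = {}
    for host, osds in hosts.items():
        if host in used:
            continue
        up = [o for o in osds if o not in down]
        if up:
            result[host] = up
    return result


if __name__ == "__main__":
    acting = [2, 8]  # acting set of 179.38 from `ceph pg dump_stuck`
    print("missing replicas:", missing_replicas(3, acting))
    print("candidate hosts:", candidate_hosts(HOSTS, acting, down={14}))
```

Under that assumption, two hosts (staging-rd0-02, and staging-rd0-03 via osd.15) still have up OSDs available. This suggests the PG is not stuck for lack of a valid target, and that CRUSH may instead be giving up before it finds one.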

Regards

On 23 July 2016 at 06:16, Brad Hubbard <bhubbard@xxxxxxxxxx> wrote:
> On Sat, Jul 23, 2016 at 12:17 AM, Kostis Fardelas <dante1234@xxxxxxxxx> wrote:
>> Hello,
>> running the latest Hammer, I think I hit a bug with tunables more
>> recent than legacy.
>>
>> Having been on legacy tunables for a while, I decided to experiment with
>> "better" tunables. So first I went from the argonaut profile to bobtail
>> and then to firefly. However, I decided to change chooseleaf_vary_r
>> incrementally (because the remapping from 0 to 5 was huge), from 5
>> down to the best value (1). So when I reached
>> chooseleaf_vary_r = 2, I decided to run a simple test before going to
>> chooseleaf_vary_r = 1: stop an OSD (OSD.14) and let the cluster
>> recover. But the recovery never completes and a PG remains stuck,
>> reported as undersized+degraded. No OSD is near full and all pools
>> have min_size=1.
>>
>> ceph osd crush show-tunables -f json-pretty
>>
>> {
>>     "choose_local_tries": 0,
>>     "choose_local_fallback_tries": 0,
>>     "choose_total_tries": 50,
>>     "chooseleaf_descend_once": 1,
>>     "chooseleaf_vary_r": 2,
>>     "straw_calc_version": 1,
>>     "allowed_bucket_algs": 22,
>>     "profile": "unknown",
>>     "optimal_tunables": 0,
>>     "legacy_tunables": 0,
>>     "require_feature_tunables": 1,
>>     "require_feature_tunables2": 1,
>>     "require_feature_tunables3": 1,
>>     "has_v2_rules": 0,
>>     "has_v3_rules": 0,
>>     "has_v4_buckets": 0
>> }
>>
>> The really strange thing is that the OSDs of the stuck PG are on
>> nodes other than the one hosting the OSD I stopped (osd.14).
>>
>> # ceph pg dump_stuck
>> ok
>> pg_stat state up up_primary acting acting_primary
>> 179.38 active+undersized+degraded [2,8] 2 [2,8] 2
>
> Can you share a query of this pg?
>
> What size (not min size) is this pool (assuming it's 2)?
>
>>
>>
>> ID WEIGHT   TYPE NAME                   UP/DOWN REWEIGHT PRIMARY-AFFINITY
>> -1 11.19995 root default
>> -3 11.19995     rack unknownrack
>> -2  0.39999         host staging-rd0-03
>> 14  0.20000             osd.14               up  1.00000          1.00000
>> 15  0.20000             osd.15               up  1.00000          1.00000
>> -8  5.19998         host staging-rd0-01
>>  6  0.59999             osd.6                up  1.00000          1.00000
>>  7  0.59999             osd.7                up  1.00000          1.00000
>>  8  1.00000             osd.8                up  1.00000          1.00000
>>  9  1.00000             osd.9                up  1.00000          1.00000
>> 10  1.00000             osd.10               up  1.00000          1.00000
>> 11  1.00000             osd.11               up  1.00000          1.00000
>> -7  5.19998         host staging-rd0-00
>>  0  0.59999             osd.0                up  1.00000          1.00000
>>  1  0.59999             osd.1                up  1.00000          1.00000
>>  2  1.00000             osd.2                up  1.00000          1.00000
>>  3  1.00000             osd.3                up  1.00000          1.00000
>>  4  1.00000             osd.4                up  1.00000          1.00000
>>  5  1.00000             osd.5                up  1.00000          1.00000
>> -4  0.39999         host staging-rd0-02
>> 12  0.20000             osd.12               up  1.00000          1.00000
>> 13  0.20000             osd.13               up  1.00000          1.00000
>>
>>
>> Have you experienced something similar?
>>
>> Regards,
>> Kostis
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> --
> Cheers,
> Brad