Following up: I increased pg_num/pgp_num for my 3-replica pool to 128 (while still on argonaut tunables) and, after the small recovery that followed, switched to bobtail tunables. Remapping started and got stuck (!) again, this time with no OSD down and 1 PG left active+remapped. I tried restarting that PG's OSDs, with no luck. One thing worth noting is that the stuck PGs are always on this 3-replica pool.

Finally, I decided to take the hit and switch to firefly tunables (with chooseleaf_vary_r=1), just for the sake of it. About 51% of the cluster's objects are misplaced right now, so I will wait and update the thread with the outcome once the dust settles.

All in all, even if the firefly tunables lead to a healthy PG distribution, I am afraid I am going to stick with argonaut tunables from now on. The experience was far from encouraging, and there is little documentation on the pros and cons of tunables profile changes and their impact on a production cluster.

Kostis

On 24 July 2016 at 14:29, Kostis Fardelas <dante1234@xxxxxxxxx> wrote:
> Nice to hear from you, Goncalo.
> What you propose sounds like an interesting theory; I will test it
> tomorrow and let you know. In the meantime, I did the same test with
> bobtail and argonaut tunables:
> - with argonaut tunables, the recovery completes
> - with bobtail tunables, the situation is worse than with firefly - I
> got even more degraded and misplaced objects, and recovery got stuck
> across 6 PGs
>
> I also came across a thread with an almost identical case [1], where
> Sage recommends switching to hammer tunables and the straw2 algorithm,
> but this is not an option for a lot of people due to kernel
> requirements.
>
> [1] https://www.spinics.net/lists/ceph-devel/msg30381.html
>
>
> On 24 July 2016 at 03:44, Goncalo Borges <goncalo.borges@xxxxxxxxxxxxx> wrote:
>> Hi Kostis
>>
>> This is a wild guess, but one thing I note is that your pool 179 has a very low PG count (100).
>>
>> Maybe the algorithm behind the new tunables needs a higher PG count to actually proceed with the recovery?
>>
>> You could try increasing the PGs to 128 (it is always better to use powers of 2) and see if the recovery completes.
>>
>> Cheers
>> G.
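(For reference, the PG split and the profile switches described above correspond roughly to the following standard commands. This is a sketch rather than the exact shell history; 'scbench' is the 3-replica pool, id 179, from the pool dump quoted further down.)

  ceph osd pool set scbench pg_num 128
  ceph osd pool set scbench pgp_num 128
  # let the resulting backfill settle, then switch profiles:
  ceph osd crush tunables bobtail
  ceph osd crush tunables firefly   # the firefly profile includes chooseleaf_vary_r=1
  ceph -s                           # watch misplaced/degraded percentages during the remap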
>> ________________________________________
>> From: ceph-users [ceph-users-bounces@xxxxxxxxxxxxxx] on behalf of Kostis Fardelas [dante1234@xxxxxxxxx]
>> Sent: 23 July 2016 16:32
>> To: Brad Hubbard
>> Cc: ceph-users
>> Subject: Re: Recovery stuck after adjusting to recent tunables
>>
>> Hi Brad,
>>
>> pool 0 'data' replicated size 2 min_size 1 crush_ruleset 3 object_hash rjenkins pg_num 2048 pgp_num 2048 last_change 119047 crash_replay_interval 45 stripe_width 0
>> pool 1 'metadata' replicated size 2 min_size 1 crush_ruleset 3 object_hash rjenkins pg_num 2048 pgp_num 2048 last_change 119048 stripe_width 0
>> pool 2 'rbd' replicated size 2 min_size 1 crush_ruleset 3 object_hash rjenkins pg_num 2048 pgp_num 2048 last_change 119049 stripe_width 0
>> pool 3 'blocks' replicated size 2 min_size 1 crush_ruleset 4 object_hash rjenkins pg_num 2048 pgp_num 2048 last_change 119050 stripe_width 0
>> pool 4 'maps' replicated size 2 min_size 1 crush_ruleset 3 object_hash rjenkins pg_num 2048 pgp_num 2048 last_change 119051 stripe_width 0
>> pool 179 'scbench' replicated size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 100 pgp_num 100 last_change 154034 flags hashpspool stripe_width 0
>>
>> This is the status of 179.38 when the cluster is healthy:
>> http://pastebin.ca/3663600
>>
>> and this is when recovery is stuck:
>> http://pastebin.ca/3663601
>>
>> It seems that the PG is replicated with size 3, but the cluster cannot
>> create the third replica for some objects whose third OSD (osd.14) is
>> down. That was not the case with argonaut tunables, as far as I remember.
>>
>> Regards
>>
>>
>> On 23 July 2016 at 06:16, Brad Hubbard <bhubbard@xxxxxxxxxx> wrote:
>>> On Sat, Jul 23, 2016 at 12:17 AM, Kostis Fardelas <dante1234@xxxxxxxxx> wrote:
>>>> Hello,
>>>> Being on latest Hammer, I think I have hit a bug with tunables more
>>>> recent than legacy.
>>>>
>>>> Having been on legacy tunables for a while, I decided to experiment
>>>> with "better" tunables. So first I went from the argonaut profile to
>>>> bobtail and then to firefly. However, I decided to change
>>>> chooseleaf_vary_r incrementally (because the remapping from 0 to 5 was
>>>> huge), from 5 down to the best value (1). When I reached
>>>> chooseleaf_vary_r = 2, I decided to run a simple test before going to
>>>> chooseleaf_vary_r = 1: stop an OSD (osd.14) and let the cluster
>>>> recover. But the recovery never completes, and a PG remains stuck,
>>>> reported as undersized+degraded. No OSD is near full and all pools
>>>> have min_size=1.
>>>>
>>>> ceph osd crush show-tunables -f json-pretty
>>>>
>>>> {
>>>>     "choose_local_tries": 0,
>>>>     "choose_local_fallback_tries": 0,
>>>>     "choose_total_tries": 50,
>>>>     "chooseleaf_descend_once": 1,
>>>>     "chooseleaf_vary_r": 2,
>>>>     "straw_calc_version": 1,
>>>>     "allowed_bucket_algs": 22,
>>>>     "profile": "unknown",
>>>>     "optimal_tunables": 0,
>>>>     "legacy_tunables": 0,
>>>>     "require_feature_tunables": 1,
>>>>     "require_feature_tunables2": 1,
>>>>     "require_feature_tunables3": 1,
>>>>     "has_v2_rules": 0,
>>>>     "has_v3_rules": 0,
>>>>     "has_v4_buckets": 0
>>>> }
>>>>
>>>> The really strange thing is that the OSDs of the stuck PG belong to
>>>> nodes other than the one I decided to stop (osd.14).
>>>>
>>>> # ceph pg dump_stuck
>>>> ok
>>>> pg_stat  state                       up     up_primary  acting  acting_primary
>>>> 179.38   active+undersized+degraded  [2,8]  2           [2,8]   2
>>>
>>> Can you share a query of this pg?
>>>
>>> What size (not min size) is this pool (assuming it's 2)?
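(The information Brad asks for can be pulled with something like the following; the pg id and pool name are as quoted above, and the pastebin links earlier in the thread hold the actual query output.)

  ceph pg 179.38 query              # full peering/recovery state of the stuck PG
  ceph osd pool get scbench size    # replica count (3 for pool 179)
  ceph osd pool get scbench min_size
  ceph health detail                # lists the stuck/undersized PGs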
>>>>
>>>> ID WEIGHT   TYPE NAME                   UP/DOWN REWEIGHT PRIMARY-AFFINITY
>>>> -1 11.19995 root default
>>>> -3 11.19995     rack unknownrack
>>>> -2  0.39999         host staging-rd0-03
>>>> 14  0.20000             osd.14               up  1.00000          1.00000
>>>> 15  0.20000             osd.15               up  1.00000          1.00000
>>>> -8  5.19998         host staging-rd0-01
>>>>  6  0.59999             osd.6                up  1.00000          1.00000
>>>>  7  0.59999             osd.7                up  1.00000          1.00000
>>>>  8  1.00000             osd.8                up  1.00000          1.00000
>>>>  9  1.00000             osd.9                up  1.00000          1.00000
>>>> 10  1.00000             osd.10               up  1.00000          1.00000
>>>> 11  1.00000             osd.11               up  1.00000          1.00000
>>>> -7  5.19998         host staging-rd0-00
>>>>  0  0.59999             osd.0                up  1.00000          1.00000
>>>>  1  0.59999             osd.1                up  1.00000          1.00000
>>>>  2  1.00000             osd.2                up  1.00000          1.00000
>>>>  3  1.00000             osd.3                up  1.00000          1.00000
>>>>  4  1.00000             osd.4                up  1.00000          1.00000
>>>>  5  1.00000             osd.5                up  1.00000          1.00000
>>>> -4  0.39999         host staging-rd0-02
>>>> 12  0.20000             osd.12               up  1.00000          1.00000
>>>> 13  0.20000             osd.13               up  1.00000          1.00000
>>>>
>>>> Have you experienced something similar?
>>>>
>>>> Regards,
>>>> Kostis
>>>
>>>
>>> --
>>> Cheers,
>>> Brad
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
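(Finally, for anyone who wants to replay the incremental chooseleaf_vary_r changes described earlier in the thread, or to check how a candidate map behaves against the hierarchy above before injecting it, the usual crushmap round-trip looks roughly like this. A sketch only; the file names are arbitrary.)

  # grab and decompile the current crushmap
  ceph osd getcrushmap -o crushmap.bin
  crushtool -d crushmap.bin -o crushmap.txt

  # edit the "tunable chooseleaf_vary_r N" line in crushmap.txt (e.g. 3 -> 2),
  # then recompile
  crushtool -c crushmap.txt -o crushmap.new

  # dry-run the 3-replica rule (ruleset 0 for pool 'scbench') and look for
  # mappings that come up short of 3 OSDs before injecting the new map
  crushtool -i crushmap.new --test --rule 0 --num-rep 3 --show-bad-mappings

  # inject only when the test comes back clean
  ceph osd setcrushmap -i crushmap.new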