On Tue, Jul 26, 2016 at 6:08 AM, Kostis Fardelas <dante1234@xxxxxxxxx> wrote:
> Following up, I increased pg_num/pgp_num for my 3-replica pool to 128

These pg numbers seem low. Can you take a look at http://ceph.com/pgcalc/
and verify these values are appropriate for your environment and use case?
I'd also take a good look at your crush rules to determine whether they are
contributing to the problem.

> (being in argonaut tunables) and after a small recovery that followed,
> I switched to bobtail tunables. Remapping started and got stuck (!)
> again, without any OSD down this time, with 1 PG active+remapped. Tried
> restarting the PG's OSDs, no luck.
>
> One thing to notice is that the stuck PGs are always on this 3-replica pool.
>
> Finally, I decided to take the hit and switch to firefly tunables
> (with chooseleaf_vary_r=1) just for the sake of it. Misplaced objects
> are at 51% of the cluster right now, so I am going to wait and update
> our thread with the outcome when the dust settles.
>
> All in all, even if firefly tunables lead to a healthy PG
> distribution, I am afraid I am going to stick with argonaut tunables
> from now on; the experience was far from encouraging and there is
> little documentation regarding the pros and cons of tunables profile
> changes and their impact on a production cluster.
>
> Kostis
>
> On 24 July 2016 at 14:29, Kostis Fardelas <dante1234@xxxxxxxxx> wrote:
>> Nice to hear from you Goncalo,
>> What you propose sounds like an interesting theory; I will test it
>> tomorrow and let you know. In the meantime, I ran the same test with
>> bobtail and argonaut tunables:
>> - with argonaut tunables, the recovery completes to the end
>> - with bobtail tunables, the situation is worse than with firefly - I
>> got even more degraded and misplaced objects, and recovery stuck across
>> 6 PGs
>>
>> I also came across a thread with an almost identical case [1], where Sage
>> recommends switching to hammer tunables and the straw2 algorithm, but
>> this is not an option for a lot of people due to kernel requirements.
>>
>> [1] https://www.spinics.net/lists/ceph-devel/msg30381.html
>>
>>
>> On 24 July 2016 at 03:44, Goncalo Borges <goncalo.borges@xxxxxxxxxxxxx> wrote:
>>> Hi Kostis
>>> This is a wild guess, but one thing I note is that your pool 179 has a
>>> very low pg number (100).
>>>
>>> Maybe the algorithm behind the new tunables needs a higher pg number to
>>> actually proceed with the recovery?
>>>
>>> You could try to increase the pgs to 128 (it is always better to use
>>> powers of 2) and see if the recovery completes.
>>>
>>> Cheers
>>> G.
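A minimal sketch of the PG bump Goncalo suggests, assuming the 3-replica pool
in question is 'scbench' as in the pool dump further down (adjust the pool
name and target to your own cluster; the pgcalc rule of thumb is roughly
(number of OSDs x 100) / replica count, rounded to a power of two):

    ceph osd pool set scbench pg_num 128    # split the PGs first
    ceph osd pool set scbench pgp_num 128   # then allow the new PGs to be placed
    ceph -s                                 # wait for the resulting backfill to finish before retesting
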
>>> ________________________________________
>>> From: ceph-users [ceph-users-bounces@xxxxxxxxxxxxxx] on behalf of Kostis Fardelas [dante1234@xxxxxxxxx]
>>> Sent: 23 July 2016 16:32
>>> To: Brad Hubbard
>>> Cc: ceph-users
>>> Subject: Re: Recovery stuck after adjusting to recent tunables
>>>
>>> Hi Brad,
>>>
>>> pool 0 'data' replicated size 2 min_size 1 crush_ruleset 3 object_hash
>>> rjenkins pg_num 2048 pgp_num 2048 last_change 119047
>>> crash_replay_interval 45 stripe_width 0
>>> pool 1 'metadata' replicated size 2 min_size 1 crush_ruleset 3
>>> object_hash rjenkins pg_num 2048 pgp_num 2048 last_change 119048
>>> stripe_width 0
>>> pool 2 'rbd' replicated size 2 min_size 1 crush_ruleset 3 object_hash
>>> rjenkins pg_num 2048 pgp_num 2048 last_change 119049 stripe_width 0
>>> pool 3 'blocks' replicated size 2 min_size 1 crush_ruleset 4
>>> object_hash rjenkins pg_num 2048 pgp_num 2048 last_change 119050
>>> stripe_width 0
>>> pool 4 'maps' replicated size 2 min_size 1 crush_ruleset 3 object_hash
>>> rjenkins pg_num 2048 pgp_num 2048 last_change 119051 stripe_width 0
>>> pool 179 'scbench' replicated size 3 min_size 1 crush_ruleset 0
>>> object_hash rjenkins pg_num 100 pgp_num 100 last_change 154034 flags
>>> hashpspool stripe_width 0
>>>
>>> This is the status of PG 179.38 when the cluster is healthy:
>>> http://pastebin.ca/3663600
>>>
>>> and this is when recovery is stuck:
>>> http://pastebin.ca/3663601
>>>
>>>
>>> It seems that the PG is replicated with size 3, but the cluster cannot
>>> create the third replica for some objects whose third OSD (osd.14) is
>>> down. That was not the case with argonaut tunables, as I remember.
>>>
>>> Regards
>>>
>>>
>>> On 23 July 2016 at 06:16, Brad Hubbard <bhubbard@xxxxxxxxxx> wrote:
>>>> On Sat, Jul 23, 2016 at 12:17 AM, Kostis Fardelas <dante1234@xxxxxxxxx> wrote:
>>>>> Hello,
>>>>> being on the latest Hammer, I think I hit a bug with tunables more
>>>>> recent than legacy.
>>>>>
>>>>> Being on legacy tunables for a while, I decided to experiment with
>>>>> "better" tunables. So first I went from the argonaut profile to bobtail
>>>>> and then to firefly. However, I decided to make the changes to
>>>>> chooseleaf_vary_r incrementally (because the remapping from 0 to 5 was
>>>>> huge), from 5 down to the best value (1). So when I reached
>>>>> chooseleaf_vary_r = 2, I decided to run a simple test before going to
>>>>> chooseleaf_vary_r = 1: stop an OSD (osd.14) and let the cluster
>>>>> recover. But the recovery never completes and a PG remains stuck,
>>>>> reported as undersized+degraded. No OSD is near full and all pools
>>>>> have min_size=1.
>>>>>
>>>>> ceph osd crush show-tunables -f json-pretty
>>>>>
>>>>> {
>>>>>     "choose_local_tries": 0,
>>>>>     "choose_local_fallback_tries": 0,
>>>>>     "choose_total_tries": 50,
>>>>>     "chooseleaf_descend_once": 1,
>>>>>     "chooseleaf_vary_r": 2,
>>>>>     "straw_calc_version": 1,
>>>>>     "allowed_bucket_algs": 22,
>>>>>     "profile": "unknown",
>>>>>     "optimal_tunables": 0,
>>>>>     "legacy_tunables": 0,
>>>>>     "require_feature_tunables": 1,
>>>>>     "require_feature_tunables2": 1,
>>>>>     "require_feature_tunables3": 1,
>>>>>     "has_v2_rules": 0,
>>>>>     "has_v3_rules": 0,
>>>>>     "has_v4_buckets": 0
>>>>> }
>>>>>
>>>>> The really strange thing is that the OSDs of the stuck PG belong to
>>>>> nodes other than the one I decided to stop (osd.14).
>>>>>
>>>>> # ceph pg dump_stuck
>>>>> ok
>>>>> pg_stat  state                       up     up_primary  acting  acting_primary
>>>>> 179.38   active+undersized+degraded  [2,8]  2           [2,8]   2
>>>>
>>>> Can you share a query of this pg?
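The query Brad asks for can be pulled with something like the following (PG
id taken from the dump_stuck output above):

    ceph pg 179.38 query   # full peering/recovery state of the stuck PG
    ceph pg map 179.38     # the up and acting OSD sets CRUSH currently maps it to
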
>>>>
>>>> What size (not min_size) is this pool (assuming it's 2)?
>>>>
>>>>>
>>>>> ID WEIGHT   TYPE NAME                   UP/DOWN REWEIGHT PRIMARY-AFFINITY
>>>>> -1 11.19995 root default
>>>>> -3 11.19995     rack unknownrack
>>>>> -2  0.39999         host staging-rd0-03
>>>>> 14  0.20000             osd.14               up  1.00000          1.00000
>>>>> 15  0.20000             osd.15               up  1.00000          1.00000
>>>>> -8  5.19998         host staging-rd0-01
>>>>>  6  0.59999             osd.6                up  1.00000          1.00000
>>>>>  7  0.59999             osd.7                up  1.00000          1.00000
>>>>>  8  1.00000             osd.8                up  1.00000          1.00000
>>>>>  9  1.00000             osd.9                up  1.00000          1.00000
>>>>> 10  1.00000             osd.10               up  1.00000          1.00000
>>>>> 11  1.00000             osd.11               up  1.00000          1.00000
>>>>> -7  5.19998         host staging-rd0-00
>>>>>  0  0.59999             osd.0                up  1.00000          1.00000
>>>>>  1  0.59999             osd.1                up  1.00000          1.00000
>>>>>  2  1.00000             osd.2                up  1.00000          1.00000
>>>>>  3  1.00000             osd.3                up  1.00000          1.00000
>>>>>  4  1.00000             osd.4                up  1.00000          1.00000
>>>>>  5  1.00000             osd.5                up  1.00000          1.00000
>>>>> -4  0.39999         host staging-rd0-02
>>>>> 12  0.20000             osd.12               up  1.00000          1.00000
>>>>> 13  0.20000             osd.13               up  1.00000          1.00000
>>>>>
>>>>>
>>>>> Have you experienced something similar?
>>>>>
>>>>> Regards,
>>>>> Kostis
>>>>
>>>>
>>>> --
>>>> Cheers,
>>>> Brad

--
Cheers,
Brad
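
The replica counts Brad asks about can be confirmed with something like the
following (pool name taken from the dump earlier in the thread):

    ceph osd pool get scbench size       # replica count; the pool dump shows 3
    ceph osd pool get scbench min_size   # replicas required to keep serving I/O; the dump shows 1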