How many PGs are in the pools? It may be that CRUSH cannot find a suitable OSD.
You can check the tunable choose_total_tries in the CRUSH tunables and try
increasing it, like this:

ceph osd getcrushmap -o crush
crushtool -d crush -o crush.txt
sed -i 's/tunable choose_total_tries 50/tunable choose_total_tries 150/g' crush.txt
crushtool -c crush.txt -o crush.new
ceph osd setcrushmap -i crush.new

Zoltan Arnold Nagy <zoltan@xxxxxxxxxxxxxxxxxx> wrote on Sat, Nov 23, 2019 at 5:26 AM:
>
> On 2019-11-22 21:45, Paul Emmerich wrote:
> > On Fri, Nov 22, 2019 at 9:33 PM Zoltan Arnold Nagy
> > <zoltan@xxxxxxxxxxxxxxxxxx> wrote:
> >
> >> The 2^31-1 in there seems to indicate an overflow somewhere - the way we
> >> were able to figure out where exactly is to query the PG and compare
> >> the "up" and "acting" sets - only _one_ of them had the 2^31-1 number
> >> in place of the correct OSD number. We restarted that and the PG
> >> started doing its job and recovered.
> >
> > no, this value is intentional (and shows up as 'None' in higher-level
> > tools), it means no mapping could be found
>
> thanks, didn't know.
>
> > check your crush map and crush rule
>
> if it were indeed a crush rule or map issue, it would not have been
> resolved by just restarting the primary OSD of the PG, would it?
>
> the crush rule was created by running
>
> ceph osd erasure-code-profile set ec42 k=4 m=2 crush-device-class=nvme
>
> where the default failure domain is host; as I said, we have 12 hosts,
> so I don't see anything wrong here - it's all defaults...
>
> this is why I suspect a bug, just don't have any evidence other than
> that it happened to us :)
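
In case someone wants to confirm they are hitting the same unmapped-slot
situation before and after bumping the tunable, here is a minimal sketch;
the PG id 7.1a, the rule id 1 and num-rep 6 (k=4 + m=2) are placeholders,
so substitute your own values:

# show the current CRUSH tunables, including choose_total_tries
ceph osd crush show-tunables

# list PGs that have an unmapped slot (2147483647, i.e. 2^31-1 / "NONE")
# in their up or acting set
ceph pg dump pgs_brief | grep 2147483647

# inspect a single PG's up and acting sets directly
ceph pg 7.1a query | grep -E -A 8 '"(up|acting)"'

# dry-run the mapping against the CRUSH map extracted above ("crush");
# bad mappings are the ones CRUSH could not fill completely
crushtool -i crush --test --rule 1 --num-rep 6 --show-bad-mappings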