Hey Janek,

Ah, yes, we ran into that invalid JSON output in
https://github.com/digitalocean/ceph_exporter as well. I have a patch I
wrote for ceph_exporter that I can port over to pgremapper (it does
something similar to what your patch does).
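For reference, the workaround boils down to scrubbing the non-finite
tokens out of the raw command output before unmarshaling it. Below is a
minimal, untested sketch of that idea, not the actual ceph_exporter
patch; sanitizeCephJSON is just an illustrative name, and the regex only
tries to be as careful as ceph's output requires (it would still rewrite
an "inf" that happened to appear after a colon inside a string value):

package main

import (
	"encoding/json"
	"fmt"
	"regexp"
)

// Match a bare inf/-inf/nan used as a JSON value, i.e. between a ':' and
// the following ',', '}' or ']', so quoted string values are left alone.
var nonFiniteRE = regexp.MustCompile(`(:\s*)-?(?:inf|nan)(\s*[,\}\]])`)

// sanitizeCephJSON replaces the non-standard tokens with 0 so that
// encoding/json no longer fails with "invalid character 'i' looking for
// beginning of value".
func sanitizeCephJSON(raw []byte) []byte {
	return nonFiniteRE.ReplaceAll(raw, []byte("${1}0${2}"))
}

func main() {
	raw := []byte(`{"read_balance": {"score_acting": inf, "score_stable": inf, "optimal_score": 0}}`)

	var parsed map[string]interface{}
	if err := json.Unmarshal(sanitizeCephJSON(raw), &parsed); err != nil {
		panic(err)
	}
	fmt.Println(parsed["read_balance"])
}

In pgremapper, something like this would presumably slot in right before
the unmarshal in mustParseCephCommand (ceph.go:743 in your trace below).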
Josh

On Tue, Dec 17, 2024 at 9:38 AM Janek Bevendorff
<janek.bevendorff@xxxxxxxxxxxxx> wrote:
>
> Looks like there is something wrong with the .mgr pool. All others have
> proper values. For now I've patched the pgremapper source code to
> replace the inf values with 0 before unmarshaling the JSON. That at
> least made the tool work. I guess it's safe to just delete that pool
> and let the MGRs recreate it? (Is it?)
>
>
> On 17/12/2024 17:01, Janek Bevendorff wrote:
> > I checked the ceph osd dump json-pretty output and validated it with
> > a little Python script. Turns out, there's this somewhere around line
> > 1200:
> >
> >     "read_balance": {
> >         "score_acting": inf,
> >         "score_stable": inf,
> >         "optimal_score": 0,
> >         "raw_score_acting": 3,
> >         "raw_score_stable": 3,
> >         "primary_affinity_weighted": 0.9999845027923584,
> >         "average_primary_affinity": 1,
> >         "average_primary_affinity_weighted": 1
> >     }
> >
> > The inf values seem to be the problem. These are the only two invalid
> > JSON values in the whole file. Do you happen to know how I can
> > debug/fix this?
> >
> >
> > On 17/12/2024 16:17, Janek Bevendorff wrote:
> >> Thanks. I tried running the command (dry run for now), but
> >> something's not working as expected. Have you ever seen this?
> >>
> >> $ /root/go/bin/pgremapper cancel-backfill --verbose
> >> ** executing: ceph osd dump -f json
> >> panic: invalid character 'i' looking for beginning of value
> >>
> >> goroutine 1 [running]:
> >> main.mustParseCephCommand({0xc000b00000?, 0x0?}, {0x0?, 0x0?}, {0x59c9c0?, 0xc00011af30?})
> >>     /root/go/pkg/mod/github.com/digitalocean/pgremapper@v0.0.0-20240313130618-268522c0f6d5/ceph.go:743 +0xe6
> >> main.osdDump()
> >>     /root/go/pkg/mod/github.com/digitalocean/pgremapper@v0.0.0-20240313130618-268522c0f6d5/ceph.go:517 +0x53
> >> main.mustGetCurrentMappingState()
> >>     /root/go/pkg/mod/github.com/digitalocean/pgremapper@v0.0.0-20240313130618-268522c0f6d5/mappingstate.go:54 +0x1d
> >> main.glob..func9(0x73f540?, {0x5dabc9?, 0x1?, 0x1?})
> >>     /root/go/pkg/mod/github.com/digitalocean/pgremapper@v0.0.0-20240313130618-268522c0f6d5/main.go:133 +0x1ef
> >> github.com/spf13/cobra.(*Command).execute(0x73f540, {0xc0001188a0, 0x1, 0x1})
> >>     /root/go/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:856 +0x663
> >> github.com/spf13/cobra.(*Command).ExecuteC(0x73f040)
> >>     /root/go/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:960 +0x39c
> >> github.com/spf13/cobra.(*Command).Execute(...)
> >>     /root/go/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:897
> >> main.main()
> >>     /root/go/pkg/mod/github.com/digitalocean/pgremapper@v0.0.0-20240313130618-268522c0f6d5/main.go:740 +0x25
> >>
> >> Somehow it's choking here while trying to dump OSDs:
> >> https://github.com/digitalocean/pgremapper/blob/main/ceph.go#L741
> >>
> >> There isn't an issue report about this.
> >>
> >>
> >> On 17/12/2024 15:59, Janne Johansson wrote:
> >>>> You can use pg-remapper
> >>>> (https://github.com/digitalocean/pgremapper) or similar tools to
> >>>> cancel the remapping; up-map entries will be created that reflect
> >>>> the current state of the cluster. After all currently running
> >>>> backfills are finished your mons should not be blocked anymore.
> >>>> I would also disable the balancer temporarily since it will
> >>>> trigger new backfills for those PGs that are not at their optimal
> >>>> locations. After the mons are fine again you can just enable the
> >>>> balancer. This requires a ceph release and ceph clients with
> >>>> up-map support.
> >>>> Not tested in real life, but this approach might work.
> >>> We use that approach at times, just so that there isn't a long,
> >>> long queue of PGs in the remapped state, and as far as I can tell,
> >>> it is quite safe. You just programmatically tell each PG that there
> >>> is an upmap entry for it telling it to be exactly where it is now,
> >>> and then it isn't "misplaced" anymore. When you enable the balancer
> >>> it will take a percentage of these and just remove their individual
> >>> upmap entries, and they start to move as needed.
> >>> If you want only a small amount of movement, set the balancer's max
> >>> misplaced percentage to a really low value, and only a few PGs will
> >>> be moving at the same time. If your wpq/mclock settings work OK for
> >>> you, you can have a large percentage and let the IO scheduler
> >>> prioritize for you. But as Burkhard says, setting "norebalance" for
> >>> a moment, having the balancer disabled and then running one of
> >>> these tools once or twice will make all PGs active+clean where they
> >>> are, even if that isn't the desired end location for them. This
> >>> should help your mons a lot; then enable the balancer, unset
> >>> "norebalance" and let it finish the last PGs you have in the wrong
> >>> spot.
>
> --
> Bauhaus-Universität Weimar
> Bauhausstr. 9a, R308
> 99423 Weimar, Germany
>
> Phone: +49 3643 58 3577
> www.webis.de
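PS, for anyone who wants to see what Janne's "tell each PG that there is
an upmap entry for it telling it to be exactly where it is now" trick
boils down to: below is a rough, untested sketch of the core idea.
pinToActing is just an illustrative name, and the real cancel-backfill
logic in pgremapper also handles erasure-coded pools, existing upmap
entries and other edge cases that this ignores.

package main

import (
	"fmt"
	"strconv"
	"strings"
)

// pinToActing builds a "ceph osd pg-upmap-items" command that maps each
// OSD the PG is supposed to move to (in "up" but not in "acting") onto
// an OSD it is still on (in "acting" but not in "up"), which makes the
// current placement the intended one and cancels the backfill.
func pinToActing(pgid string, up, acting []int) string {
	inActing := make(map[int]bool, len(acting))
	for _, osd := range acting {
		inActing[osd] = true
	}
	inUp := make(map[int]bool, len(up))
	for _, osd := range up {
		inUp[osd] = true
	}

	var from, to []int
	for _, osd := range up {
		if !inActing[osd] {
			from = append(from, osd)
		}
	}
	for _, osd := range acting {
		if !inUp[osd] {
			to = append(to, osd)
		}
	}
	if len(from) == 0 || len(from) != len(to) {
		// Nothing to do, or not a simple one-for-one swap that this
		// sketch can express.
		return ""
	}

	args := []string{"ceph", "osd", "pg-upmap-items", pgid}
	for i := range from {
		args = append(args, strconv.Itoa(from[i]), strconv.Itoa(to[i]))
	}
	return strings.Join(args, " ")
}

func main() {
	// Example: PG 1.2f currently lives on OSDs 4,11,23 but CRUSH now
	// wants it on 4,11,42, so it would be backfilled to OSD 42.
	fmt.Println(pinToActing("1.2f", []int{4, 11, 42}, []int{4, 11, 23}))
	// Prints: ceph osd pg-upmap-items 1.2f 42 23
}

Applying the printed command (or letting pgremapper do it in bulk) makes
the PG active+clean where it already is, which is exactly the effect
Janne describes; removing the upmap entry later lets the balancer move
it to its CRUSH-preferred location again.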