Re: crush hierarchy backwards and upmaps ...

Dan,
Thank you.
I did what you said regarding --test-map-pgs-dump and it wants to move 3 OSDs in every PG. Yuk.
So before I do that, I tried this rule, after changing all my 'pod' bucket definitions to 'chassis', and compiling and injecting the new crushmap into a copy of the osdmap:


rule mypoolname {
    id -5
    type erasure
    step take myroot
    step choose indep 4 type rack
    step choose indep 2 type chassis
    step chooseleaf indep 1 type host
    step emit
}
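
For reference, the offline check was roughly the following (file names are just
examples):

    ceph osd getmap -o /tmp/om
    crushtool -c /tmp/crush.txt -o /tmp/crush.new.bin
    osdmaptool /tmp/om --import-crush /tmp/crush.new.bin
    osdmaptool /tmp/om --upmap-cleanup
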
--test-pg-upmap-entries shows there were NO changes to be done after comparing it with the original!!!
However, --upmap-cleanup says:
    verify_upmap number of buckets 8 exceeds desired number of 2
    check_pg_upmaps verify_upmap of poolid.pgid returning -22
This is output for every current upmap, but I really do want 8 total buckets per PG, as my pool is a 6+2. 

The upmap-cleanup output wants me to remove all of my upmaps.

This seems consistent with a bug report describing a problem with the balancer on a
multi-level rule like the one above, albeit on 14.2.x. Any thoughts?

https://tracker.ceph.com/issues/51729

I am leaning towards just eliminating the middle step and going directly from rack to host, even though it wants to move a LARGE amount of data according to a before/after diff of --test-pg-upmap-entries. In this scenario I don't see any unexpected errors with --upmap-cleanup, and I do not want to get stuck. The rule would be:

rule mypoolname {
    id -5
    type erasure
    step take myroot
    step choose indep 4 type rack
    step chooseleaf indep 2 type host
    step emit
}
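
As a sanity check before touching the cluster, I am also planning to run the
new rule through crushtool to confirm every PG still gets a full set of 8 OSDs
(the rule id below is just a placeholder for the real one):

    crushtool -c /tmp/crush.txt -o /tmp/crush.new.bin
    crushtool -i /tmp/crush.new.bin --test --rule 1 --num-rep 8 --show-bad-mappings
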
-Chris

 
-----Original Message-----
From: Dan van der Ster <dvanders@xxxxxxxxx>
To: Christopher Durham <caduceus42@xxxxxxx>
Cc: Ceph Users <ceph-users@xxxxxxx>
Sent: Mon, Oct 10, 2022 12:22 pm
Subject:  Re: crush hierarchy backwards and upmaps ...

Hi,

Here's a similar bug: https://tracker.ceph.com/issues/47361

Back then, upmap would generate mappings that invalidate the crush rule. I
don't know if that is still the case, but indeed you'll want to correct
your rule.

Something else you can do before applying the new crush map is use
osdmaptool to compare the PG placement before and after, something like:

osdmaptool --test-map-pgs-dump osdmap.before > before.txt

osdmaptool --test-map-pgs-dump osdmap.after > after.txt

diff -u before.txt after.txt

The above will help you estimate how much data will move after injecting
the fixed crush map. So depending on the impact you can schedule the change
appropriately.
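
If you just want a rough count of how many PGs would be remapped, something
like

diff before.txt after.txt | grep -c '^>'

gives an approximate number (the changed per-OSD summary lines at the end of
the dump are counted too, so treat it as a rough upper bound).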

I also recommend keeping a backup of the previous crushmap so that you can
quickly restore it if anything goes wrong.
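
Something like:

ceph osd getcrushmap -o crush.backup

and, if you need to roll back:

ceph osd setcrushmap -i crush.backup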

Cheers, Dan





On Mon, Oct 10, 2022, 19:31 Christopher Durham <caduceus42@xxxxxxx> wrote:

> Hello,
> I am using pacific 16.2.10 on Rocky 8.6 Linux.
>
> After setting upmap_max_deviation to 1 on the ceph balancer in ceph-mgr, I
> achieved a near perfect balance of PGs and space on my OSDs. This is great.
>
> However, I started getting the following errors in my ceph-mon logs, every
> three minutes, for each of the OSDs that had been mapped by the balancer:
>    2022-10-07T17:10:39.619+0000 7f7c2786d700 1 verify_upmap unable to get
> parent of osd.497, skipping for now
>
> After banging my head against the wall for a bit trying to figure this
> out, I think I have discovered the issue:
>
> Currently, I have my EC pool configured with the following crush rule:
>
> rule mypoolname {
>    id -5
>    type erasure
>    step take myroot
>    step choose indep 4 type rack
>    step choose indep 2 type pod
>    step chooseleaf indep 1 type host
>    step emit
> }
>
> Basically, pick 4 racks, then 2 pods in each rack, and then one host in
> each pod, for a total of 8 chunks. (The pool is a 6+2.) The 4 racks are
> chosen from the myroot root entry, which is as follows:
>
>
> root myroot {
>    id -400
>    item rack1 weight N
>    item rack2 weight N
>    item rack3 weight N
>    item rack4 weight N
> }
>
> This has worked fine since inception, over a year ago. And the PGs are all
> as I expect, with OSDs from all 4 racks and no two on the same host or pod.
>
> The verify_upmap errors above started after I set upmap_max_deviation to 1
> in the balancer and let it move things around, creating pg_upmap entries.
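>
> For reference, I set the deviation with roughly the following command (worth
> double-checking the exact option name on your version):
>
>    ceph config set mgr mgr/balancer/upmap_max_deviation 1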
>
> I then discovered, while trying to figure this out, that the CRUSH types
> are:
>
> type 0 osd
> type 1 host
> type 2 chassis
> type 3 rack
> ...
> type 6 pod
>
> So pod is HIGHER in the hierarchy than rack, but I have it lower in my
> rule.
>
> What I want to do is remove the pods completely to work around this.
> Something like:
>
> rule mypoolname {
>        id -5
>        type erasure
>        step take myroot
>        step choose indep 4 type rack
>        step chooseleaf indep 2 type host
>        step emit
> }
>
> This will pick 4 racks and then 2 hosts in each rack. Will this cause any
> problems? I can add the pod stuff back later as 'chassis' instead. I can
> live without the 'pod' separation if needed.
>
> To test this, I tried doing something like this:
>
> 1. grab the osdmap:
>    ceph osd getmap -o /tmp/om
> 2. pull out the crushmap:
>    osdmaptool /tmp/om --export-crush /tmp/crush.bin
> 3. convert it to text:
>    crushtool -d /tmp/crush.bin -o /tmp/crush.txt
>
> I then edited the rule for this pool as above, to remove the pod level and
> go directly to pulling 4 racks, then 2 hosts in each rack. I then compiled
> the crush map and imported it into the extracted osdmap:
>
>    crushtool -c /tmp/crush.txt -o /tmp/crush.bin
>    osdmaptool /tmp/om --import-crush /tmp/crush.bin
>
> I then ran upmap-cleanup on the new osdmap:
>
>    osdmaptool /tmp/om --upmap-cleanup
>
> I did NOT get any of the verify_upmap messages (but it did generate some
> rm-pg-upmap-items and some new upmaps in the list of commands to execute).
>
> When I did the extraction of the osdmap WITHOUT any changes to it, and
> then ran the upmap-cleanup, I got the same verify_upmap errors I am now
> seeing in the ceph-mon logs.
>
> So, should I just change the crushmap to remove the wrong rack->pod->host
> hierarchy, making it rack->host?
> Will I have other issues? I am surprised that crush allowed me to create
> this out-of-order rule to begin with.
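>
> If the answer is to just fix the rule, I assume the apply step would be
> roughly the following, probably with norebalance set first so I can control
> when the data movement starts:
>
>    crushtool -c /tmp/crush.txt -o /tmp/crush.new.bin
>    ceph osd setcrushmap -i /tmp/crush.new.bin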
>
> Thanks for any suggestions.
>
> -Chris
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



