Hello,

I am using Pacific 16.2.10 on Rocky Linux 8.6. After setting upmap_max_deviation to 1 on the ceph balancer in ceph-mgr, I achieved a near-perfect balance of PGs and space on my OSDs. This is great. However, I started getting the following errors in my ceph-mon logs, every three minutes, for each of the OSDs that had been mapped by the balancer:

    2022-10-07T17:10:39.619+0000 7f7c2786d700  1 verify_upmap unable to get parent of osd.497, skipping for now

After banging my head against the wall for a bit trying to figure this out, I think I have discovered the issue. Currently, my EC pool is configured with the following crush rule:

    rule mypoolname {
            id -5
            type erasure
            step take myroot
            step choose indep 4 type rack
            step choose indep 2 type pod
            step chooseleaf indep 1 type host
            step emit
    }

Basically: pick 4 racks, then 2 pods in each rack, and then one host in each pod, for a total of 8 chunks. (The pool is a 6+2.) The 4 racks are chosen from the myroot root entry, which is as follows:

    root myroot {
            id -400
            item rack1 weight N
            item rack2 weight N
            item rack3 weight N
            item rack4 weight N
    }

This has worked fine since inception, over a year ago, and the PGs are all as I expect, with OSDs from the 4 racks and never on the same host or pod. The verify_upmap errors above started after I set upmap_max_deviation to 1 in the balancer and let it move things around, creating pg_upmap entries.

While trying to figure this out, I then discovered that the CRUSH types are:

    type 0 osd
    type 1 host
    type 2 chassis
    type 3 rack
    ...
    type 6 pod

So pod is HIGHER in the hierarchy than rack, but my rule has it lower.

What I want to do is remove the pods completely to work around this. Something like:

    rule mypoolname {
            id -5
            type erasure
            step take myroot
            step choose indep 4 type rack
            step chooseleaf indep 2 type host
            step emit
    }

This will pick 4 racks and then 2 hosts in each rack. Will this cause any problems? I can add the pod buckets back later as 'chassis' instead; I can live without the 'pod' separation if needed.

To test this, I tried the following:

1. Grab the osdmap:       ceph osd getmap -o /tmp/om
2. Pull out the crushmap: osdmaptool /tmp/om --export-crush /tmp/crush.bin
3. Convert it to text:    crushtool -d /tmp/crush.bin -o /tmp/crush.txt

I then edited the rule for this pool as above, removing the pod step and going directly from 4 racks to 2 hosts in each rack. I then compiled the crush map and imported it into the extracted osdmap:

    crushtool -c /tmp/crush.txt -o /tmp/crush.bin
    osdmaptool /tmp/om --import-crush /tmp/crush.bin

I then ran upmap-cleanup on the new osdmap:

    osdmaptool /tmp/om --upmap-cleanup

This time I did NOT get any of the verify_upmap messages (though it did generate some rm-pg-upmap-items commands and some new upmaps in the list of commands to execute). When I extracted the osdmap WITHOUT any changes and ran upmap-cleanup on it, I got the same verify_upmap errors I am now seeing in the ceph-mon logs.

So, should I just change the crushmap to remove the wrong rack->pod->host hierarchy, making it rack->host? Will I have other issues? I am surprised that crush allowed me to create this out-of-order rule in the first place.

Thanks for any suggestions.

-Chris
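
P.S. For completeness, here is the offline sanity check I am planning to run before injecting the new map into the live cluster. This is only a sketch: I am assuming here that the edited EC rule compiles to rule id 1 (the real id is visible in /tmp/crush.txt or via 'ceph osd crush rule dump'), and the 0-1023 input range is arbitrary.

    # recompile the edited text map
    crushtool -c /tmp/crush.txt -o /tmp/crush.bin

    # simulate placements for the 8-chunk EC rule; each mapping should show
    # 8 distinct OSDs spread across 4 racks, 2 hosts per rack
    crushtool -i /tmp/crush.bin --test --rule 1 --num-rep 8 \
        --min-x 0 --max-x 1023 --show-mappings

    # print only inputs that fail to map all 8 chunks (ideally no output)
    crushtool -i /tmp/crush.bin --test --rule 1 --num-rep 8 \
        --min-x 0 --max-x 1023 --show-bad-mappings

    # pause the balancer while the map changes, then inject and re-enable
    ceph balancer off
    ceph osd setcrushmap -i /tmp/crush.bin
    ceph balancer on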
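
P.P.S. A few read-only commands that were handy while chasing this, in case anyone else hits it:

    # the pg_upmap_items entries the balancer has created
    ceph osd dump | grep pg_upmap
    # the hierarchy and rules as the mons actually see them
    ceph osd crush tree
    ceph osd crush rule dump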