Re: crush hierarchy backwards and upmaps ...

Hi Dan,

Your comment is very important: https://tracker.ceph.com/issues/57348

By the way, is anyone looking at new cases? I have submitted a couple since spring; in the past it took no more than 2 weeks to a month until someone assigned them to a project, but since spring I haven't seen any such activity. The issue tracker seems to have turned into a black hole. Do you know what the reason might be?

Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Dan van der Ster <dvanders@xxxxxxxxx>
Sent: 11 October 2022 19:39:11
To: Christopher Durham
Cc: Ceph Users
Subject:  Re: crush hierarchy backwards and upmaps ...

Hi Chris,

Just curious, does this rule make sense and help with the multi-level crush
map issue?
(Maybe it also results in zero movement, or at least less than the
alternative you proposed?)

    step choose indep 4 type rack
    step chooseleaf indep 2 type chassis
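
In full, reusing the pool and root names from your rule further down (so
this is just an untested sketch of the same idea, with the id kept as in
your existing rule), that would be roughly:

rule mypoolname {
    id -5
    type erasure
    step take myroot
    step choose indep 4 type rack          # pick 4 racks
    step chooseleaf indep 2 type chassis   # 2 chassis per rack, leaf = osd
    step emit
}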

Cheers, Dan




On Tue, Oct 11, 2022, 19:29 Christopher Durham <caduceus42@xxxxxxx> wrote:

> Dan,
>
> Thank you.
>
> I did what you said regarding --test-map-pgs-dump and it wants to move 3
> OSDs in every PG. Yuk.
>
> So before I do that, I tried this rule, after changing all my 'pod' bucket
> definitions to 'chassis', and compiling and injecting the new crushmap
> into an osdmap:
>
>
> rule mypoolname {
>     id -5
>     type erasure
>     step take myroot
>     step choose indep 4 type rack
>     step choose indep 2 type chassis
>     step chooseleaf indep 1 type host
>     step emit
> }
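>
> (For reference, the compile-and-inject step I mean is the same sequence I
> describe in more detail further down in this thread, roughly:)
>
>     crushtool -c /tmp/crush.txt -o /tmp/crush.bin
>     osdmaptool /tmp/om --import-crush /tmp/crush.bin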
>
> --test-pg-upmap-entries shows there were NO changes to be done after
> comparing it with the original!!!
>
> However, --upmap-cleanup says:
>
> verify_upmap number of buckets 8 exceeds desired number of 2
> check_pg_upmaps verify_upmap of poolid.pgid returning -22
>
> This is output for every current upmap, but I really do want 8 total
> buckets per PG, as my pool is a 6+2.
>
> The upmap-cleanup output wants me to remove all of my upmaps.
>
> This seems consistent with a bug report saying there is a problem with the
> balancer on a multi-level rule such as the above, albeit on 14.2.x. Any
> thoughts?
>
> https://tracker.ceph.com/issues/51729
>
> I am leaning towards just eliminating the middle step and going directly
> from rack to host, even though it wants to move a LARGE amount of data
> according to a diff of --test-pg-upmap-entries before and after. In this
> scenario, I don't see any unexpected errors with --upmap-cleanup, and I do
> not want to get stuck.
>
> rule mypoolname {
>     id -5
>     type erasure
>     step take myroot
>     step choose indep 4 type rack
>     step chooseleaf indep 2 type host
>     step emit
> }
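>
> (Side note: one way to sanity-check a rule like this against the compiled
> map before injecting it is crushtool's test mode; the rule number below is
> just a placeholder for whatever number the decompiled map shows for this
> rule:)
>
>     # simulate 6+2 placements for the rule; any output from
>     # --show-bad-mappings means crush could not find 8 OSDs for a PG
>     crushtool -i /tmp/crush.bin --test --rule <rule-num> --num-rep 8 \
>         --min-x 0 --max-x 1023 --show-bad-mappings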
>
> -Chris
>
>
> -----Original Message-----
> From: Dan van der Ster <dvanders@xxxxxxxxx>
> To: Christopher Durham <caduceus42@xxxxxxx>
> Cc: Ceph Users <ceph-users@xxxxxxx>
> Sent: Mon, Oct 10, 2022 12:22 pm
> Subject:  Re: crush hierarchy backwards and upmaps ...
>
> Hi,
>
> Here's a similar bug: https://tracker.ceph.com/issues/47361
>
> Back then, upmap would generate mappings that invalidate the crush rule. I
> don't know if that is still the case, but indeed you'll want to correct
> your rule.
>
> Something else you can do before applying the new crush map is to use
> osdmaptool to compare the PG placement before and after, something like:
>
> osdmaptool --test-map-pgs-dump osdmap.before > before.txt
>
> osdmaptool --test-map-pgs-dump osdmap.after > after.txt
>
> diff -u before.txt after.txt
>
> The above will help you estimate how much data will move after injecting
> the fixed crush map. So depending on the impact you can schedule the change
> appropriately.
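>
> (To be explicit about where those two files come from: osdmap.before is
> just the current map from ceph osd getmap, and osdmap.after is a copy of
> it with your recompiled crush map imported, where crush.new.bin below
> stands in for that recompiled map:)
>
> ceph osd getmap -o osdmap.before
>
> cp osdmap.before osdmap.after
>
> osdmaptool osdmap.after --import-crush crush.new.bin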
>
> I also recommend keeping a backup of the previous crushmap so that you can
> quickly restore it if anything goes wrong.
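>
> (For the backup itself, something along these lines should do, with the
> filename being arbitrary:)
>
> ceph osd getcrushmap -o crush.backup.bin
>
> # and, if you ever need to roll back:
> ceph osd setcrushmap -i crush.backup.bin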
>
> Cheers, Dan
>
>
>
>
>
> On Mon, Oct 10, 2022, 19:31 Christopher Durham <caduceus42@xxxxxxx> wrote:
>
> > Hello,
> > I am using pacific 16.2.10 on Rocky 8.6 Linux.
> >
> > After setting upmap_max_deviation to 1 on the ceph balancer in ceph-mgr,
> > I achieved a near perfect balance of PGs and space on my OSDs. This is
> > great.
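> >
> > (For reference, the deviation setting I mean is the balancer module
> > option, set along the lines of:)
> >
> >    ceph config set mgr mgr/balancer/upmap_max_deviation 1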
> >
> > However, I started getting the following errors in my ceph-mon logs, every
> > three minutes, for each of the OSDs that had been mapped by the balancer:
> >    2022-10-07T17:10:39.619+0000 7f7c2786d700 1 verify_upmap unable to
> >    get parent of osd.497, skipping for now
> >
> > After banging my head against the wall for a bit trying to figure this
> > out, I think I have discovered the issue:
> >
> > Currently, I have my EC pool configured with the following crush rule:
> >
> > rule mypoolname {
> >    id -5
> >    type erasure
> >    step take myroot
> >    step choose indep 4 type rack
> >    step choose indep 2 type pod
> >    step chooseleaf indep 1 type host
> >    step emit
> > }
> >
> > Basically, pick 4 racks, then 2 pods in each rack, and then one host in
> > each pod, for a total of 8 chunks. (The pool is a 6+2.) The 4 racks are
> > chosen from the myroot root entry, which is as follows:
> >
> >
> > root myroot {
> >    id -400
> >    item rack1 weight N
> >    item rack2 weight N
> >    item rack3 weight N
> >    item rack4 weight N
> > }
> >
> > This has worked fine since inception, over a year ago, and the PGs are all
> > as I expect, with OSDs from the 4 racks and not on the same host or pod.
> >
> > The verify_upmap errors above started after I set upmap_max_deviation to 1
> > in the balancer and let it move things around, creating pg_upmap entries.
> >
> > I then discovered, while trying to figure this out, that the crush types
> > are:
> >
> > type 0 osd
> > type 1 host
> > type 2 chassis
> > type 3 rack
> > ...
> > type 6 pod
> >
> > So pod is HIGHER in the hierarchy than rack, but I have it lower in my
> > rule.
> >
> > What I want to do is remove the pods completely to work around this.
> > Something like:
> >
> > rule mypoolname {
> >        id -5
> >        type erasure
> >        step take myroot
> >        step choose indep 4 type rack
> >        step chooseleaf indep 2 type host
> >        step emit
> > }
> >
> > This will pick 4 racks and then 2 hosts in each rack. Will this cause any
> > problems? I can add the pod stuff back later as 'chassis' instead. I can
> > live without the 'pod' separation if needed.
> >
> > To test this, I tried doing something like this:
> >
> > 1. grab the osdmap:
> >    ceph osd getmap -o /tmp/om
> > 2. pull out the crushmap:
> >    osdmaptool /tmp/om --export-crush /tmp/crush.bin
> > 3. convert it to text:
> >    crushtool -d /tmp/crush.bin -o /tmp/crush.txt
> >
> > I then edited the rule for this pool as above, to remove the pod step and
> > go directly to pulling 4 racks and then 2 hosts in each rack. I then
> > compiled the crush map and imported it into the extracted osdmap:
> >
> >    crushtool -c /tmp/crush.txt -o /tmp/crush.bin
> >    osdmaptool /tmp/om --import-crush /tmp/crush.bin
> >
> > I then ran upmap-cleanup on the new osdmap:
> >
> >    osdmaptool /tmp/om --upmap-cleanup
> >
> > I did NOT get any of the verify_upmap messages (but it did generate some
> > rm-pg-upmap-items and some new upmaps in the list of commands to execute).
> >
> > When I did the extraction of the osdmap WITHOUT any changes to it, and
> > then ran the upmap-cleanup, I got the same verify_upmap errors I am now
> > seeing in the ceph-mon logs.
> >
> > So, should I just change the crushmap to remove the wrong rack->pod->host
> > hierarchy, making it rack->host? Will I have other issues? I am surprised
> > that crush allowed me to create this out-of-order rule to begin with.
> >
> > Thanks for any suggestions.
> >
> > -Chris
>
> >
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



