Re: crush hierarchy backwards and upmaps ...

Hi,

On Thu, Oct 13, 2022 at 8:14 PM Christopher Durham <caduceus42@xxxxxxx> wrote:
>
>
> Dan,
>
> Again, I am using 16.2.10 on Rocky 8.
>
> I decided to take a step back and check a variety of options before I do anything. Here are my results.
>
> If I use this rule:
>
> rule mypoolname {
>     id -5
>     type erasure
>     step take myroot
>     step choose indep 4 type rack
>     step choose indep 2 type chassis
>     step chooseleaf indep 1 type host
>     step emit
> }
>
> This changes all the pod definitions to type chassis. I get NO moves when running osdmaptool --test-pg-upmap-items
> and comparing to the current map. But --upmap-cleanup gives:
>
> check_pg_upmaps verify upmap of pool.pgid returning -22
> verify_upmap number of buckets 8 exceeds desired 2
>
> for each of my existing upmaps. And it wants to remove them all.
>
> If I use the rule:
>
> rule mypoolname {
>     id -5
>     type erasure
>     step take myroot
>     step choose indep 4 type rack
>     step chooseleaf indep 2 type chassis
>     step emit
> }
>
> I get almost 1/2 my data moving as per osdmaptool --test-pg-upmap-items.
>
> With --upmap-cleanup I get:
>
> verify_upmap multiple osds N,M come from the same failure domain -382
> check_pg_upmap verify upmap of pg poolid.pgid returning -22.
>
> For about 1/8 of my upmaps. And it wants to remove these and add about 100 more.
> Although I suspect that this will be rectified after things are moved and such. Am I correct?

Yes, that should be corrected as you suspect. The tracker I linked you
to earlier was indeed pointing out that with unordered crush buckets,
the balancer would create mappings which break the crush failure
domains.
It's good that those are detected and cleaned up now -- leaving them in
would lead to unexpected cluster outages (e.g. rebooting a "pod" would
take down a PG because more than the expected 2 shards would have been
placed in it).
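
By the way, if you want to see exactly which entries the cleanup will be
touching, you can list the upmap exceptions currently stored in the
osdmap with something like:

    # one line per PG that currently has an upmap exception
    ceph osd dump | grep pg_upmap_items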

> If I use the following rule (after changing my rack definitions to directly contain the hosts that were previously
> part of the pods or chassis):
>
> rule mypoolname {
>     id -5
>     type erasure
>     step take myroot
>     step choose indep 4 type rack
>     step chooseleaf indep 2 type host
>     step emit
> }
>
> I get almost all my data moving as per osdmaptool --test-pg-upmap-items.
>
> With --upmap-cleanup, I get only 10 of these:
>
> verify_upmap multiple osds N,M come from the same failure domain -382
> check_pg_upmap verify upmap of pg poolid.pgid returning -22.
>
> But upmap-cleanup wants to remove all my upmaps, which may actually make sense if we
> redo the entire map this way.
>
> I am curious, for the first rule, where I am getting the "number of buckets 8 exceeds desired 2" error, whether I am
> hitting this bug, which seems to suggest that I am having a problem because I have a multi-level (>2 levels) rule for an EC pool:
>
> https://tracker.ceph.com/issues/51729
>
>
> This bug appears to be on 14.x, but perhaps it exists on pacific as well.....

There appears to have been no progress on that bug for a year -- there's
no reason to think pacific has fixed it, and your observations seem to
confirm that.
I suggest you post to that ticket with your info.

Cheers, Dan

> It would be great if I could use the first rule, except for this bug. Perhaps the second rule is best at this point.
>
> Any other thoughts would be appreciated.
>
> -Chris
>
>
> -----Original Message-----
> From: Dan van der Ster <dvanders@xxxxxxxxx>
> To: Christopher Durham <caduceus42@xxxxxxx>
> Cc: Ceph Users <ceph-users@xxxxxxx>
> Sent: Tue, Oct 11, 2022 11:39 am
> Subject:  Re: crush hierarchy backwards and upmaps ...
>
> Hi Chris,
>
> Just curious, does this rule make sense and help with the multi-level crush
> map issue?
> (Maybe it also results in zero movement, or at least less than the
> alternative you proposed?)
>
>     step choose indep 4 type rack
>     step chooseleaf indep 2 type chassis
>
> Cheers, Dan
>
>
>
>
> On Tue, Oct 11, 2022, 19:29 Christopher Durham <caduceus42@xxxxxxx> wrote:
>
> > Dan,
> >
> > Thank you.
> >
> > I did what you said regarding --test-map-pgs-dump and it wants to move 3
> > OSDs in every PG. Yuk.
> >
> > So before I do that, I tried this rule, after changing all my 'pod' bucket
> > definitions to 'chassis', and compiling and
> > injecting the new crushmap into an osdmap:
> >
> >
> > rule mypoolname {
> >    id -5
> >    type erasure
> >    step take myroot
> >    step choose indep 4 type rack
> >    step choose indep 2 type chassis
> >    step chooseleaf indep 1 type host
> >    step emit
> >
> > }
> >
> > --test-pg-upmap-entries shows there were NO changes to be done after
> > comparing it with the original!!!
> >
> > However, --upmap-cleanup says:
> >
> > verify_upmap number of buckets 8 exceeds desired number of 2
> > check_pg_upmaps verify_upmap of poolid.pgid returning -22
> >
> > This is output for every current upmap, but I really do want 8 total
> > buckets per PG, as my pool is a 6+2.
> >
> > The upmap-cleanup output wants me to remove all of my upmaps.
> >
> > This seems consistent with a bug report that says that there is a problem
> > with the balancer on a
> > multi-level rule such as the above, albeit on 14.2.x. Any thoughts?
> >
> > https://tracker.ceph.com/issues/51729
> >
> > I am leaning towards just eliminating the middle step and going directly from
> > rack to host, even though
> > it wants to move a LARGE amount of data according to a before-and-after diff
> > of --test-pg-upmap-entries.
> > In this scenario, I don't see any unexpected errors with --upmap-cleanup,
> > and I do not want to get stuck.
> >
> > rule mypoolname {
> >    id -5
> >    type erasure
> >    step take myroot
> >    step choose indep 4 type rack
> >    step chooseleaf indep 2 type host
> >    step emit
> > }
> >
> > -Chris
> >
> >
> > -----Original Message-----
> > From: Dan van der Ster <dvanders@xxxxxxxxx>
> > To: Christopher Durham <caduceus42@xxxxxxx>
> > Cc: Ceph Users <ceph-users@xxxxxxx>
> > Sent: Mon, Oct 10, 2022 12:22 pm
> > Subject:  Re: crush hierarchy backwards and upmaps ...
> >
> > Hi,
> >
> > Here's a similar bug: https://tracker.ceph.com/issues/47361
> >
> > Back then, upmap would generate mappings that invalidate the crush rule. I
> > don't know if that is still the case, but indeed you'll want to correct
> > your rule.
> >
> > Something else you can do before applying the new crush map is use
> > osdmaptool to compare the PGs placement before and after, something like:
> >
> > osdmaptool --test-map-pgs-dump osdmap.before > before.txt
> >
> > osdmaptool --test-map-pgs-dump osdmap.after > after.txt
> >
> > diff -u before.txt after.txt
> >
> > The above will help you estimate how much data will move after injecting
> > the fixed crush map. So depending on the impact you can schedule the change
> > appropriately.
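> >
> > To get a rough count of how many PGs would change, rather than eyeballing
> > the whole diff, something like this should work (it over-counts slightly,
> > since the summary lines at the end of the dump differ too):
> >
> > diff before.txt after.txt | grep -c '^>'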
> >
> > I also recommend keeping a backup of the previous crushmap so that you can
> > quickly restore it if anything goes wrong.
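> >
> > For example, something along these lines (the file name is just an example):
> >
> > # save the current crushmap before injecting the new one
> > ceph osd getcrushmap -o crush.backup
> > # ...and to roll back if needed:
> > ceph osd setcrushmap -i crush.backup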
> >
> > Cheers, Dan
> >
> >
> >
> >
> >
> > On Mon, Oct 10, 2022, 19:31 Christopher Durham <caduceus42@xxxxxxx> wrote:
> >
> > > Hello,
> > > I am using pacific 16.2.10 on Rocky 8.6 Linux.
> > >
> > > After setting upmap_max_deviation to 1 on the ceph balancer in ceph-mgr, I
> > > achieved a near perfect balance of PGs and space on my OSDs. This is great.
> > >
> > > However, I started getting the following errors in my ceph-mon logs, every
> > > three minutes, for each of the OSDs that had been mapped by the balancer:
> > >    2022-10-07T17:10:39.619+0000 7f7c2786d700 1 verify_upmap unable to get
> > > parent of osd.497, skipping for now
> > >
> > > After banging my head against the wall for a bit trying to figure this
> > > out, I think I have discovered the issue:
> > >
> > > Currently, I have my EC pool configured with the following crush
> > > rule:
> > >
> > > rule mypoolname {
> > >    id -5
> > >    type erasure
> > >    step take myroot
> > >    step choose indep 4 type rack
> > >    step choose indep 2 type pod
> > >    step chooseleaf indep 1 type host
> > >    step emit
> > > }
> > >
> > > Basically, pick 4 racks, then 2 pods in each rack, and then one host in
> > > each pod, for a total of 8 chunks. (The pool is a 6+2.) The 4 racks are
> > > chosen from the myroot root entry, which is as follows.
> > >
> > >
> > > root myroot {
> > >    id -400
> > >    item rack1 weight N
> > >    item rack2 weight N
> > >    item rack3 weight N
> > >    item rack4 weight N
> > > }
> > >
> > > This has worked fine since inception, over a year ago. And the PGs are all
> > > as I expect with OSDs from the 4 racks and not on the same host or pod.
> > >
> > > The errors above, from verify_upmap, started after I set
> > > upmap_max_deviation to 1 in the balancer and had it move things around,
> > > creating pg_upmap entries.
> > >
> > > I then discovered, while trying to figure this out, that the device types
> > > are:
> > >
> > > type 0 osd
> > > type 1 host
> > > type 2 chassis
> > > type 3 rack
> > > ...
> > > type 6 pod
> > >
> > > So pod is HIGHER in the hierarchy than rack, but I have it lower in my
> > > rule.
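> > >
> > > (For reference, these type IDs are from the "# types" section of my
> > > decompiled crush map; the same list also shows up, under "types", in the
> > > JSON output of:
> > >
> > >    ceph osd crush dump
> > > )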
> > >
> > > What I want to do is remove the pods completely to work around this.
> > > Something like:
> > >
> > > rule mypoolname {
> > >        id -5
> > >        type erasure
> > >        step take myroot
> > >        step choose indep 4 type rack
> > >        step chooseleaf indep 2 type host
> > >        step emit
> > > }
> > >
> > > This will pick 4 racks and then 2 hosts in each rack. Will this cause any
> > > problems? I can add the pod stuff back later as 'chassis' instead. I can
> > > live without the 'pod' separation if needed.
> > >
> > > To test this, I tried doing something like this:
> > >
> > > 1. grab the osdmap:
> > >    ceph osd getmap -o /tmp/om
> > > 2. pull out the crushmap:
> > >    osdmaptool /tmp/om --export-crush /tmp/crush.bin
> > > 3. convert it to text:
> > >    crushtool -d /tmp/crush.bin -o /tmp/crush.txt
> > >
> > > I then edited the rule for this pool as above, to remove the pod level and
> > > go directly to pulling from 4 racks, then 2 hosts in each rack. I then
> > > compiled the crush map and imported it into the extracted osdmap:
> > >
> > >    crushtool -c /tmp/crush.txt -o /tmp/crush.bin
> > >    osdmaptool /tmp/om --import-crush /tmp/crush.bin
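> > >
> > > (At this point, I assume the rule itself could also be dry-run with
> > > crushtool, to confirm every input gets a full set of 8 OSDs under the new
> > > rule -- the rule number below is a placeholder for whatever id the rule
> > > has in the compiled map:
> > >
> > >    crushtool -i /tmp/crush.bin --test --rule <ruleid> --num-rep 8 --show-mappings
> > >    crushtool -i /tmp/crush.bin --test --rule <ruleid> --num-rep 8 --show-bad-mappings
> > > )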
> > >
> > > I then ran upmap-cleanup on the new osdmap:
> > >
> > >    osdmaptool /tmp/om --upmap-cleanup
> > >
> > > I did NOT get any of the verify_upmap messages (but it did generate some
> > > rm-pg-upmap-items and some new upmaps in the list of commands to
> > > execute).
> > >
> > > When I did the extraction of the osdmap WITHOUT any changes to it, and
> > > then ran the upmap-cleanup, I got the same verify_upmap errors I am now
> > > seeing in the ceph-mon logs.
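> > >
> > > (If I do go ahead with this for real, I assume the actual change is made
> > > by injecting only the compiled crushmap into the live cluster, e.g.:
> > >
> > >    ceph osd setcrushmap -i /tmp/crush.bin
> > >
> > > and not by touching the modified osdmap copy at all -- please correct me
> > > if that assumption is wrong.)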
> > >
> > > So, should I just change the crushmap to remove the wrong rack->pod->host
> > > hierarchy, making it rack->host?
> > > Will I have other issues? I am surprised that crush allowed me to create
> > > this out-of-order rule to begin with.
> > >
> > > Thanks for any suggestions.
> > >
> > > -Chris
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


