Re: unknown PGs after adding hosts in different subtree

Eugen Block <eblock@xxxxxx> · Thu, 23 May 2024 15:57:34 +0000

So this is the current status after adding two hosts outside of their rooms:

# ceph osd tree
ID   CLASS  WEIGHT   TYPE NAME           STATUS  REWEIGHT  PRI-AFF
 -1         0.37054  root default
-23         0.04678      host host5
 14    hdd  0.02339          osd.14          up   1.00000  1.00000
 15    hdd  0.02339          osd.15          up   1.00000  1.00000
-12         0.04678      host host6
  1    hdd  0.02339          osd.1           up   1.00000  1.00000
 13    hdd  0.02339          osd.13          up   1.00000  1.00000
 -8         0.09399      room room1
 -3         0.04700          host host1
  7    hdd  0.02299              osd.7       up   1.00000  1.00000
 10    hdd  0.02299              osd.10      up   1.00000  1.00000
 -5         0.04700          host host2
  4    hdd  0.02299              osd.4       up   1.00000  1.00000
 11    hdd  0.02299              osd.11      up   1.00000  1.00000
 -9         0.09299      room room2
-17         0.04599          host host7
  0    hdd  0.02299              osd.0       up   1.00000  1.00000
  2    hdd  0.02299              osd.2       up   1.00000  1.00000
 -7         0.04700          host host8
  5    hdd  0.02299              osd.5       up   1.00000  1.00000
  6    hdd  0.02299              osd.6       up   1.00000  1.00000
-21         0.09000      room room3
-11         0.04300          host host3
  8    hdd  0.01900              osd.8       up   1.00000  1.00000
  9    hdd  0.02299              osd.9       up   1.00000  1.00000
-15         0.04700          host host4
  3    hdd  0.02299              osd.3       up   1.00000  1.00000
 12    hdd  0.02299              osd.12      up   1.00000  1.00000
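
For anyone who wants to check a tree like the one above mechanically, here is a minimal sketch that lists hosts sitting directly under a root instead of inside a room. It parses the plain-text output and relies on its column alignment, which is an assumption on my side; the JSON form (ceph osd tree -f json) would probably be more robust. The function name and the shortened sample are only for illustration.

```python
import re

# Shortened copy of the tree above; leading whitespace matters here.
SAMPLE = """\
ID   CLASS  WEIGHT   TYPE NAME           STATUS  REWEIGHT  PRI-AFF
 -1         0.37054  root default
-23         0.04678      host host5
 14    hdd  0.02339          osd.14          up   1.00000  1.00000
-12         0.04678      host host6
 -8         0.09399      room room1
 -3         0.04700          host host1
  7    hdd  0.02299              osd.7       up   1.00000  1.00000
 -9         0.09299      room room2
-17         0.04599          host host7
"""

# Bucket rows have a negative ID, a weight, a type and a name (no class).
BUCKET = re.compile(
    r"^\s*(-\d+)\s+([\d.]+)\s+(?P<type>[a-z]+)\s+(?P<name>\S+)\s*$"
)


def hosts_directly_under_root(tree_text):
    """Return names of host buckets whose parent bucket is a root."""
    stack = []   # (column of the type token, type, name), outermost first
    found = []
    for line in tree_text.splitlines():
        m = BUCKET.match(line)
        if not m:
            continue  # header line or OSD row
        col = m.start("type")
        # Drop buckets that are at the same depth or deeper than this one.
        while stack and stack[-1][0] >= col:
            stack.pop()
        parent = stack[-1] if stack else None
        if m["type"] == "host" and parent and parent[1] == "root":
            found.append(m["name"])
        stack.append((col, m["type"], m["name"]))
    return found


print(hosts_directly_under_root(SAMPLE))
```

With the sample above this reports host5 and host6, the two hosts that were added outside of the rooms.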

And the current ceph status:

# ceph -s
  cluster:
    id:     543967bc-e586-32b8-bd2c-2d8b8b168f02
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum host1,host2,host3 (age 5d)
    mgr: host8.psefrq(active, since 76m), standbys: host4.frkktj, host1.vhylmr
    mds: 2/2 daemons up, 1 standby, 1 hot standby
    osd: 16 osds: 16 up (since 69m), 16 in (since 70m); 89 remapped pgs
    rgw: 2 daemons active (2 hosts, 1 zones)

  data:
    volumes: 2/2 healthy
    pools:   15 pools, 350 pgs
    objects: 576 objects, 341 MiB
    usage:   61 GiB used, 319 GiB / 380 GiB avail
    pgs:     256/2013 objects misplaced (12.717%)
             262 active+clean
             88  active+clean+remapped
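
The numbers in the data section can be cross-checked with a few lines of Python. This is only a sketch that parses the text shown above (the regular expressions are my own guess at the layout, not an official format); it recomputes the misplaced percentage and verifies that the PG states add up to the pool total.

```python
import re

# Data section copied from the status output above.
STATUS = """\
    pools:   15 pools, 350 pgs
    objects: 576 objects, 341 MiB
    pgs:     256/2013 objects misplaced (12.717%)
             262 active+clean
             88  active+clean+remapped
"""


def summarize(status_text):
    """Extract PG totals and the misplaced ratio from 'ceph -s' text."""
    misplaced, total = map(
        int, re.search(r"(\d+)/(\d+) objects misplaced", status_text).groups()
    )
    pg_total = int(re.search(r"pools, (\d+) pgs", status_text).group(1))
    # Lines that consist of a count followed by a PG state string.
    states = {
        state: int(count)
        for count, state in re.findall(
            r"^[ \t]*(\d+)[ \t]+([a-z+]+)[ \t]*$", status_text, re.M
        )
    }
    return {
        "pg_total": pg_total,
        "states": states,
        "misplaced_pct": round(100.0 * misplaced / total, 3),
    }


summary = summarize(STATUS)
print(summary)
```

Here 256/2013 comes out as 12.717 %, and 262 + 88 matches the 350 PGs reported for the 15 pools.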

I attached my osdmap, not sure if it will go through, though. Let me  
know if you need anything else.

Thanks!
Eugen

Quoting Eugen Block <eblock@xxxxxx>:

In my small lab cluster I can at least reproduce that a bunch of PGs
are remapped after adding hosts to the default root, even though the
hosts are not in their designated location yet. I have 3 "rooms"
underneath the default root. Although I can’t reproduce the unknown
PGs, maybe this is enough to investigate? I’m on my mobile right now,
I’ll add my own osdmap to the thread soon.

Quoting Eugen Block <eblock@xxxxxx>:

Thanks, Frank, I appreciate your help.
I already asked for the osdmap, but I’ll also try to find a reproducer.

Quoting Frank Schilder <frans@xxxxxx>:

Hi Eugen,

thanks for this clarification. Yes, with the observations you  
describe for transition 1->2, something is very wrong. Nothing  
should happen. Unfortunately, I'm going to be on holidays and,  
generally, don't have too much time. If they can afford to share  
the osdmap (ceph osd getmap -o file), I could also take a look at  
some point.

I don't think it has to do with set_choose_tries, there is likely  
something else screwed up badly. There should simply not be any  
remapping going on at this stage. Just for fun, you should be able  
to produce a clean crushmap from scratch with a similar or the  
same tree and check if you see the same problems.

Using the full osdmap with osdmaptool makes it possible to reproduce
the exact mappings as used in the cluster, and the osdmap encodes
other important information as well. That's why I'm asking for this
instead of just the crush map.
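
As a rough sketch of what working with a saved osdmap could look like, the helper below only assembles command lines; it does not run anything. The file names and the pool id are placeholders, and the choice of osdmaptool options (--print, --export-crush, --test-map-pgs-dump with --pool) is my own pick of what seems useful here, not something prescribed in this thread.

```python
import shlex


def osdmaptool_commands(osdmap="osdmap.bin", crush_out="crush.bin", pool=1):
    """Return shell command lines for inspecting a saved osdmap offline.

    All arguments are placeholders chosen for illustration.
    """
    return [
        # Save the current osdmap from the cluster to a file.
        shlex.join(["ceph", "osd", "getmap", "-o", osdmap]),
        # Print a human-readable summary of the saved map.
        shlex.join(["osdmaptool", osdmap, "--print"]),
        # Extract the embedded crush map for use with crushtool.
        shlex.join(["osdmaptool", osdmap, "--export-crush", crush_out]),
        # Dump the PG-to-OSD mappings for one pool as computed from the map.
        shlex.join(
            ["osdmaptool", osdmap, "--test-map-pgs-dump", "--pool", str(pool)]
        ),
    ]


for line in osdmaptool_commands():
    print(line)
```

The commands can be copied into a shell on any host that has the Ceph tools installed; only the first one needs access to the cluster.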

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Eugen Block <eblock@xxxxxx>
Sent: Thursday, May 23, 2024 1:26 PM
To: Frank Schilder
Cc: ceph-users@xxxxxxx
Subject: Re: Re: unknown PGs after adding hosts in different subtree

Hi Frank,

thanks for chiming in here.

> Please correct if this is wrong. Assuming it's correct, I conclude
> the following.

You assume correctly.

> Now, from your description it is not clear to me on which of the
> transitions 1->2 or 2->3 you observe
> - peering and/or
> - unknown PGs.

The unknown PGs were observed during/after 1 -> 2. All or almost all
PGs were reported as "remapped", I don't remember the exact number,
but it was more than 4k, and the largest pool has 4096 PGs. We didn't
see down OSDs at all.
Only after moving the hosts into their designated location (the DCs)
the unknown PGs cleared and the application resumed its operation.

I don't want to overload this thread but I asked for a copy of their
crushmap to play around a bit. I moved the new hosts out of the DCs
into the default root via 'crushtool --move ...', then running the
crushtool --test command

# crushtool -i crushmap --test --rule 1 --num-rep 18 \
    --show-choose-tries [--show-bad-mappings] --show-utilization

results in a couple of issues:

- there are lots of bad mappings no matter how high the number for
set_choose_tries is set
- the show-utilization output shows 240 OSDs in use (there were 240
OSDs before the expansion), but plenty of the mappings have only 9
chunks assigned:

rule 1 (rule-ec-k7m11), x = 0..1023, numrep = 18..18
rule 1 (rule-ec-k7m11) num_rep 18 result size == 0:     55/1024
rule 1 (rule-ec-k7m11) num_rep 18 result size == 9:     488/1024
rule 1 (rule-ec-k7m11) num_rep 18 result size == 18:    481/1024

And this reminds me of the inactive PGs we saw before I failed the
mgr; those inactive PGs showed only 9 chunks in the acting set. With
k=7 (and min_size=8) that should still be enough: we have successfully
tested disaster recovery with one entire DC down multiple times.

- with --show-mappings some lines contain an empty set like this:

CRUSH rule 1 x 22 []

And one more observation: with the currently active crushmap there are
no bad mappings at all when the hosts are in their designated location.
So there's definitely something wrong here, I just can't tell what it
is yet. I'll play a bit more with that crushmap...
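
To put numbers on the result sizes quoted above, here is a small sketch that parses those lines and groups the mappings by whether they are complete, incomplete but still at or above min_size, or below min_size. The regular expression is my own reading of the output format, and comparing a crushtool test mapping against min_size is only an approximation of what the cluster does with real acting sets.

```python
import re

# Output copied from the crushtool --test run above.
OUTPUT = """\
rule 1 (rule-ec-k7m11), x = 0..1023, numrep = 18..18
rule 1 (rule-ec-k7m11) num_rep 18 result size == 0:     55/1024
rule 1 (rule-ec-k7m11) num_rep 18 result size == 9:     488/1024
rule 1 (rule-ec-k7m11) num_rep 18 result size == 18:    481/1024
"""

LINE = re.compile(r"num_rep (\d+) result size == (\d+):\s+(\d+)/(\d+)")


def result_sizes(text):
    """Return {result size: number of mappings} from crushtool --test output."""
    return {int(size): int(count) for _, size, count, _ in LINE.findall(text)}


def classify(sizes, num_rep, min_size):
    """Split mappings into complete, short-but-usable and below min_size."""
    complete = sizes.get(num_rep, 0)
    usable = sum(c for s, c in sizes.items() if min_size <= s < num_rep)
    below = sum(c for s, c in sizes.items() if s < min_size)
    return {"complete": complete, "usable": usable, "below_min_size": below}


sizes = result_sizes(OUTPUT)
# k=7 and min_size=8 are taken from the thread; num_rep 18 from the rule test.
print(classify(sizes, num_rep=18, min_size=8))
```

For the output above this gives 481 complete mappings, 488 mappings with 9 chunks (still at or above min_size=8) and 55 empty mappings, which together account for all 1024 tested inputs.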

Thanks!
Eugen

Quoting Frank Schilder <frans@xxxxxx>:

Hi Eugen,

I'm afraid the description of your observation breaks a bit with
causality and this might be the reason for the few replies. To
produce a bit more structure for when exactly what happened, let's
look at what I did and didn't get:

Before adding the hosts you have situation

1)
default
DCA
  host A1 ... AN
DCB
  host B1 ... BM

Now you add K+L hosts, they go into the default root and we have situation

2)
default
host C1 ... CK, D1 ... DL
DCA
  host A1 ... AN
DCB
  host B1 ... BM

As a last step, you move the hosts to their final locations and we
arrive at situation

3)
default
DCA
  host A1 ... AN, C1 ... CK
DCB
  host B1 ... BM, D1 ... DL

Please correct if this is wrong. Assuming it's correct, I conclude
the following.

Now, from your description it is not clear to me on which of the
transitions 1->2 or 2->3 you observe
- peering and/or
- unknown PGs.

We use a somewhat similar procedure except that we have a second
root (separate disjoint tree) for new hosts/OSDs. However, in terms
of peering it is the same and if everything is configured correctly
I would expect this to happen (this is what happens when we add
OSDs/hosts):

transition 1->2: hosts get added: no peering, no remapped objects,
nothing, just new OSDs doing nothing
transition 2->3: hosts get moved: peering starts and remapped
objects appear, all PGs active+clean

Unknown PGs should not occur (maybe only temporarily when the
primary changes or the PG is slow to respond/report status??). The
crush bug with too few set_choose_tries is observed if one has *just
enough hosts* for the EC profile and should not be observed if all
PGs are active+clean and one *adds hosts*. Persistent unknown PGs
can (to my understanding, does unknown mean "has no primary"?) only
occur if the number of PGs changes (autoscaler messing around??)
because all PGs were active+clean before. The crush bug leads to
incomplete PGs, so PGs can go incomplete but they should always have
an acting primary.

This is assuming no OSDs went down/out during the process.

Can you please check if my interpretation is correct and describe at
which step exactly things start diverging from my expectations.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Eugen Block <eblock@xxxxxx>
Sent: Thursday, May 23, 2024 12:05 PM
To: ceph-users@xxxxxxx
Subject: Re: unknown PGs after adding hosts in different subtree

Hi again,

I'm still wondering if I misunderstand some of the ceph concepts.
Let's assume the choose_tries value is too low and ceph can't find
enough OSDs for the remapping. I would expect that there are some PG
chunks in remapping state or unknown or whatever, but why would it
affect the otherwise healthy cluster in such a way?
Even if ceph doesn't know where to put some of the chunks, I wouldn't
expect inactive PGs and have a service interruption.
What am I missing here?

Thanks,
Eugen

Quoting Eugen Block <eblock@xxxxxx>:

Thanks, Konstantin.
It's been a while since I was last bitten by choose_tries being
too low... Unfortunately, I won't be able to verify that... But I'll
definitely keep that in mind, or at least I'll try to. :-D

Thanks!

Quoting Konstantin Shalygin <k0ste@xxxxxxxx>:

Hi Eugen

On 21 May 2024, at 15:26, Eugen Block <eblock@xxxxxx> wrote:

> step set_choose_tries 100

I think you should try to increase set_choose_tries to 200.
Last year we had a Pacific EC 8+2 deployment of 10 racks. And even
with 50 hosts, the value of 100 did not work for us.

k

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
