Re: PG down, due to 3 OSD failing

Yesss! Fixing the choose/chooseleaf thing worked like magic. :-)

Thanks a lot for your support Dan. Lots of lessons learned from my side, I'm really grateful.

  All PGs are now active, will let Ceph rebalance.

  Ciao ciao

			Fulvio

On 4/4/22 10:50, Dan van der Ster wrote:
Hi Fulvio,

Yes -- that choose/chooseleaf thing is definitely a problem. Good catch!
I suggest fixing it, injecting the new crush map, and seeing how it goes.
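
For reference, the usual decompile / edit / recompile / inject cycle
is something like this (file names are just placeholders):

   ceph osd getcrushmap -o crush.map
   crushtool -d crush.map -o crush.txt
   # edit crush.txt: "step choose indep 0 type host"
   #              -> "step chooseleaf indep 0 type host"
   crushtool -c crush.txt -o crush.new
   ceph osd setcrushmap -i crush.new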


Next, there is an error in your crush map concerning the "storage" type:

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root
type 11 storage

The *order* of types is very important in crush -- they must be nested
in the order they appear in the tree. "storage" should therefore be
something between host and osd.
If not, and if you use that type, it can break things.
But since you're not actually using "storage" at the moment, it
probably isn't causing any issue.
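
Just to illustrate (you don't need to do this now), a type list with
"storage" nested between osd and host would look roughly like this,
with all the later types renumbered accordingly:

   # types
   type 0 osd
   type 1 storage
   type 2 host
   type 3 chassis
   type 4 rack
   type 5 row
   type 6 pdu
   type 7 pod
   type 8 room
   type 9 datacenter
   type 10 region
   type 11 root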

So -- could you go ahead with that chooseleaf fix, then let us know how it goes?

Cheers, Dan





On Mon, Apr 4, 2022 at 10:01 AM Fulvio Galeazzi <fulvio.galeazzi@xxxxxxx> wrote:

Hi again Dan!
Things are improving, all OSDs are up, but that one PG is still down.
More info below.

On 4/1/22 19:26, Dan van der Ster wrote:
Here is the output of "pg 85.12 query":
           https://pastebin.ubuntu.com/p/ww3JdwDXVd/
     and its status (also showing the other 85.XX, for reference):

This is very weird:

       "up": [
           2147483647,
           2147483647,
           2147483647,
           2147483647,
           2147483647
       ],
       "acting": [
           67,
           91,
           82,
           2147483647,
           112
       ],

Meanwhile, since a random PG still shows output like the above, I think
I found the problem with the crush rule: it says "choose" rather than
"chooseleaf"!

rule csd-data-pool {
          id 5
          type erasure
          min_size 3
          max_size 5
          step set_chooseleaf_tries 5
          step set_choose_tries 100
          step take default class big
          step choose indep 0 type host    <--- HERE!
          step emit
}

...relic of a more complicated, two-step rule... sigh!
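
For the record, the fixed rule should then simply read (only the
marked step changes):

rule csd-data-pool {
        id 5
        type erasure
        min_size 3
        max_size 5
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take default class big
        step chooseleaf indep 0 type host
        step emit
}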

PGs are active if at least 3 shards are up.
Our immediate goal remains to get 3 shards up for PG 85.25 (I'm
assuming 85.25 remains the one and only PG which is down?)

Yes, 85.25 is still the single 'down' PG.

pool 85 'csd-dataonly-ec-pool' erasure size 5 min_size 3 crush_rule 5
object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode warn
last_change 616460 flags
hashpspool,ec_overwrites,nodelete,selfmanaged_snaps stripe_width 12288
application rbd

Yup, okay -- we need to fix that later so this cluster is correctly
configured. To be followed up.

At some point, need to update min_size to 4.
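
Once the cluster is healthy again, that would be something like this
(pool name taken from your dump above):

   ceph osd pool set csd-dataonly-ec-pool min_size 4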

If I understand correctly, it should now be safe (but I will wait for
your green light) to repeat the same for:
osd.121 chunk 85.11s0
osd.145 chunk 85.33s0
    so they can also start.

Yes, please go ahead and do the same.
I expect that your PG 85.25 will go active as soon as both those OSDs
start correctly.

Hmmm, unfortunately not. All OSDs are up, but 85.25 is still down.
Its chunks are in:

85.25s0: osd.64
85.25s1: osd.140 osd.159
85.25s2: osd.96
85.25s3: osd.121 osd.176
85.25s4: osd.159 osd.56
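
For cross-checking, the current up/acting mapping of that PG can also
be queried directly with something like:

   ceph pg map 85.25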

BTW, I also noticed in your crush map below that the down osds have
crush weight zero!
So -- this means they are the only active OSDs for a PG, and they are
all set to be drained.
How did this happen? It is also surely part of the root cause here!

I suggest resetting the crush weight of those OSDs back to what it was
before, probably 1?
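
That would be something along these lines, one command per affected
OSD (the ids and target weight here are just an example):

   ceph osd crush reweight osd.121 1.0
   ceph osd crush reweight osd.145 1.0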

At some point I changed those weights to 0, but this was well after the
beginning of the problem: it did at least help heal a lot of
degraded/undersized PGs.

After you have all the PGs active, we need to find out why their "up"
set is completely bogus.
This is evidence that your crush rule is broken.
If a PG doesn't have a complete "up" set, it will always remain
degraded -- the PGs don't know where to go.
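
One way to check that offline is crushtool's test mode, e.g. something
like this (rule id and replica count taken from the pool above):

   crushtool -i crush.map --test --rule 5 --num-rep 5 --show-bad-mappings

Any PG listed there is one that crush cannot map to a full set of OSDs.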

Do you think the choose/chooseleaf issue mentioned above could be the
culprit?

I'm curious about that "storage" type you guys invented.

Oh, nothing too fancy... as a foreword, we happen to be using (and are
currently, finally, replacing) hardware based on FiberChannel-SAN, which
is not the first choice in the Ceph world: but the purchase happened
before we turned to Ceph as our storage solution. Each OSD server has
access to 2 such distinct storage systems, hence the idea to describe
these failure domains in the crush rule.
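
Purely as an illustration of the shape it took (this is not the actual
old rule, just a sketch): first pick hosts, then pick the leaf across
the storage systems inside each host, e.g.

   step take default class big
   step choose indep 0 type host
   step chooseleaf indep 1 type storage
   step emit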

Could you please copy to pastebin and share the crush.txt from

ceph osd getcrushmap -o crush.map
crushtool -d crush.map -o crush.txt

Here it is:
         https://pastebin.ubuntu.com/p/THkcT6xNgC/

Sure! Here it is. For historical reasons there are buckets of type
"storage", which you can safely ignore, as they are no longer referenced
by any crush_rule.

I think they may be relevant, as mentioned earlier.

Please also don't worry about the funny weights, as I am preparing for
a hardware replacement and am freeing up space.

As a general rule, never drain osds (never decrease their crush
weight) when any PG is degraded.
You risk deleting the last copy of a PG!

--
Fulvio Galeazzi
GARR-CSD Department
tel.: +39-334-6533-250
skype: fgaleazzi70

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
