Re: PG down, due to 3 OSD failing


 



Hi again Dan!
Things are improving: all OSDs are up, but that one PG is still down. More info below.

On 4/1/22 19:26, Dan van der Ster wrote:
Here is the output of "pg 85.12 query":
          https://pastebin.ubuntu.com/p/ww3JdwDXVd/
    and its status (also showing the other 85.XX, for reference):

This is very weird:

      "up": [
          2147483647,
          2147483647,
          2147483647,
          2147483647,
          2147483647
      ],
      "acting": [
          67,
          91,
          82,
          2147483647,
          112
      ],
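
If I read the docs correctly, 2147483647 is just 2^31-1, the placeholder
CRUSH returns when it cannot map an OSD into a slot, so an "up" set made
entirely of that value means the rule is currently not finding any OSDs at
all. The mapping can be re-checked at any time with, for example:

    ceph pg map 85.12

which prints the current "up" and "acting" sets for that PG.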

Meanwhile, since a random PG still shows output like the above, I think I have found the problem with the crush rule: it says "choose" rather than "chooseleaf"!

rule csd-data-pool {
        id 5
        type erasure
        min_size 3
        max_size 5
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take default class big
        step choose indep 0 type host    <--- HERE!
        step emit
}

...relic of a more complicated, two-step rule... sigh!
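
If that is indeed the problem, the fix should be limited to that one step,
i.e. something along these lines (just a sketch, to be tested offline before
injecting anything):

rule csd-data-pool {
        id 5
        type erasure
        min_size 3
        max_size 5
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take default class big
        step chooseleaf indep 0 type host
        step emit
}

so that CRUSH picks 5 distinct hosts and then descends to one OSD inside
each of them, instead of stopping at the host buckets.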

PGs are active if at least 3 shards are up.
Our immediate goal remains to get 3 shards up for PG 85.25 (I'm
assuming 85.25 remains the one and only PG which is down?)

Yes, 85.25 is still the single 'down' PG.

pool 85 'csd-dataonly-ec-pool' erasure size 5 min_size 3 crush_rule 5
object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode warn
last_change 616460 flags
hashpspool,ec_overwrites,nodelete,selfmanaged_snaps stripe_width 12288
application rbd

Yup okay, we need to fix that later to make this cluster correctly
configured. To be followed up.

At some point, need to update min_size to 4.
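
Once everything is back to active+clean, I guess that will just be (please
correct me if I am missing some other constraint):

    ceph osd pool set csd-dataonly-ec-pool min_size 4

so that PGs stop serving I/O when only k=3 shards are left, instead of
writing with no redundancy at all.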

If I understand correctly, it should now be safe (but I will wait for
your green light) to repeat the same for:
osd.121 chunk 85.11s0
osd.145 chunk 85.33s0
   so they can also start.

Yes, please go ahead and do the same.
I expect that your PG 85.25 will go active as soon as both those OSDs
start correctly.
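
Done on both OSDs, assuming "the same" is the export-then-remove of the
chunk with ceph-objectstore-tool while the OSD is stopped, as for the first
one (file names below are just examples):

    systemctl stop ceph-osd@121
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-121 \
        --pgid 85.11s0 --op export --file /root/85.11s0.export
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-121 \
        --pgid 85.11s0 --op remove --force
    systemctl start ceph-osd@121

(and likewise for osd.145 / 85.33s0), keeping the export files around in
case those shards turn out to be needed later.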

Hmmm, unfortunately not. All OSDs are up, but 85.25 is still down.
Its chunks are in:

85.25s0: osd.64
85.25s1: osd.140 osd.159
85.25s2: osd.96
85.25s3: osd.121 osd.176
85.25s4: osd.159 osd.56
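
(In case it is useful: with an OSD stopped, the shards it actually holds can
be double-checked with something like

    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-64 \
        --op list-pgs | grep 85.25

where the data path above is just the default location.)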

BTW, I also noticed in your crush map below that the down osds have
crush weight zero!
So -- this means they are the only active OSDs for a PG, and they are
all set to be drained.
How did this happen? It is also surely part of the root cause here!

I suggest resetting the crush weight of those back to what it was
before, probably 1?

At some point I changed those weights to 0, but that was well after the problem began: it did at least help heal a lot of the degraded/undersized PGs.
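
I will put the weights back. If 1 is indeed the original value, that should
just be, for example:

    ceph osd crush reweight osd.64 1.0

(and the same for the other zero-weight OSDs), then a quick check with

    ceph osd df tree

to confirm they are getting PGs mapped again.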

After you have all the PGs active, we need to find out why their "up"
set is completely bogus.
This is evidence that your crush rule is broken.
If a PG doesn't have a complete "up" set, then it can never stop being
degraded -- the PGs don't know where to go.

Do you think the choose-chooseleaf issue mentioned above, could be the culprit?
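
Before touching the live map, I guess I can verify that offline by editing
just that one step in the decompiled crush.txt (see the pastebin below) and
testing the recompiled map, along these lines:

    crushtool -c crush.txt -o crush.new
    crushtool -i crush.new --test --rule 5 --num-rep 5 --show-bad-mappings

and only inject it with

    ceph osd setcrushmap -i crush.new

once no bad mappings are reported. Does that sound reasonable?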

I'm curious about that "storage" type you guys invented.

Oh, nothing too fancy... as a foreword, we happen to be using (and are now finally replacing) Fibre Channel SAN-based hardware, which is not the first choice in the Ceph world: but the purchase happened before we turned to Ceph as our storage solution. Each OSD server has access to 2 such distinct storage systems, hence the idea of describing those failure domains in the crush rule.

Could you please copy to pastebin and share the crush.txt from

ceph osd getcrushmap -o crush.map
crushtool -d crush.map -o crush.txt

Here it is:
	https://pastebin.ubuntu.com/p/THkcT6xNgC/

Sure! Here it is. For historical reasons there are buckets of type
"storage", which you can safely ignore as they are no longer referenced
by any crush rule.

I think they may be relevant, as mentioned earlier.

Please also don't worry about the funny weights, as I am preparing for
hardware replacement and am freeing up space.

As a general rule, never drain osds (never decrease their crush
weight) when any PG is degraded.
You risk deleting the last copy of a PG!
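
Understood, lesson learned: from now on I will double-check that nothing is
degraded before any reweight/drain, e.g. with

    ceph health detail
    ceph pg dump_stuck degraded

Thanks again for all your help!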

--
Fulvio Galeazzi
GARR-CSD Department
tel.: +39-334-6533-250
skype: fgaleazzi70
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
