Hi Eric,

yes, I had network restarts along the way as well. However, these should not lead to the redundancy degradation I observed either; they don't really explain why Ceph lost track of so many objects. A temporary network outage on a server is an event that the cluster ought to survive without such damage. What does "transitioning to Stray" mean/indicate here?

I did another test today and collected logs for a tracker issue. The problem is reproducible: it occurs when an "old" OSD is restarted, and it does not happen when a "new" OSD restarts. Ceph seems to lose track of any placement information computed according to the original crush map from before adding OSDs. It looks like PG remappings are deleted when an OSD shuts down and can only be recovered for placements according to the new crush map; hence the permanent loss of information on restart of an "old" OSD. If one restores the original crush map for long enough, for example by moving OSDs out of the sub-tree, the cluster can restore all PG remappings and restore full redundancy.

Correct behaviour would be either to maintain the remappings until an OSD is explicitly purged from the cluster, or to check for object locations with respect to all relevant crush maps in the history. Another option would be that every OSD checks on boot whether it holds a copy of a certain version of an object that the cluster is looking for (reports as missing) and says "hey, I have it here" if found. This is, in fact, what I expected to be implemented.

The current behaviour is a real danger. Rebalancing after a storage extension is *not* a degraded state, but a normal and usually lengthy maintenance operation. A network failure or host reboot during such an operation should not cause havoc, in particular not this kind of "virtual" data loss while all data is physically present and all hardware is healthy. Objects that are present on some disk should be found automatically under any circumstances.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Eric Smith <Eric.Smith@xxxxxxxxxx>
Sent: 05 August 2020 13:35:44
To: Frank Schilder; ceph-users
Subject: RE: Ceph does not recover from OSD restart

You have a LOT of state transitions during your maintenance and I'm not really sure why (there are a lot of complaints about the network). There are also a lot of "transitioning to Stray" messages after initial startup of an OSD. I'd say let your cluster heal first before you start doing a ton of maintenance so old PG maps can be trimmed. That's the best I can ascertain from the logs for now.

-----Original Message-----
From: Frank Schilder <frans@xxxxxx>
Sent: Tuesday, August 4, 2020 8:35 AM
To: Eric Smith <Eric.Smith@xxxxxxxxxx>; ceph-users <ceph-users@xxxxxxx>
Subject: Re: Ceph does not recover from OSD restart

If by monitor log you mean the cluster log /var/log/ceph/ceph.log, I should have all of it. Please find a tgz-file here: https://linkprotect.cudasvc.com/url?a=https%3a%2f%2ffiles.dtu.dk%2fu%2ftFCEZJzQhH2mUIRk%2flogs.tgz%3fl&c=E,1,uqVWoKuvpNjjLYU1JO2De96Pz8ZN-UBmy7cFmI6RllcEJg1Nboe8wNTzEx0kJ4WGDxciAY2Mnq_jWNncInKPg-wSwWzu2kV-ZmWlJVb_O9P-At48cWcXTDI9&typo=1 (valid 100 days).

Contents:
logs/ceph-2020-08-03.log          - cluster log for the day of restart
logs/ceph-osd.145.2020-08-03.log  - log of "old" OSD trimmed to day of restart
logs/ceph-osd.288.log             - entire log of "new" OSD

Hope this helps.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
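For anyone trying to reproduce the behaviour described above, a minimal sketch of the restart test, assuming systemd-managed OSDs; the OSD IDs 145 ("old") and 288 ("new") are taken from the logs above, and the exact command sequence is an illustration, not a copy of what was run:

# Baseline during rebalancing: only "misplaced" objects are expected, no "degraded" ones
ceph status
ceph health detail | grep -i degraded

# Restart an OSD that existed before the expansion (an "old" OSD)
systemctl restart ceph-osd@145
# -> a large number of degraded objects appears and does not recover by itself

# Restart one of the newly added OSDs (a "new" OSD)
systemctl restart ceph-osd@288
# -> only a handful of degraded objects (writes during the restart), which recover quickly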
________________________________________
From: Eric Smith <Eric.Smith@xxxxxxxxxx>
Sent: 04 August 2020 14:15:11
To: Frank Schilder; ceph-users
Subject: RE: Ceph does not recover from OSD restart

Do you have any monitor / OSD logs from the maintenance when the issues occurred?

-------- Original message --------
From: Frank Schilder <frans@xxxxxx>
Date: 8/4/20 8:07 AM (GMT-05:00)
To: Eric Smith <Eric.Smith@xxxxxxxxxx>, ceph-users <ceph-users@xxxxxxx>
Subject: Re: Ceph does not recover from OSD restart

Hi Eric,

thanks for the clarification, I did misunderstand you.

> You should not have to move OSDs in and out of the CRUSH tree however
> in order to solve any data placement problems (This is the baffling part).

Exactly. Should I create a tracker issue? I think this is not hard to reproduce with a standard crush map where host bucket = physical host, and I would, in fact, expect that this scenario is part of the integration tests.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Eric Smith <Eric.Smith@xxxxxxxxxx>
Sent: 04 August 2020 13:58:47
To: Frank Schilder; ceph-users
Subject: RE: Ceph does not recover from OSD restart

All seems in order in terms of your CRUSH layout. You can speed up the rebalancing / scale-out operations by increasing osd_max_backfills on each OSD (especially during off hours). The unnecessary degradation is not expected behavior with a cluster in HEALTH_OK status, but with backfill / rebalancing ongoing it's not unexpected. You should not have to move OSDs in and out of the CRUSH tree, however, in order to solve any data placement problems (this is the baffling part).

-----Original Message-----
From: Frank Schilder <frans@xxxxxx>
Sent: Tuesday, August 4, 2020 7:45 AM
To: Eric Smith <Eric.Smith@xxxxxxxxxx>; ceph-users <ceph-users@xxxxxxx>
Subject: Re: Ceph does not recover from OSD restart

Hi Eric,

I added the disks and started the rebalancing. When I ran into the issue, ca. 3 days after the start of rebalancing, it was about 25% done. The cluster does not go to HEALTH_OK before the rebalancing is finished; it shows the "xxx objects misplaced" warning. The OSD crush locations for the logical hosts are in ceph.conf, and the OSDs come up in the proper crush bucket.

> All seems in order then

In what sense? The rebalancing is still ongoing and usually a very long operation. This time I added only 9 disks, but we will almost triple the number of disks of a larger pool soon, which has 150 OSDs at the moment. I expect the rebalancing for this expansion to take months. Due to a memory leak, I need to restart OSDs regularly. Also, a host may restart or we might have a power outage during this window. In these situations, it will be a real pain if I have to play the crush-move game with 300+ OSDs. This unnecessary redundancy degradation on OSD restart cannot possibly be expected behaviour, or do I misunderstand something here?

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Eric Smith <Eric.Smith@xxxxxxxxxx>
Sent: 04 August 2020 13:19:41
To: Frank Schilder; ceph-users
Subject: RE: Ceph does not recover from OSD restart

All seems in order then - when you ran into your maintenance issue, how long was it after you added the new OSDs, and did Ceph ever get to HEALTH_OK so it could trim PG history? Also, did the OSDs just start back up in the wrong place in the CRUSH tree?
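As a side note, Eric's earlier suggestion to raise osd_max_backfills can be applied at runtime roughly as follows; the value 4 is only an example and the right number depends on the hardware:

# runtime change on all OSDs (not persistent across restarts)
ceph tell 'osd.*' injectargs '--osd_max_backfills 4'
# on Mimic and newer the setting can also be stored in the central config database
ceph config set osd osd_max_backfills 4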
-----Original Message-----
From: Frank Schilder <frans@xxxxxx>
Sent: Tuesday, August 4, 2020 7:10 AM
To: Eric Smith <Eric.Smith@xxxxxxxxxx>; ceph-users <ceph-users@xxxxxxx>
Subject: Re: Ceph does not recover from OSD restart

Hi Eric,

> Have you adjusted the min_size for pool sr-rbd-data-one-hdd

Yes. For all EC pools located in datacenter ServerRoom, we currently set min_size=k=6, because we lack physical servers. Hosts ceph-21 and ceph-22 are logical but not physical; the disks in these buckets are co-located such that no more than 2 host buckets share the same physical host. With failure domain = host, we can ensure that no more than 2 EC shards are on the same physical host. With m=2 and min_size=k we have continued service with any 1 physical host down for maintenance, and recovery will also happen if a physical host fails. Some objects will have no redundancy for a while then. We will increase min_size to k+1 as soon as we have 2 additional hosts and simply move the OSDs from buckets ceph-21/22 to these without rebalancing. The distribution of disks and buckets is listed below as well (longer listing).

Thanks and best regards,
Frank

# ceph osd erasure-code-profile ls
con-ec-8-2-hdd
con-ec-8-2-ssd
default
sr-ec-6-2-hdd

This is the relevant one:

# ceph osd erasure-code-profile get sr-ec-6-2-hdd
crush-device-class=hdd
crush-failure-domain=host
crush-root=ServerRoom
jerasure-per-chunk-alignment=false
k=6
m=2
plugin=jerasure
technique=reed_sol_van
w=8

Note that the pool sr-rbd-data-one (id 2) was created with this profile and later moved to SSD. Therefore, the crush rule does not match the profile's device class any more. These two are under different roots:

# ceph osd erasure-code-profile get con-ec-8-2-hdd
crush-device-class=hdd
crush-failure-domain=host
crush-root=ContainerSquare
jerasure-per-chunk-alignment=false
k=8
m=2
plugin=jerasure
technique=reed_sol_van
w=8

# ceph osd erasure-code-profile get con-ec-8-2-ssd
crush-device-class=ssd
crush-failure-domain=host
crush-root=ContainerSquare
jerasure-per-chunk-alignment=false
k=8
m=2
plugin=jerasure
technique=reed_sol_van
w=8
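As an aside, checking the current min_size and later raising it to k+1 as planned above would look roughly like this; the value 7 is the planned k+1 and was not actually applied here:

ceph osd pool get sr-rbd-data-one-hdd min_size
# once the two additional hosts are in place:
ceph osd pool set sr-rbd-data-one-hdd min_size 7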
Full physical placement information for OSDs under tree "datacenter ServerRoom":

---------------- ceph-04 ----------------
CONT       ID   BUCKET   SIZE    TYP
osd-phy0   243  ceph-04  1.8T    SSD
osd-phy1   247  ceph-21  1.8T    SSD
osd-phy2   254  ceph-04  1.8T    SSD
osd-phy3   256  ceph-04  1.8T    SSD
osd-phy4   286  ceph-04  1.8T    SSD
osd-phy5   287  ceph-04  1.8T    SSD
osd-phy6   288  ceph-04  10.7T   HDD
osd-phy7   48   ceph-04  372.6G  SSD
osd-phy8   264  ceph-21  1.8T    SSD
osd-phy9   84   ceph-04  8.9T    HDD
osd-phy10  72   ceph-21  8.9T    HDD
osd-phy11  145  ceph-04  8.9T    HDD
osd-phy14  156  ceph-04  8.9T    HDD
osd-phy15  168  ceph-04  8.9T    HDD
osd-phy16  181  ceph-04  8.9T    HDD
osd-phy17  0    ceph-21  8.9T    HDD
---------------- ceph-05 ----------------
CONT       ID   BUCKET   SIZE    TYP
osd-phy0   240  ceph-05  1.8T    SSD
osd-phy1   249  ceph-22  1.8T    SSD
osd-phy2   251  ceph-05  1.8T    SSD
osd-phy3   255  ceph-05  1.8T    SSD
osd-phy4   284  ceph-05  1.8T    SSD
osd-phy5   285  ceph-05  1.8T    SSD
osd-phy6   289  ceph-05  10.7T   HDD
osd-phy7   49   ceph-05  372.6G  SSD
osd-phy8   265  ceph-22  1.8T    SSD
osd-phy9   74   ceph-05  8.9T    HDD
osd-phy10  85   ceph-22  8.9T    HDD
osd-phy11  144  ceph-05  8.9T    HDD
osd-phy14  157  ceph-05  8.9T    HDD
osd-phy15  169  ceph-05  8.9T    HDD
osd-phy16  180  ceph-05  8.9T    HDD
osd-phy17  1    ceph-22  8.9T    HDD
---------------- ceph-06 ----------------
CONT       ID   BUCKET   SIZE    TYP
osd-phy0   244  ceph-06  1.8T    SSD
osd-phy1   246  ceph-21  1.8T    SSD
osd-phy2   253  ceph-06  1.8T    SSD
osd-phy3   257  ceph-06  1.8T    SSD
osd-phy4   282  ceph-06  1.8T    SSD
osd-phy5   283  ceph-06  1.8T    SSD
osd-phy6   40   ceph-06  372.6G  SSD
osd-phy7   50   ceph-06  372.6G  SSD
osd-phy8   60   ceph-06  8.9T    HDD
osd-phy9   290  ceph-06  10.7T   HDD
osd-phy10  291  ceph-21  10.7T   HDD
osd-phy11  146  ceph-06  8.9T    HDD
osd-phy14  158  ceph-06  8.9T    HDD
osd-phy15  170  ceph-06  8.9T    HDD
osd-phy16  182  ceph-06  8.9T    HDD
osd-phy17  2    ceph-21  8.9T    HDD
---------------- ceph-07 ----------------
CONT       ID   BUCKET   SIZE    TYP
osd-phy0   242  ceph-07  1.8T    SSD
osd-phy1   250  ceph-22  1.8T    SSD
osd-phy2   252  ceph-07  1.8T    SSD
osd-phy3   258  ceph-07  1.8T    SSD
osd-phy4   279  ceph-07  1.8T    SSD
osd-phy5   280  ceph-07  1.8T    SSD
osd-phy6   292  ceph-07  10.7T   HDD
osd-phy7   52   ceph-07  372.6G  SSD
osd-phy8   63   ceph-07  8.9T    HDD
osd-phy9   281  ceph-22  1.8T    SSD
osd-phy10  87   ceph-22  8.9T    HDD
osd-phy11  148  ceph-07  8.9T    HDD
osd-phy14  159  ceph-07  8.9T    HDD
osd-phy15  172  ceph-07  8.9T    HDD
osd-phy16  183  ceph-07  8.9T    HDD
osd-phy17  3    ceph-22  8.9T    HDD
---------------- ceph-18 ----------------
CONT       ID   BUCKET   SIZE    TYP
osd-phy0   241  ceph-18  1.8T    SSD
osd-phy1   248  ceph-18  1.8T    SSD
osd-phy2   41   ceph-18  372.6G  SSD
osd-phy3   31   ceph-18  372.6G  SSD
osd-phy4   277  ceph-18  1.8T    SSD
osd-phy5   278  ceph-21  1.8T    SSD
osd-phy6   53   ceph-21  372.6G  SSD
osd-phy7   267  ceph-18  1.8T    SSD
osd-phy8   266  ceph-18  1.8T    SSD
osd-phy9   293  ceph-18  10.7T   HDD
osd-phy10  86   ceph-21  8.9T    HDD
osd-phy11  259  ceph-18  10.9T   HDD
osd-phy14  229  ceph-18  8.9T    HDD
osd-phy15  232  ceph-18  8.9T    HDD
osd-phy16  235  ceph-18  8.9T    HDD
osd-phy17  238  ceph-18  8.9T    HDD
---------------- ceph-19 ----------------
CONT       ID   BUCKET   SIZE    TYP
osd-phy0   261  ceph-19  1.8T    SSD
osd-phy1   262  ceph-19  1.8T    SSD
osd-phy2   295  ceph-19  10.7T   HDD
osd-phy3   43   ceph-19  372.6G  SSD
osd-phy4   275  ceph-19  1.8T    SSD
osd-phy5   294  ceph-22  10.7T   HDD
osd-phy6   51   ceph-22  372.6G  SSD
osd-phy7   269  ceph-19  1.8T    SSD
osd-phy8   268  ceph-19  1.8T    SSD
osd-phy9   276  ceph-22  1.8T    SSD
osd-phy10  73   ceph-22  8.9T    HDD
osd-phy11  263  ceph-19  10.9T   HDD
osd-phy14  231  ceph-19  8.9T    HDD
osd-phy15  233  ceph-19  8.9T    HDD
osd-phy16  236  ceph-19  8.9T    HDD
osd-phy17  239  ceph-19  8.9T    HDD
---------------- ceph-20 ----------------
CONT       ID   BUCKET   SIZE    TYP
osd-phy0   245  ceph-20  1.8T    SSD
osd-phy1   28   ceph-20  372.6G  SSD
osd-phy2   44   ceph-20  372.6G  SSD
osd-phy3   271  ceph-20  1.8T    SSD
osd-phy4   272  ceph-20  1.8T    SSD
osd-phy5   273  ceph-20  1.8T    SSD
osd-phy6   274  ceph-21  1.8T    SSD
osd-phy7   296  ceph-20  10.7T   HDD
osd-phy8   76   ceph-21  8.9T    HDD
osd-phy9   39   ceph-21  372.6G  SSD
osd-phy10  270  ceph-20  1.8T    SSD
osd-phy11  260  ceph-20  10.9T   HDD
osd-phy14  228  ceph-20  8.9T    HDD
osd-phy15  230  ceph-20  8.9T    HDD
osd-phy16  234  ceph-20  8.9T    HDD
osd-phy17  237  ceph-20  8.9T    HDD

CONT is the container name and encodes the physical slot on the host where the OSD is located.

=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Eric Smith <Eric.Smith@xxxxxxxxxx>
Sent: 04 August 2020 12:47:12
To: Frank Schilder; ceph-users
Subject: RE: Ceph does not recover from OSD restart

Have you adjusted the min_size for pool sr-rbd-data-one-hdd at all? Also can you send the output of "ceph osd erasure-code-profile ls" and for each EC profile, "ceph osd erasure-code-profile get <profile>"?

-----Original Message-----
From: Frank Schilder <frans@xxxxxx>
Sent: Monday, August 3, 2020 11:05 AM
To: Eric Smith <Eric.Smith@xxxxxxxxxx>; ceph-users <ceph-users@xxxxxxx>
Subject: Re: Ceph does not recover from OSD restart

Sorry for the many small e-mails: requested IDs in the commands, 288-296. One new OSD per host.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
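For reference, one way to confirm where the newly added OSDs 288-296 ended up in the CRUSH tree; ceph osd find is a standard command, and the loop and grep pattern are only illustrations:

for id in $(seq 288 296); do ceph osd find $id; done
# or, more coarsely:
ceph osd tree | grep -E 'osd\.(28[89]|29[0-6]) '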
________________________________________
From: Frank Schilder <frans@xxxxxx>
Sent: 03 August 2020 16:59:04
To: Eric Smith; ceph-users
Subject: Re: Ceph does not recover from OSD restart

Hi Eric,

the procedure for re-discovering all objects is:

# Flag: norebalance
ceph osd crush move osd.288 host=bb-04
ceph osd crush move osd.289 host=bb-05
ceph osd crush move osd.290 host=bb-06
ceph osd crush move osd.291 host=bb-21
ceph osd crush move osd.292 host=bb-07
ceph osd crush move osd.293 host=bb-18
ceph osd crush move osd.295 host=bb-19
ceph osd crush move osd.294 host=bb-22
ceph osd crush move osd.296 host=bb-20

# Wait until all PGs are peered and recovery is done. In my case, there was only a little I/O,
# no more than 50-100 objects had writes missing, and recovery took a few seconds.
#
# The bb-hosts are under a separate crush root that I use solely as parking space
# and for draining OSDs.

ceph osd crush move osd.288 host=ceph-04
ceph osd crush move osd.289 host=ceph-05
ceph osd crush move osd.290 host=ceph-06
ceph osd crush move osd.291 host=ceph-21
ceph osd crush move osd.292 host=ceph-07
ceph osd crush move osd.293 host=ceph-18
ceph osd crush move osd.295 host=ceph-19
ceph osd crush move osd.294 host=ceph-22
ceph osd crush move osd.296 host=ceph-20

After peering, no degraded PGs/objects any more, just the misplaced ones as expected.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Eric Smith <Eric.Smith@xxxxxxxxxx>
Sent: 03 August 2020 16:45:28
To: Frank Schilder; ceph-users
Subject: RE: Ceph does not recover from OSD restart

You said you had to move some OSDs out and back in for Ceph to go back to normal (The OSDs you added). Which OSDs were added?

-----Original Message-----
From: Frank Schilder <frans@xxxxxx>
Sent: Monday, August 3, 2020 9:55 AM
To: Eric Smith <Eric.Smith@xxxxxxxxxx>; ceph-users <ceph-users@xxxxxxx>
Subject: Re: Ceph does not recover from OSD restart

Hi Eric,

thanks for your fast response. Below the output, shortened a bit as indicated. Disks have been added to pool 11 'sr-rbd-data-one-hdd' only; this is the only pool with remapped PGs and also the only pool that "loses track" of objects. Every other pool recovers from a restart by itself.
Best regards, Frank # ceph osd pool stats pool sr-rbd-meta-one id 1 client io 5.3 KiB/s rd, 3.2 KiB/s wr, 4 op/s rd, 1 op/s wr pool sr-rbd-data-one id 2 client io 24 MiB/s rd, 32 MiB/s wr, 380 op/s rd, 594 op/s wr pool sr-rbd-one-stretch id 3 nothing is going on pool con-rbd-meta-hpc-one id 7 nothing is going on pool con-rbd-data-hpc-one id 8 client io 0 B/s rd, 5.6 KiB/s wr, 0 op/s rd, 0 op/s wr pool sr-rbd-data-one-hdd id 11 53241814/346903376 objects misplaced (15.348%) client io 73 MiB/s rd, 3.4 MiB/s wr, 236 op/s rd, 69 op/s wr pool con-fs2-meta1 id 12 client io 106 KiB/s rd, 112 KiB/s wr, 3 op/s rd, 11 op/s wr pool con-fs2-meta2 id 13 client io 0 B/s wr, 0 op/s rd, 0 op/s wr pool con-fs2-data id 14 client io 5.5 MiB/s rd, 201 KiB/s wr, 34 op/s rd, 8 op/s wr pool con-fs2-data-ec-ssd id 17 nothing is going on pool ms-rbd-one id 18 client io 5.6 MiB/s wr, 0 op/s rd, 179 op/s wr # ceph osd pool ls detail pool 1 'sr-rbd-meta-one' replicated size 3 min_size 2 crush_rule 11 object_hash rjenkins pg_num 80 pgp_num 80 last_change 122597 flags hashpspool,nodelete,selfmanaged_snaps max_bytes 536870912000 stripe_width 0 application rbd removed_snaps [1~45] pool 2 'sr-rbd-data-one' erasure size 8 min_size 6 crush_rule 5 object_hash rjenkins pg_num 560 pgp_num 560 last_change 186437 lfor 0/126858 flags hashpspool,ec_overwrites,nodelete,selfmanaged_snaps max_bytes 43980465111040 stripe_width 24576 fast_read 1 compression_mode aggressive application rbd removed_snaps [1~3,5~2, ... huge list ... ,11f9d~1,11fa0~2] pool 3 'sr-rbd-one-stretch' replicated size 3 min_size 2 crush_rule 12 object_hash rjenkins pg_num 160 pgp_num 160 last_change 143202 lfor 0/79983 flags hashpspool,nodelete,selfmanaged_snaps max_bytes 1099511627776 stripe_width 0 compression_mode aggressive application rbd removed_snaps [1~7,b~2,11~2,14~2,17~9e,b8~1e] pool 7 'con-rbd-meta-hpc-one' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 50 pgp_num 50 last_change 96357 lfor 0/90462 flags hashpspool,nodelete,selfmanaged_snaps max_bytes 10737418240 stripe_width 0 application rbd removed_snaps [1~3] pool 8 'con-rbd-data-hpc-one' erasure size 10 min_size 9 crush_rule 7 object_hash rjenkins pg_num 150 pgp_num 150 last_change 96358 lfor 0/90996 flags hashpspool,ec_overwrites,nodelete,selfmanaged_snaps max_bytes 5497558138880 stripe_width 32768 fast_read 1 compression_mode aggressive application rbd removed_snaps [1~7,9~2] pool 11 'sr-rbd-data-one-hdd' erasure size 8 min_size 6 crush_rule 9 object_hash rjenkins pg_num 560 pgp_num 560 last_change 186331 lfor 0/127768 flags hashpspool,ec_overwrites,nodelete,selfmanaged_snaps max_bytes 219902325555200 stripe_width 24576 fast_read 1 compression_mode aggressive application rbd removed_snaps [1~59f,5a2~fe, ... less huge list ... ,2559~1,255b~1] removed_snaps_queue [1a64~5,1a6a~1,1a6c~1, ... long list ... 
,220a~1,220c~1] pool 12 'con-fs2-meta1' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 50 pgp_num 50 last_change 57096 flags hashpspool,nodelete max_bytes 268435456000 stripe_width 0 application cephfs pool 13 'con-fs2-meta2' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 50 pgp_num 50 last_change 96359 flags hashpspool,nodelete max_bytes 107374182400 stripe_width 0 application cephfs pool 14 'con-fs2-data' erasure size 10 min_size 9 crush_rule 8 object_hash rjenkins pg_num 1350 pgp_num 1350 last_change 96360 lfor 0/91144 flags hashpspool,ec_overwrites,nodelete max_bytes 879609302220800 stripe_width 32768 fast_read 1 compression_mode aggressive application cephfs pool 17 'con-fs2-data-ec-ssd' erasure size 10 min_size 9 crush_rule 10 object_hash rjenkins pg_num 55 pgp_num 55 last_change 96361 lfor 0/90473 flags hashpspool,ec_overwrites,nodelete max_bytes 1099511627776 stripe_width 32768 fast_read 1 compression_mode aggressive application cephfs pool 18 'ms-rbd-one' replicated size 3 min_size 2 crush_rule 12 object_hash rjenkins pg_num 150 pgp_num 150 last_change 143206 flags hashpspool,nodelete,selfmanaged_snaps max_bytes 1099511627776 stripe_width 0 compression_mode aggressive application rbd removed_snaps [1~3] # ceph osd tree ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF -40 2384.09058 root DTU -42 0 region Lyngby -41 2384.09058 region Risoe 2 sub-trees on level datacenter removed for brevity -49 586.49347 datacenter ServerRoom -55 586.49347 room SR-113 -65 64.33617 host ceph-04 84 hdd 8.90999 osd.84 up 1.00000 1.00000 145 hdd 8.90999 osd.145 up 1.00000 1.00000 156 hdd 8.90999 osd.156 up 1.00000 1.00000 168 hdd 8.90999 osd.168 up 1.00000 1.00000 181 hdd 8.90999 osd.181 up 0.95000 1.00000 288 hdd 10.69229 osd.288 up 1.00000 1.00000 243 rbd_data 1.74599 osd.243 up 1.00000 1.00000 254 rbd_data 1.74599 osd.254 up 1.00000 1.00000 256 rbd_data 1.74599 osd.256 up 1.00000 1.00000 286 rbd_data 1.74599 osd.286 up 1.00000 1.00000 287 rbd_data 1.74599 osd.287 up 1.00000 1.00000 48 rbd_meta 0.36400 osd.48 up 1.00000 1.00000 -67 64.33617 host ceph-05 74 hdd 8.90999 osd.74 up 1.00000 1.00000 144 hdd 8.90999 osd.144 up 1.00000 1.00000 157 hdd 8.90999 osd.157 up 0.84999 1.00000 169 hdd 8.90999 osd.169 up 0.95000 1.00000 180 hdd 8.90999 osd.180 up 0.89999 1.00000 289 hdd 10.69229 osd.289 up 1.00000 1.00000 240 rbd_data 1.74599 osd.240 up 1.00000 1.00000 251 rbd_data 1.74599 osd.251 up 1.00000 1.00000 255 rbd_data 1.74599 osd.255 up 1.00000 1.00000 284 rbd_data 1.74599 osd.284 up 1.00000 1.00000 285 rbd_data 1.74599 osd.285 up 1.00000 1.00000 49 rbd_meta 0.36400 osd.49 up 1.00000 1.00000 -69 64.70016 host ceph-06 60 hdd 8.90999 osd.60 up 1.00000 1.00000 146 hdd 8.90999 osd.146 up 1.00000 1.00000 158 hdd 8.90999 osd.158 up 0.95000 1.00000 170 hdd 8.90999 osd.170 up 0.89999 1.00000 182 hdd 8.90999 osd.182 up 1.00000 1.00000 290 hdd 10.69229 osd.290 up 1.00000 1.00000 244 rbd_data 1.74599 osd.244 up 1.00000 1.00000 253 rbd_data 1.74599 osd.253 up 1.00000 1.00000 257 rbd_data 1.74599 osd.257 up 1.00000 1.00000 282 rbd_data 1.74599 osd.282 up 1.00000 1.00000 283 rbd_data 1.74599 osd.283 up 1.00000 1.00000 40 rbd_meta 0.36400 osd.40 up 1.00000 1.00000 50 rbd_meta 0.36400 osd.50 up 1.00000 1.00000 -71 64.33617 host ceph-07 63 hdd 8.90999 osd.63 up 1.00000 1.00000 148 hdd 8.90999 osd.148 up 0.95000 1.00000 159 hdd 8.90999 osd.159 up 1.00000 1.00000 172 hdd 8.90999 osd.172 up 0.95000 1.00000 183 hdd 8.90999 osd.183 up 0.84999 1.00000 292 hdd 10.69229 osd.292 up 
1.00000 1.00000 242 rbd_data 1.74599 osd.242 up 1.00000 1.00000 252 rbd_data 1.74599 osd.252 up 1.00000 1.00000 258 rbd_data 1.74599 osd.258 up 1.00000 1.00000 279 rbd_data 1.74599 osd.279 up 1.00000 1.00000 280 rbd_data 1.74599 osd.280 up 1.00000 1.00000 52 rbd_meta 0.36400 osd.52 up 1.00000 1.00000 -81 66.70416 host ceph-18 229 hdd 8.90999 osd.229 up 1.00000 1.00000 232 hdd 8.90999 osd.232 up 1.00000 1.00000 235 hdd 8.90999 osd.235 up 1.00000 1.00000 238 hdd 8.90999 osd.238 up 0.95000 1.00000 259 hdd 10.91399 osd.259 up 1.00000 1.00000 293 hdd 10.69229 osd.293 up 1.00000 1.00000 241 rbd_data 1.74599 osd.241 up 1.00000 1.00000 248 rbd_data 1.74599 osd.248 up 1.00000 1.00000 266 rbd_data 1.74599 osd.266 up 1.00000 1.00000 267 rbd_data 1.74599 osd.267 up 1.00000 1.00000 277 rbd_data 1.74599 osd.277 up 1.00000 1.00000 31 rbd_meta 0.36400 osd.31 up 1.00000 1.00000 41 rbd_meta 0.36400 osd.41 up 1.00000 1.00000 -94 66.34016 host ceph-19 231 hdd 8.90999 osd.231 up 1.00000 1.00000 233 hdd 8.90999 osd.233 up 0.95000 1.00000 236 hdd 8.90999 osd.236 up 1.00000 1.00000 239 hdd 8.90999 osd.239 up 1.00000 1.00000 263 hdd 10.91399 osd.263 up 1.00000 1.00000 295 hdd 10.69229 osd.295 up 1.00000 1.00000 261 rbd_data 1.74599 osd.261 up 1.00000 1.00000 262 rbd_data 1.74599 osd.262 up 1.00000 1.00000 268 rbd_data 1.74599 osd.268 up 1.00000 1.00000 269 rbd_data 1.74599 osd.269 up 1.00000 1.00000 275 rbd_data 1.74599 osd.275 up 1.00000 1.00000 43 rbd_meta 0.36400 osd.43 up 1.00000 1.00000 -4 66.70416 host ceph-20 228 hdd 8.90999 osd.228 up 1.00000 1.00000 230 hdd 8.90999 osd.230 up 1.00000 1.00000 234 hdd 8.90999 osd.234 up 0.95000 1.00000 237 hdd 8.90999 osd.237 up 1.00000 1.00000 260 hdd 10.91399 osd.260 up 1.00000 1.00000 296 hdd 10.69229 osd.296 up 1.00000 1.00000 245 rbd_data 1.74599 osd.245 up 1.00000 1.00000 270 rbd_data 1.74599 osd.270 up 1.00000 1.00000 271 rbd_data 1.74599 osd.271 up 1.00000 1.00000 272 rbd_data 1.74599 osd.272 up 1.00000 1.00000 273 rbd_data 1.74599 osd.273 up 1.00000 1.00000 28 rbd_meta 0.36400 osd.28 up 1.00000 1.00000 44 rbd_meta 0.36400 osd.44 up 1.00000 1.00000 -64 64.70016 host ceph-21 0 hdd 8.90999 osd.0 up 1.00000 1.00000 2 hdd 8.90999 osd.2 up 0.95000 1.00000 72 hdd 8.90999 osd.72 up 1.00000 1.00000 76 hdd 8.90999 osd.76 up 1.00000 1.00000 86 hdd 8.90999 osd.86 up 1.00000 1.00000 291 hdd 10.69229 osd.291 up 1.00000 1.00000 246 rbd_data 1.74599 osd.246 up 1.00000 1.00000 247 rbd_data 1.74599 osd.247 up 1.00000 1.00000 264 rbd_data 1.74599 osd.264 up 1.00000 1.00000 274 rbd_data 1.74599 osd.274 up 1.00000 1.00000 278 rbd_data 1.74599 osd.278 up 1.00000 1.00000 39 rbd_meta 0.36400 osd.39 up 1.00000 1.00000 53 rbd_meta 0.36400 osd.53 up 1.00000 1.00000 -66 64.33617 host ceph-22 1 hdd 8.90999 osd.1 up 1.00000 1.00000 3 hdd 8.90999 osd.3 up 1.00000 1.00000 73 hdd 8.90999 osd.73 up 1.00000 1.00000 85 hdd 8.90999 osd.85 up 0.95000 1.00000 87 hdd 8.90999 osd.87 up 1.00000 1.00000 294 hdd 10.69229 osd.294 up 1.00000 1.00000 249 rbd_data 1.74599 osd.249 up 1.00000 1.00000 250 rbd_data 1.74599 osd.250 up 1.00000 1.00000 265 rbd_data 1.74599 osd.265 up 1.00000 1.00000 276 rbd_data 1.74599 osd.276 up 1.00000 1.00000 281 rbd_data 1.74599 osd.281 up 1.00000 1.00000 51 rbd_meta 0.36400 osd.51 up 1.00000 1.00000 # ceph osd crush rule dump # crush rules outside tree under "datacenter ServerRoom" removed for brevity [ { "rule_id": 0, "rule_name": "replicated_rule", "ruleset": 0, "type": 1, "min_size": 1, "max_size": 10, "steps": [ { "op": "take", "item": -1, "item_name": "default" }, { "op": 
"chooseleaf_firstn", "num": 0, "type": "host" }, { "op": "emit" } ] }, { "rule_id": 5, "rule_name": "sr-rbd-data-one", "ruleset": 5, "type": 3, "min_size": 3, "max_size": 8, "steps": [ { "op": "set_chooseleaf_tries", "num": 50 }, { "op": "set_choose_tries", "num": 1000 }, { "op": "take", "item": -185, "item_name": "ServerRoom~rbd_data" }, { "op": "chooseleaf_indep", "num": 0, "type": "host" }, { "op": "emit" } ] }, { "rule_id": 9, "rule_name": "sr-rbd-data-one-hdd", "ruleset": 9, "type": 3, "min_size": 3, "max_size": 8, "steps": [ { "op": "set_chooseleaf_tries", "num": 5 }, { "op": "set_choose_tries", "num": 100 }, { "op": "take", "item": -53, "item_name": "ServerRoom~hdd" }, { "op": "chooseleaf_indep", "num": 0, "type": "host" }, { "op": "emit" } ] } ] ================= Frank Schilder AIT Risø Campus Bygning 109, rum S14 ________________________________________ From: Eric Smith <Eric.Smith@xxxxxxxxxx> Sent: 03 August 2020 15:40 To: Frank Schilder; ceph-users Subject: RE: Ceph does not recover from OSD restart Can you post the output of these commands: ceph osd pool ls detail ceph osd tree ceph osd crush rule dump -----Original Message----- From: Frank Schilder <frans@xxxxxx> Sent: Monday, August 3, 2020 9:19 AM To: ceph-users <ceph-users@xxxxxxx> Subject: Re: Ceph does not recover from OSD restart After moving the newly added OSDs out of the crush tree and back in again, I get to exactly what I want to see: cluster: id: e4ece518-f2cb-4708-b00f-b6bf511e91d9 health: HEALTH_WARN norebalance,norecover flag(s) set 53030026/1492404361 objects misplaced (3.553%) 1 pools nearfull services: mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03 mgr: ceph-01(active), standbys: ceph-03, ceph-02 mds: con-fs2-1/1/1 up {0=ceph-08=up:active}, 1 up:standby-replay osd: 297 osds: 272 up, 272 in; 307 remapped pgs flags norebalance,norecover data: pools: 11 pools, 3215 pgs objects: 177.3 M objects, 489 TiB usage: 696 TiB used, 1.2 PiB / 1.9 PiB avail pgs: 53030026/1492404361 objects misplaced (3.553%) 2902 active+clean 299 active+remapped+backfill_wait 8 active+remapped+backfilling 5 active+clean+scrubbing+deep 1 active+clean+snaptrim io: client: 69 MiB/s rd, 117 MiB/s wr, 399 op/s rd, 856 op/s wr Why does a cluster with remapped PGs not survive OSD restarts without loosing track of objects? Why is it not finding the objects by itself? A power outage of 3 hosts will halt everything for no reason until manual intervention. How can I avoid this problem? Best regards, ================= Frank Schilder AIT Risø Campus Bygning 109, rum S14 ________________________________________ From: Frank Schilder <frans@xxxxxx> Sent: 03 August 2020 15:03:05 To: ceph-users Subject: Ceph does not recover from OSD restart Dear cephers, I have a serious issue with degraded objects after an OSD restart. The cluster was in a state of re-balancing after adding disks to each host. Before restart I had "X/Y objects misplaced". Apart from that, health was OK. 
I now restarted all OSDs of one host and the cluster does not recover from that:

  cluster:
    id:     xxx
    health: HEALTH_ERR
            45813194/1492348700 objects misplaced (3.070%)
            Degraded data redundancy: 6798138/1492348700 objects degraded (0.456%), 85 pgs degraded, 86 pgs undersized
            Degraded data redundancy (low space): 17 pgs backfill_toofull
            1 pools nearfull

  services:
    mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
    mgr: ceph-01(active), standbys: ceph-03, ceph-02
    mds: con-fs2-1/1/1 up {0=ceph-08=up:active}, 1 up:standby-replay
    osd: 297 osds: 272 up, 272 in; 307 remapped pgs

  data:
    pools:   11 pools, 3215 pgs
    objects: 177.3 M objects, 489 TiB
    usage:   696 TiB used, 1.2 PiB / 1.9 PiB avail
    pgs:     6798138/1492348700 objects degraded (0.456%)
             45813194/1492348700 objects misplaced (3.070%)
             2903 active+clean
             209  active+remapped+backfill_wait
             73   active+undersized+degraded+remapped+backfill_wait
             9    active+remapped+backfill_wait+backfill_toofull
             8    active+undersized+degraded+remapped+backfill_wait+backfill_toofull
             4    active+undersized+degraded+remapped+backfilling
             3    active+remapped+backfilling
             3    active+clean+scrubbing+deep
             1    active+clean+scrubbing
             1    active+undersized+remapped+backfilling
             1    active+clean+snaptrim

  io:
    client:   47 MiB/s rd, 61 MiB/s wr, 732 op/s rd, 792 op/s wr
    recovery: 195 MiB/s, 48 objects/s

After restarting there should only be a small number of degraded objects, namely the ones that received writes during the OSD restart. What I see, however, is that the cluster seems to have lost track of a huge number of objects; the 0.456% degraded correspond to 1-2 days' worth of I/O. I did reboots before and saw only a few thousand objects degraded at most.

The output of ceph health detail shows a lot of lines like these:

[root@gnosis ~]# ceph health detail
HEALTH_ERR 45804316/1492356704 objects misplaced (3.069%); Degraded data redundancy: 6792562/1492356704 objects degraded (0.455%), 85 pgs degraded, 86 pgs undersized; Degraded data redundancy (low space): 17 pgs backfill_toofull; 1 pools nearfull
OBJECT_MISPLACED 45804316/1492356704 objects misplaced (3.069%)
PG_DEGRADED Degraded data redundancy: 6792562/1492356704 objects degraded (0.455%), 85 pgs degraded, 86 pgs undersized
    pg 11.9 is stuck undersized for 815.188981, current state active+undersized+degraded+remapped+backfill_wait, last acting [60,148,2147483647,263,76,230,87,169]
    8...9
    pg 11.48 is active+undersized+degraded+remapped+backfill_wait, acting [159,60,180,263,237,3,2147483647,72]
    pg 11.4a is stuck undersized for 851.162862, current state active+undersized+degraded+remapped+backfill_wait, last acting [182,233,87,228,2,180,63,2147483647]
    [...]
    pg 11.22e is stuck undersized for 851.162402, current state active+undersized+degraded+remapped+backfill_wait+backfill_toofull, last acting [234,183,239,2147483647,170,229,1,86]
PG_DEGRADED_FULL Degraded data redundancy (low space): 17 pgs backfill_toofull
    pg 11.24 is active+undersized+degraded+remapped+backfill_wait+backfill_toofull, acting [230,259,2147483647,1,144,159,233,146]
    [...]
    pg 11.1d9 is active+remapped+backfill_wait+backfill_toofull, acting [84,259,183,170,85,234,233,2]
    pg 11.225 is active+undersized+degraded+remapped+backfill_wait+backfill_toofull, acting [236,183,1,2147483647,2147483647,169,229,230]
    pg 11.22e is active+undersized+degraded+remapped+backfill_wait+backfill_toofull, acting [234,183,239,2147483647,170,229,1,86]
POOL_NEAR_FULL 1 pools nearfull
    pool 'sr-rbd-data-one-hdd' has 164 TiB (max 200 TiB)

It looks like a lot of PGs are not receiving their complete crush map placement, as if the peering is incomplete.
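A way to dig further into one of these PGs, e.g. pg 11.9 from the output above; the commands are standard, and which fields to look at (up, acting, recovery_state) is a suggestion rather than something taken from this thread:

ceph pg 11.9 query | less
# the "up", "acting" and "recovery_state" sections show which OSDs the PG currently
# maps to and what state peering/recovery is in
ceph pg dump pgs_brief | grep undersized | wc -l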
This is a serious issue: it looks like the cluster would suffer a total loss of access to its storage if just 2 more hosts rebooted - without actually having lost any storage. The pool in question is a 6+2 EC pool.

What is going on here? Why are the PG maps not restored to their values from before the OSD reboot? The degraded PGs should simply receive the missing OSD IDs; everything is up exactly as it was before the reboot.

Thanks for your help and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx