You said you had to move some OSDs out and back in for Ceph to go back to normal (the OSDs you added). Which OSDs were added?

-----Original Message-----
From: Frank Schilder <frans@xxxxxx>
Sent: Monday, August 3, 2020 9:55 AM
To: Eric Smith <Eric.Smith@xxxxxxxxxx>; ceph-users <ceph-users@xxxxxxx>
Subject: Re: Ceph does not recover from OSD restart

Hi Eric,

thanks for your fast response. Below is the output, shortened a bit as indicated. Disks have been added to pool 11 'sr-rbd-data-one-hdd' only. This is the only pool with remapped PGs, and it is also the only pool experiencing the "loss of track" to objects. Every other pool recovers from a restart by itself.

Best regards,
Frank

# ceph osd pool stats
pool sr-rbd-meta-one id 1
  client io 5.3 KiB/s rd, 3.2 KiB/s wr, 4 op/s rd, 1 op/s wr

pool sr-rbd-data-one id 2
  client io 24 MiB/s rd, 32 MiB/s wr, 380 op/s rd, 594 op/s wr

pool sr-rbd-one-stretch id 3
  nothing is going on

pool con-rbd-meta-hpc-one id 7
  nothing is going on

pool con-rbd-data-hpc-one id 8
  client io 0 B/s rd, 5.6 KiB/s wr, 0 op/s rd, 0 op/s wr

pool sr-rbd-data-one-hdd id 11
  53241814/346903376 objects misplaced (15.348%)
  client io 73 MiB/s rd, 3.4 MiB/s wr, 236 op/s rd, 69 op/s wr

pool con-fs2-meta1 id 12
  client io 106 KiB/s rd, 112 KiB/s wr, 3 op/s rd, 11 op/s wr

pool con-fs2-meta2 id 13
  client io 0 B/s wr, 0 op/s rd, 0 op/s wr

pool con-fs2-data id 14
  client io 5.5 MiB/s rd, 201 KiB/s wr, 34 op/s rd, 8 op/s wr

pool con-fs2-data-ec-ssd id 17
  nothing is going on

pool ms-rbd-one id 18
  client io 5.6 MiB/s wr, 0 op/s rd, 179 op/s wr

# ceph osd pool ls detail
pool 1 'sr-rbd-meta-one' replicated size 3 min_size 2 crush_rule 11 object_hash rjenkins pg_num 80 pgp_num 80 last_change 122597 flags hashpspool,nodelete,selfmanaged_snaps max_bytes 536870912000 stripe_width 0 application rbd
        removed_snaps [1~45]
pool 2 'sr-rbd-data-one' erasure size 8 min_size 6 crush_rule 5 object_hash rjenkins pg_num 560 pgp_num 560 last_change 186437 lfor 0/126858 flags hashpspool,ec_overwrites,nodelete,selfmanaged_snaps max_bytes 43980465111040 stripe_width 24576 fast_read 1 compression_mode aggressive application rbd
        removed_snaps [1~3,5~2, ... huge list ... ,11f9d~1,11fa0~2]
pool 3 'sr-rbd-one-stretch' replicated size 3 min_size 2 crush_rule 12 object_hash rjenkins pg_num 160 pgp_num 160 last_change 143202 lfor 0/79983 flags hashpspool,nodelete,selfmanaged_snaps max_bytes 1099511627776 stripe_width 0 compression_mode aggressive application rbd
        removed_snaps [1~7,b~2,11~2,14~2,17~9e,b8~1e]
pool 7 'con-rbd-meta-hpc-one' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 50 pgp_num 50 last_change 96357 lfor 0/90462 flags hashpspool,nodelete,selfmanaged_snaps max_bytes 10737418240 stripe_width 0 application rbd
        removed_snaps [1~3]
pool 8 'con-rbd-data-hpc-one' erasure size 10 min_size 9 crush_rule 7 object_hash rjenkins pg_num 150 pgp_num 150 last_change 96358 lfor 0/90996 flags hashpspool,ec_overwrites,nodelete,selfmanaged_snaps max_bytes 5497558138880 stripe_width 32768 fast_read 1 compression_mode aggressive application rbd
        removed_snaps [1~7,9~2]
pool 11 'sr-rbd-data-one-hdd' erasure size 8 min_size 6 crush_rule 9 object_hash rjenkins pg_num 560 pgp_num 560 last_change 186331 lfor 0/127768 flags hashpspool,ec_overwrites,nodelete,selfmanaged_snaps max_bytes 219902325555200 stripe_width 24576 fast_read 1 compression_mode aggressive application rbd
        removed_snaps [1~59f,5a2~fe, ... less huge list ... ,2559~1,255b~1]
        removed_snaps_queue [1a64~5,1a6a~1,1a6c~1, ... long list ... ,220a~1,220c~1]
pool 12 'con-fs2-meta1' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 50 pgp_num 50 last_change 57096 flags hashpspool,nodelete max_bytes 268435456000 stripe_width 0 application cephfs
pool 13 'con-fs2-meta2' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 50 pgp_num 50 last_change 96359 flags hashpspool,nodelete max_bytes 107374182400 stripe_width 0 application cephfs
pool 14 'con-fs2-data' erasure size 10 min_size 9 crush_rule 8 object_hash rjenkins pg_num 1350 pgp_num 1350 last_change 96360 lfor 0/91144 flags hashpspool,ec_overwrites,nodelete max_bytes 879609302220800 stripe_width 32768 fast_read 1 compression_mode aggressive application cephfs
pool 17 'con-fs2-data-ec-ssd' erasure size 10 min_size 9 crush_rule 10 object_hash rjenkins pg_num 55 pgp_num 55 last_change 96361 lfor 0/90473 flags hashpspool,ec_overwrites,nodelete max_bytes 1099511627776 stripe_width 32768 fast_read 1 compression_mode aggressive application cephfs
pool 18 'ms-rbd-one' replicated size 3 min_size 2 crush_rule 12 object_hash rjenkins pg_num 150 pgp_num 150 last_change 143206 flags hashpspool,nodelete,selfmanaged_snaps max_bytes 1099511627776 stripe_width 0 compression_mode aggressive application rbd
        removed_snaps [1~3]

# ceph osd tree
ID   CLASS     WEIGHT      TYPE NAME                  STATUS  REWEIGHT  PRI-AFF
-40            2384.09058  root DTU
-42                     0      region Lyngby
-41            2384.09058      region Risoe
# 2 sub-trees on level datacenter removed for brevity
-49             586.49347          datacenter ServerRoom
-55             586.49347              room SR-113
-65              64.33617                  host ceph-04
 84  hdd          8.90999                      osd.84   up     1.00000   1.00000
145  hdd          8.90999                      osd.145  up     1.00000   1.00000
156  hdd          8.90999                      osd.156  up     1.00000   1.00000
168  hdd          8.90999                      osd.168  up     1.00000   1.00000
181  hdd          8.90999                      osd.181  up     0.95000   1.00000
288  hdd         10.69229                      osd.288  up     1.00000   1.00000
243  rbd_data     1.74599                      osd.243  up     1.00000   1.00000
254  rbd_data     1.74599                      osd.254  up     1.00000   1.00000
256  rbd_data     1.74599                      osd.256  up     1.00000   1.00000
286  rbd_data     1.74599                      osd.286  up     1.00000   1.00000
287  rbd_data     1.74599                      osd.287  up     1.00000   1.00000
 48  rbd_meta     0.36400                      osd.48   up     1.00000   1.00000
-67              64.33617                  host ceph-05
 74  hdd          8.90999                      osd.74   up     1.00000   1.00000
144  hdd          8.90999                      osd.144  up     1.00000   1.00000
157  hdd          8.90999                      osd.157  up     0.84999   1.00000
169  hdd          8.90999                      osd.169  up     0.95000   1.00000
180  hdd          8.90999                      osd.180  up     0.89999   1.00000
289  hdd         10.69229                      osd.289  up     1.00000   1.00000
240  rbd_data     1.74599                      osd.240  up     1.00000   1.00000
251  rbd_data     1.74599                      osd.251  up     1.00000   1.00000
255  rbd_data     1.74599                      osd.255  up     1.00000   1.00000
284  rbd_data     1.74599                      osd.284  up     1.00000   1.00000
285  rbd_data     1.74599                      osd.285  up     1.00000   1.00000
 49  rbd_meta     0.36400                      osd.49   up     1.00000   1.00000
-69              64.70016                  host ceph-06
 60  hdd          8.90999                      osd.60   up     1.00000   1.00000
146  hdd          8.90999                      osd.146  up     1.00000   1.00000
158  hdd          8.90999                      osd.158  up     0.95000   1.00000
170  hdd          8.90999                      osd.170  up     0.89999   1.00000
182  hdd          8.90999                      osd.182  up     1.00000   1.00000
290  hdd         10.69229                      osd.290  up     1.00000   1.00000
244  rbd_data     1.74599                      osd.244  up     1.00000   1.00000
253  rbd_data     1.74599                      osd.253  up     1.00000   1.00000
257  rbd_data     1.74599                      osd.257  up     1.00000   1.00000
282  rbd_data     1.74599                      osd.282  up     1.00000   1.00000
283  rbd_data     1.74599                      osd.283  up     1.00000   1.00000
 40  rbd_meta     0.36400                      osd.40   up     1.00000   1.00000
 50  rbd_meta     0.36400                      osd.50   up     1.00000   1.00000
-71              64.33617                  host ceph-07
 63  hdd          8.90999                      osd.63   up     1.00000   1.00000
148  hdd          8.90999                      osd.148  up     0.95000   1.00000
159  hdd          8.90999                      osd.159  up     1.00000   1.00000
172  hdd          8.90999                      osd.172  up     0.95000   1.00000
183  hdd          8.90999                      osd.183  up     0.84999   1.00000
292  hdd         10.69229                      osd.292  up     1.00000   1.00000
242  rbd_data     1.74599                      osd.242  up     1.00000   1.00000
252  rbd_data     1.74599                      osd.252  up     1.00000   1.00000
258  rbd_data     1.74599                      osd.258  up     1.00000   1.00000
279  rbd_data     1.74599                      osd.279  up     1.00000   1.00000
280  rbd_data     1.74599                      osd.280  up     1.00000   1.00000
 52  rbd_meta     0.36400                      osd.52   up     1.00000   1.00000
-81              66.70416                  host ceph-18
229  hdd          8.90999                      osd.229  up     1.00000   1.00000
232  hdd          8.90999                      osd.232  up     1.00000   1.00000
235  hdd          8.90999                      osd.235  up     1.00000   1.00000
238  hdd          8.90999                      osd.238  up     0.95000   1.00000
259  hdd         10.91399                      osd.259  up     1.00000   1.00000
293  hdd         10.69229                      osd.293  up     1.00000   1.00000
241  rbd_data     1.74599                      osd.241  up     1.00000   1.00000
248  rbd_data     1.74599                      osd.248  up     1.00000   1.00000
266  rbd_data     1.74599                      osd.266  up     1.00000   1.00000
267  rbd_data     1.74599                      osd.267  up     1.00000   1.00000
277  rbd_data     1.74599                      osd.277  up     1.00000   1.00000
 31  rbd_meta     0.36400                      osd.31   up     1.00000   1.00000
 41  rbd_meta     0.36400                      osd.41   up     1.00000   1.00000
-94              66.34016                  host ceph-19
231  hdd          8.90999                      osd.231  up     1.00000   1.00000
233  hdd          8.90999                      osd.233  up     0.95000   1.00000
236  hdd          8.90999                      osd.236  up     1.00000   1.00000
239  hdd          8.90999                      osd.239  up     1.00000   1.00000
263  hdd         10.91399                      osd.263  up     1.00000   1.00000
295  hdd         10.69229                      osd.295  up     1.00000   1.00000
261  rbd_data     1.74599                      osd.261  up     1.00000   1.00000
262  rbd_data     1.74599                      osd.262  up     1.00000   1.00000
268  rbd_data     1.74599                      osd.268  up     1.00000   1.00000
269  rbd_data     1.74599                      osd.269  up     1.00000   1.00000
275  rbd_data     1.74599                      osd.275  up     1.00000   1.00000
 43  rbd_meta     0.36400                      osd.43   up     1.00000   1.00000
 -4              66.70416                  host ceph-20
228  hdd          8.90999                      osd.228  up     1.00000   1.00000
230  hdd          8.90999                      osd.230  up     1.00000   1.00000
234  hdd          8.90999                      osd.234  up     0.95000   1.00000
237  hdd          8.90999                      osd.237  up     1.00000   1.00000
260  hdd         10.91399                      osd.260  up     1.00000   1.00000
296  hdd         10.69229                      osd.296  up     1.00000   1.00000
245  rbd_data     1.74599                      osd.245  up     1.00000   1.00000
270  rbd_data     1.74599                      osd.270  up     1.00000   1.00000
271  rbd_data     1.74599                      osd.271  up     1.00000   1.00000
272  rbd_data     1.74599                      osd.272  up     1.00000   1.00000
273  rbd_data     1.74599                      osd.273  up     1.00000   1.00000
 28  rbd_meta     0.36400                      osd.28   up     1.00000   1.00000
 44  rbd_meta     0.36400                      osd.44   up     1.00000   1.00000
-64              64.70016                  host ceph-21
  0  hdd          8.90999                      osd.0    up     1.00000   1.00000
  2  hdd          8.90999                      osd.2    up     0.95000   1.00000
 72  hdd          8.90999                      osd.72   up     1.00000   1.00000
 76  hdd          8.90999                      osd.76   up     1.00000   1.00000
 86  hdd          8.90999                      osd.86   up     1.00000   1.00000
291  hdd         10.69229                      osd.291  up     1.00000   1.00000
246  rbd_data     1.74599                      osd.246  up     1.00000   1.00000
247  rbd_data     1.74599                      osd.247  up     1.00000   1.00000
264  rbd_data     1.74599                      osd.264  up     1.00000   1.00000
274  rbd_data     1.74599                      osd.274  up     1.00000   1.00000
278  rbd_data     1.74599                      osd.278  up     1.00000   1.00000
 39  rbd_meta     0.36400                      osd.39   up     1.00000   1.00000
 53  rbd_meta     0.36400                      osd.53   up     1.00000   1.00000
-66              64.33617                  host ceph-22
  1  hdd          8.90999                      osd.1    up     1.00000   1.00000
  3  hdd          8.90999                      osd.3    up     1.00000   1.00000
 73  hdd          8.90999                      osd.73   up     1.00000   1.00000
 85  hdd          8.90999                      osd.85   up     0.95000   1.00000
 87  hdd          8.90999                      osd.87   up     1.00000   1.00000
294  hdd         10.69229                      osd.294  up     1.00000   1.00000
249  rbd_data     1.74599                      osd.249  up     1.00000   1.00000
250  rbd_data     1.74599                      osd.250  up     1.00000   1.00000
265  rbd_data     1.74599                      osd.265  up     1.00000   1.00000
276  rbd_data     1.74599                      osd.276  up     1.00000   1.00000
281  rbd_data     1.74599                      osd.281  up     1.00000   1.00000
 51  rbd_meta     0.36400                      osd.51   up     1.00000   1.00000

# ceph osd crush rule dump
# crush rules outside tree under "datacenter ServerRoom" removed for brevity
[
    {
        "rule_id": 0,
        "rule_name": "replicated_rule",
        "ruleset": 0,
        "type": 1,
        "min_size": 1,
        "max_size": 10,
        "steps": [
            { "op": "take", "item": -1, "item_name": "default" },
            { "op": "chooseleaf_firstn", "num": 0, "type": "host" },
            { "op": "emit" }
        ]
    },
    {
        "rule_id": 5,
        "rule_name": "sr-rbd-data-one",
        "ruleset": 5,
        "type": 3,
        "min_size": 3,
        "max_size": 8,
        "steps": [
            { "op": "set_chooseleaf_tries", "num": 50 },
            { "op": "set_choose_tries", "num": 1000 },
            { "op": "take", "item": -185, "item_name": "ServerRoom~rbd_data" },
            { "op": "chooseleaf_indep", "num": 0, "type": "host" },
            { "op": "emit" }
        ]
    },
    {
        "rule_id": 9,
        "rule_name": "sr-rbd-data-one-hdd",
        "ruleset": 9,
        "type": 3,
        "min_size": 3,
        "max_size": 8,
        "steps": [
            { "op": "set_chooseleaf_tries", "num": 5 },
            { "op": "set_choose_tries", "num": 100 },
            { "op": "take", "item": -53, "item_name": "ServerRoom~hdd" },
            { "op": "chooseleaf_indep", "num": 0, "type": "host" },
            { "op": "emit" }
        ]
    }
]

=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Eric Smith <Eric.Smith@xxxxxxxxxx>
Sent: 03 August 2020 15:40
To: Frank Schilder; ceph-users
Subject: RE: Ceph does not recover from OSD restart

Can you post the output of these commands:

ceph osd pool ls detail
ceph osd tree
ceph osd crush rule dump

-----Original Message-----
From: Frank Schilder <frans@xxxxxx>
Sent: Monday, August 3, 2020 9:19 AM
To: ceph-users <ceph-users@xxxxxxx>
Subject: Re: Ceph does not recover from OSD restart

After moving the newly added OSDs out of the crush tree and back in again, I get exactly what I want to see:

  cluster:
    id:     e4ece518-f2cb-4708-b00f-b6bf511e91d9
    health: HEALTH_WARN
            norebalance,norecover flag(s) set
            53030026/1492404361 objects misplaced (3.553%)
            1 pools nearfull

  services:
    mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
    mgr: ceph-01(active), standbys: ceph-03, ceph-02
    mds: con-fs2-1/1/1 up {0=ceph-08=up:active}, 1 up:standby-replay
    osd: 297 osds: 272 up, 272 in; 307 remapped pgs
         flags norebalance,norecover

  data:
    pools:   11 pools, 3215 pgs
    objects: 177.3 M objects, 489 TiB
    usage:   696 TiB used, 1.2 PiB / 1.9 PiB avail
    pgs:     53030026/1492404361 objects misplaced (3.553%)
             2902 active+clean
             299  active+remapped+backfill_wait
             8    active+remapped+backfilling
             5    active+clean+scrubbing+deep
             1    active+clean+snaptrim

  io:
    client:   69 MiB/s rd, 117 MiB/s wr, 399 op/s rd, 856 op/s wr

Why does a cluster with remapped PGs not survive OSD restarts without losing track of objects? Why does it not find the objects by itself? A power outage of 3 hosts would halt everything for no reason until manual intervention. How can I avoid this problem?

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Frank Schilder <frans@xxxxxx>
Sent: 03 August 2020 15:03:05
To: ceph-users
Subject: Ceph does not recover from OSD restart

Dear cephers,

I have a serious issue with degraded objects after an OSD restart. The cluster was in a state of re-balancing after adding disks to each host.
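[Editor's note: the "move out of the crush tree and back in" workaround above can be emulated by zeroing and restoring the crush weights of the new OSDs, which forces the affected PGs to repeer. The exact mechanism Frank used is not stated in the thread; the OSD IDs below (the 10.69229-weight disks from the osd tree) and the dry-run style are assumptions. The sketch only prints the commands so they can be reviewed before running.]

```shell
#!/bin/sh
# Dry-run sketch: print the commands that take the newly added OSDs out of
# the crush tree (weight 0) and put them back at their original weight.
# OSD IDs and weight are taken from the "ceph osd tree" output above.
OSDS="288 289 290 291 292 293 294 295 296"
WEIGHT=10.69229

echo "ceph osd set norebalance"          # avoid data movement during the dance
echo "ceph osd set norecover"
for id in $OSDS; do
    echo "ceph osd crush reweight osd.$id 0"        # take OSD out of the tree
done
for id in $OSDS; do
    echo "ceph osd crush reweight osd.$id $WEIGHT"  # and put it back in
done
echo "ceph osd unset norebalance"
echo "ceph osd unset norecover"
```

Remove the `echo`s only after checking the printed commands against the cluster at hand.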
Before the restart I had "X/Y objects misplaced". Apart from that, health was OK. I now restarted all OSDs of one host and the cluster does not recover from that:

  cluster:
    id:     xxx
    health: HEALTH_ERR
            45813194/1492348700 objects misplaced (3.070%)
            Degraded data redundancy: 6798138/1492348700 objects degraded (0.456%), 85 pgs degraded, 86 pgs undersized
            Degraded data redundancy (low space): 17 pgs backfill_toofull
            1 pools nearfull

  services:
    mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
    mgr: ceph-01(active), standbys: ceph-03, ceph-02
    mds: con-fs2-1/1/1 up {0=ceph-08=up:active}, 1 up:standby-replay
    osd: 297 osds: 272 up, 272 in; 307 remapped pgs

  data:
    pools:   11 pools, 3215 pgs
    objects: 177.3 M objects, 489 TiB
    usage:   696 TiB used, 1.2 PiB / 1.9 PiB avail
    pgs:     6798138/1492348700 objects degraded (0.456%)
             45813194/1492348700 objects misplaced (3.070%)
             2903 active+clean
             209  active+remapped+backfill_wait
             73   active+undersized+degraded+remapped+backfill_wait
             9    active+remapped+backfill_wait+backfill_toofull
             8    active+undersized+degraded+remapped+backfill_wait+backfill_toofull
             4    active+undersized+degraded+remapped+backfilling
             3    active+remapped+backfilling
             3    active+clean+scrubbing+deep
             1    active+clean+scrubbing
             1    active+undersized+remapped+backfilling
             1    active+clean+snaptrim

  io:
    client:   47 MiB/s rd, 61 MiB/s wr, 732 op/s rd, 792 op/s wr
    recovery: 195 MiB/s, 48 objects/s

After restarting there should only be a small number of degraded objects: the ones that received writes while the OSDs were down. What I see instead is that the cluster seems to have lost track of a huge number of objects; the 0.456% degraded amount to 1-2 days' worth of I/O. I did reboots before and saw only a few thousand objects degraded at most.
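[Editor's note: the percentages in the status output above follow directly from the raw object-copy counts it reports. A quick sanity check, assuming ceph rounds to three decimals when printing:]

```python
# Recompute the degraded/misplaced percentages from the "ceph -s" counts above.
def pct(part: int, total: int) -> float:
    """Fraction of `part` in `total` as a percentage, rounded to 3 decimals."""
    return round(part / total * 100, 3)

degraded  = 6_798_138        # objects degraded, from the health output
misplaced = 45_813_194       # objects misplaced
total     = 1_492_348_700    # total object copies across all pools

print(pct(degraded, total))   # -> 0.456, matching "(0.456%)"
print(pct(misplaced, total))  # -> 3.07, matching "(3.070%)"
```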
The output of ceph health detail shows a lot of lines like these:

[root@gnosis ~]# ceph health detail
HEALTH_ERR 45804316/1492356704 objects misplaced (3.069%); Degraded data redundancy: 6792562/1492356704 objects degraded (0.455%), 85 pgs degraded, 86 pgs undersized; Degraded data redundancy (low space): 17 pgs backfill_toofull; 1 pools nearfull
OBJECT_MISPLACED 45804316/1492356704 objects misplaced (3.069%)
PG_DEGRADED Degraded data redundancy: 6792562/1492356704 objects degraded (0.455%), 85 pgs degraded, 86 pgs undersized
    pg 11.9 is stuck undersized for 815.188981, current state active+undersized+degraded+remapped+backfill_wait, last acting [60,148,2147483647,263,76,230,87,169]
    [...]
    pg 11.48 is active+undersized+degraded+remapped+backfill_wait, acting [159,60,180,263,237,3,2147483647,72]
    pg 11.4a is stuck undersized for 851.162862, current state active+undersized+degraded+remapped+backfill_wait, last acting [182,233,87,228,2,180,63,2147483647]
    [...]
    pg 11.22e is stuck undersized for 851.162402, current state active+undersized+degraded+remapped+backfill_wait+backfill_toofull, last acting [234,183,239,2147483647,170,229,1,86]
PG_DEGRADED_FULL Degraded data redundancy (low space): 17 pgs backfill_toofull
    pg 11.24 is active+undersized+degraded+remapped+backfill_wait+backfill_toofull, acting [230,259,2147483647,1,144,159,233,146]
    [...]
    pg 11.1d9 is active+remapped+backfill_wait+backfill_toofull, acting [84,259,183,170,85,234,233,2]
    pg 11.225 is active+undersized+degraded+remapped+backfill_wait+backfill_toofull, acting [236,183,1,2147483647,2147483647,169,229,230]
    pg 11.22e is active+undersized+degraded+remapped+backfill_wait+backfill_toofull, acting [234,183,239,2147483647,170,229,1,86]
POOL_NEAR_FULL 1 pools nearfull
    pool 'sr-rbd-data-one-hdd' has 164 TiB (max 200 TiB)

It looks like a lot of PGs are not receiving their complete crush map placement, as if peering is incomplete.
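[Editor's note: the 2147483647 entries in the acting sets above are CRUSH_ITEM_NONE (2^31 - 1), the sentinel ceph prints when CRUSH mapped no OSD to that EC shard slot; each such hole is one missing shard in the 6+2 pool. A small sketch (hypothetical helper, not part of ceph) that flags these holes:]

```python
# CRUSH_ITEM_NONE is the sentinel ceph prints for an unmapped acting-set slot.
CRUSH_ITEM_NONE = 2147483647  # 2**31 - 1

def missing_shards(acting):
    """Return the EC shard indices that have no OSD assigned in an acting set."""
    return [i for i, osd in enumerate(acting) if osd == CRUSH_ITEM_NONE]

# Acting set of pg 11.9 from the health output above (6+2 EC pool, 8 slots):
print(missing_shards([60, 148, 2147483647, 263, 76, 230, 87, 169]))  # -> [2]
# pg 11.225 is missing two shards, so it is one more failure from min_size:
print(missing_shards([236, 183, 1, 2147483647, 2147483647, 169, 229, 230]))  # -> [3, 4]
```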
This is a serious issue: it looks like the cluster would see a total storage loss if just 2 more hosts reboot, without actually having lost any storage. The pool in question is a 6+2 EC pool.

What is going on here? Why are the PG maps not restored to their values from before the OSD reboot? The degraded PGs should simply receive the missing OSD IDs; everything is up exactly as it was before the reboot.

Thanks for your help and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx