After moving the newly added OSDs out of the crush tree and back in again, I get to exactly what I want to see:

  cluster:
    id:     e4ece518-f2cb-4708-b00f-b6bf511e91d9
    health: HEALTH_WARN
            norebalance,norecover flag(s) set
            53030026/1492404361 objects misplaced (3.553%)
            1 pools nearfull

  services:
    mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
    mgr: ceph-01(active), standbys: ceph-03, ceph-02
    mds: con-fs2-1/1/1 up {0=ceph-08=up:active}, 1 up:standby-replay
    osd: 297 osds: 272 up, 272 in; 307 remapped pgs
         flags norebalance,norecover

  data:
    pools:   11 pools, 3215 pgs
    objects: 177.3 M objects, 489 TiB
    usage:   696 TiB used, 1.2 PiB / 1.9 PiB avail
    pgs:     53030026/1492404361 objects misplaced (3.553%)
             2902 active+clean
             299  active+remapped+backfill_wait
             8    active+remapped+backfilling
             5    active+clean+scrubbing+deep
             1    active+clean+snaptrim

  io:
    client:   69 MiB/s rd, 117 MiB/s wr, 399 op/s rd, 856 op/s wr

Why does a cluster with remapped PGs not survive OSD restarts without losing track of objects? Why is it not finding the objects by itself?

A power outage of 3 hosts will halt everything for no reason until manual intervention. How can I avoid this problem?

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Frank Schilder <frans@xxxxxx>
Sent: 03 August 2020 15:03:05
To: ceph-users
Subject: Ceph does not recover from OSD restart

Dear cephers,

I have a serious issue with degraded objects after an OSD restart. The cluster was in a state of re-balancing after adding disks to each host. Before restart I had "X/Y objects misplaced". Apart from that, health was OK.
I now restarted all OSDs of one host and the cluster does not recover from that:

  cluster:
    id:     xxx
    health: HEALTH_ERR
            45813194/1492348700 objects misplaced (3.070%)
            Degraded data redundancy: 6798138/1492348700 objects degraded (0.456%), 85 pgs degraded, 86 pgs undersized
            Degraded data redundancy (low space): 17 pgs backfill_toofull
            1 pools nearfull

  services:
    mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
    mgr: ceph-01(active), standbys: ceph-03, ceph-02
    mds: con-fs2-1/1/1 up {0=ceph-08=up:active}, 1 up:standby-replay
    osd: 297 osds: 272 up, 272 in; 307 remapped pgs

  data:
    pools:   11 pools, 3215 pgs
    objects: 177.3 M objects, 489 TiB
    usage:   696 TiB used, 1.2 PiB / 1.9 PiB avail
    pgs:     6798138/1492348700 objects degraded (0.456%)
             45813194/1492348700 objects misplaced (3.070%)
             2903 active+clean
             209  active+remapped+backfill_wait
             73   active+undersized+degraded+remapped+backfill_wait
             9    active+remapped+backfill_wait+backfill_toofull
             8    active+undersized+degraded+remapped+backfill_wait+backfill_toofull
             4    active+undersized+degraded+remapped+backfilling
             3    active+remapped+backfilling
             3    active+clean+scrubbing+deep
             1    active+clean+scrubbing
             1    active+undersized+remapped+backfilling
             1    active+clean+snaptrim

  io:
    client:   47 MiB/s rd, 61 MiB/s wr, 732 op/s rd, 792 op/s wr
    recovery: 195 MiB/s, 48 objects/s

After restarting there should only be a small number of degraded objects, namely the ones that received writes during the OSD restart. What I see, however, is that the cluster seems to have lost track of a huge number of objects; the 0.456% degraded correspond to 1-2 days' worth of I/O. I did reboots before and saw only a few thousand objects degraded at most.
The output of ceph health detail shows a lot of lines like these:

[root@gnosis ~]# ceph health detail
HEALTH_ERR 45804316/1492356704 objects misplaced (3.069%); Degraded data redundancy: 6792562/1492356704 objects degraded (0.455%), 85 pgs degraded, 86 pgs undersized; Degraded data redundancy (low space): 17 pgs backfill_toofull; 1 pools nearfull
OBJECT_MISPLACED 45804316/1492356704 objects misplaced (3.069%)
PG_DEGRADED Degraded data redundancy: 6792562/1492356704 objects degraded (0.455%), 85 pgs degraded, 86 pgs undersized
    pg 11.9 is stuck undersized for 815.188981, current state active+undersized+degraded+remapped+backfill_wait, last acting [60,148,2147483647,263,76,230,87,169]
    8...9
    pg 11.48 is active+undersized+degraded+remapped+backfill_wait, acting [159,60,180,263,237,3,2147483647,72]
    pg 11.4a is stuck undersized for 851.162862, current state active+undersized+degraded+remapped+backfill_wait, last acting [182,233,87,228,2,180,63,2147483647]
    [...]
    pg 11.22e is stuck undersized for 851.162402, current state active+undersized+degraded+remapped+backfill_wait+backfill_toofull, last acting [234,183,239,2147483647,170,229,1,86]
PG_DEGRADED_FULL Degraded data redundancy (low space): 17 pgs backfill_toofull
    pg 11.24 is active+undersized+degraded+remapped+backfill_wait+backfill_toofull, acting [230,259,2147483647,1,144,159,233,146]
    [...]
    pg 11.1d9 is active+remapped+backfill_wait+backfill_toofull, acting [84,259,183,170,85,234,233,2]
    pg 11.225 is active+undersized+degraded+remapped+backfill_wait+backfill_toofull, acting [236,183,1,2147483647,2147483647,169,229,230]
    pg 11.22e is active+undersized+degraded+remapped+backfill_wait+backfill_toofull, acting [234,183,239,2147483647,170,229,1,86]
POOL_NEAR_FULL 1 pools nearfull
    pool 'sr-rbd-data-one-hdd' has 164 TiB (max 200 TiB)

It looks like a lot of PGs are not receiving their complete crush map placement, as if the peering is incomplete.
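For reference, the value 2147483647 in the acting sets above is 0x7fffffff, which Ceph uses as CRUSH_ITEM_NONE, meaning no OSD is assigned to that shard slot. To see how many slots each PG is missing, the health detail text can be post-processed. The following is only a rough sketch in Python, assuming the line format shown above; the names missing_shards, at_risk and SAMPLE are made up for illustration and are not part of any Ceph tooling.

```python
import re

# Placeholder printed for an acting-set slot that has no OSD (0x7fffffff).
CRUSH_ITEM_NONE = 2147483647

# Matches "pg <pgid> ... acting [a,b,c]" and "... last acting [a,b,c]".
PG_LINE = re.compile(r"\bpg (\S+) .*acting \[([\d,]+)\]")


def missing_shards(health_detail):
    """Map each PG id to the number of empty slots in its acting set."""
    result = {}
    for line in health_detail.splitlines():
        match = PG_LINE.search(line)
        if not match:
            continue
        pgid = match.group(1)
        acting = [int(x) for x in match.group(2).split(",")]
        holes = acting.count(CRUSH_ITEM_NONE)
        # A PG may appear under several health checks; keep the largest count.
        result[pgid] = max(holes, result.get(pgid, 0))
    return result


def at_risk(holes_by_pg, m):
    """Return PGs whose missing slots already use up all m coding chunks."""
    return sorted(pg for pg, holes in holes_by_pg.items() if holes >= m)


# Three lines copied from the health detail output above.
SAMPLE = """\
    pg 11.9 is stuck undersized for 815.188981, current state active+undersized+degraded+remapped+backfill_wait, last acting [60,148,2147483647,263,76,230,87,169]
    pg 11.1d9 is active+remapped+backfill_wait+backfill_toofull, acting [84,259,183,170,85,234,233,2]
    pg 11.225 is active+undersized+degraded+remapped+backfill_wait+backfill_toofull, acting [236,183,1,2147483647,2147483647,169,229,230]
"""

if __name__ == "__main__":
    holes = missing_shards(SAMPLE)
    for pgid, count in holes.items():
        print(pgid, count)
    # m=2 is the number of coding chunks in the 6+2 EC pool in question.
    print("no redundancy left:", at_risk(holes, m=2))
```

On a live cluster the input would be the full text of ceph health detail instead of SAMPLE. With m=2, any PG that already shows two empty slots (such as pg 11.225 above) has no spare shard left in its acting set.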
This is a serious issue: it looks like the cluster will see a total storage loss if just 2 more hosts reboot, without actually having lost any storage. The pool in question is a 6+2 EC pool.

What is going on here? Why are the PG-maps not restored to their values from before the OSD reboot? The degraded PGs should receive the missing OSD IDs; everything is up exactly as it was before the reboot.

Thanks for your help and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx