Hi all,

one problem solved, another coming up. For everyone ending up in the same situation, the trick seems to be to get all OSDs marked up and then allow recovery. Steps to take:

- set noout, nodown, norebalance, norecover
- wait patiently until all OSDs are shown as up
- unset norebalance, norecover
- wait wait wait, PGs will eventually become active as OSDs become responsive
- unset nodown, noout

Now the new problem. I now have an ever-growing list of OSDs listed as rebalancing, but nothing is actually rebalancing. How can I stop this growth, and how can I get rid of this list?

[root@gnosis ~]# ceph status
  cluster:
    id:     XXX
    health: HEALTH_WARN
            noout flag(s) set
            Slow OSD heartbeats on back (longest 634775.858ms)
            Slow OSD heartbeats on front (longest 635210.412ms)
            1 pools nearfull

  services:
    mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 (age 6m)
    mgr: ceph-25(active, since 57m), standbys: ceph-26, ceph-01, ceph-02, ceph-03
    mds: con-fs2:8 4 up:standby 8 up:active
    osd: 1260 osds: 1258 up (since 24m), 1258 in (since 45m)
         flags noout

  data:
    pools:   14 pools, 25065 pgs
    objects: 1.97G objects, 3.5 PiB
    usage:   4.4 PiB used, 8.7 PiB / 13 PiB avail
    pgs:     25028 active+clean
             30    active+clean+scrubbing+deep
             7     active+clean+scrubbing

  io:
    client:   1.3 GiB/s rd, 718 MiB/s wr, 7.71k op/s rd, 2.54k op/s wr

  progress:
    Rebalancing after osd.135 marked in (1s) [=====================.......]
    Rebalancing after osd.69 marked in (2s) [========================....]
    Rebalancing after osd.75 marked in (2s) [=======================.....]
    Rebalancing after osd.173 marked in (2s) [========================....]
    Rebalancing after osd.42 marked in (1s) [=============...............] (remaining: 2s)
    Rebalancing after osd.104 marked in (2s) [========================....]
    Rebalancing after osd.82 marked in (2s) [========================....]
    Rebalancing after osd.107 marked in (2s) [=======================.....]
    Rebalancing after osd.19 marked in (2s) [=======================.....]
    Rebalancing after osd.67 marked in (2s) [=====================.......]
    Rebalancing after osd.46 marked in (2s) [===================.........] (remaining: 1s)
    Rebalancing after osd.123 marked in (2s) [=======================.....]
    Rebalancing after osd.66 marked in (2s) [====================........]
    Rebalancing after osd.12 marked in (2s) [==============..............] (remaining: 2s)
    Rebalancing after osd.95 marked in (2s) [=====================.......]
    Rebalancing after osd.134 marked in (2s) [=======================.....]
    Rebalancing after osd.14 marked in (1s) [===================.........]
    Rebalancing after osd.56 marked in (2s) [=====================.......]
    Rebalancing after osd.143 marked in (1s) [========================....]
    Rebalancing after osd.118 marked in (2s) [=======================.....]
    Rebalancing after osd.96 marked in (2s) [========================....]
    Rebalancing after osd.105 marked in (2s) [=======================.....]
    Rebalancing after osd.44 marked in (1s) [=======.....................] (remaining: 5s)
    Rebalancing after osd.41 marked in (1s) [==============..............] (remaining: 1s)
    Rebalancing after osd.9 marked in (2s) [=...........................] (remaining: 37s)
    Rebalancing after osd.58 marked in (2s) [======......................] (remaining: 8s)
    Rebalancing after osd.140 marked in (1s) [=======================.....]
    Rebalancing after osd.132 marked in (2s) [========================....]
    Rebalancing after osd.31 marked in (1s) [=========================...]
    Rebalancing after osd.110 marked in (2s) [========================....]
    Rebalancing after osd.21 marked in (2s) [=========================...]
    Rebalancing after osd.114 marked in (2s) [=======================.....]
    Rebalancing after osd.83 marked in (2s) [=======================.....]
    Rebalancing after osd.23 marked in (1s) [=======================.....]
    Rebalancing after osd.25 marked in (1s) [==========================..]
    Rebalancing after osd.147 marked in (2s) [========================....]
    Rebalancing after osd.62 marked in (1s) [======================......]
    Rebalancing after osd.57 marked in (2s) [======================......]
    Rebalancing after osd.61 marked in (2s) [====================........]
    Rebalancing after osd.71 marked in (2s) [===================.........]
    Rebalancing after osd.80 marked in (2s) [======================......]
    Rebalancing after osd.92 marked in (2s) [=====================.......]
    Rebalancing after osd.171 marked in (2s) [========================....]
    Rebalancing after osd.11 marked in (2s) [===========.................] (remaining: 2s)
    Rebalancing after osd.90 marked in (2s) [====================........]
    Rebalancing after osd.54 marked in (2s) [====================........]
    Rebalancing after osd.45 marked in (2s) [===================.........] (remaining: 1s)
    Rebalancing after osd.53 marked in (1s) [====================........]
    Rebalancing after osd.22 marked in (3s) [=======================.....]
    Rebalancing after osd.27 marked in (2s) [========================....]
    Rebalancing after osd.37 marked in (2s) [===.........................] (remaining: 14s)
    Rebalancing after osd.94 marked in (2s) [=======================.....]
    Rebalancing after osd.55 marked in (2s) [=====.......................] (remaining: 10s)
    Rebalancing after osd.35 marked in (2s) [=...........................] (remaining: 31s)
    Rebalancing after osd.43 marked in (2s) [================............] (remaining: 2s)
    Rebalancing after osd.13 marked in (2s) [=============...............] (remaining: 2s)
    Rebalancing after osd.79 marked in (2s) [=========================...]
    Rebalancing after osd.50 marked in (2s) [======......................] (remaining: 7s)
    Rebalancing after osd.33 marked in (1s) [............................]
    Rebalancing after osd.20 marked in (1s) [=======================.....]
    Rebalancing after osd.59 marked in (2s) [=====================.......]
    Rebalancing after osd.101 marked in (2s) [======================......]
    Rebalancing after osd.49 marked in (2s) [=====.......................] (remaining: 9s)
    Rebalancing after osd.36 marked in (2s) [==..........................] (remaining: 20s)
    Rebalancing after osd.133 marked in (2s) [=======================.....]
    Rebalancing after osd.29 marked in (2s) [======================......]
    Rebalancing after osd.8 marked in (2s) [===.........................] (remaining: 14s)
    Rebalancing after osd.16 marked in (2s) [========================....]
    Rebalancing after osd.38 marked in (2s) [===========.................] (remaining: 2s)
    Rebalancing after osd.68 marked in (2s) [=======================.....]
    Rebalancing after osd.130 marked in (2s) [======================......]
    Rebalancing after osd.117 marked in (2s) [======================......]
    Rebalancing after osd.155 marked in (2s) [========================....]
    Rebalancing after osd.10 marked in (2s) [==============..............] (remaining: 1s)
    Rebalancing after osd.141 marked in (1s) [=======================.....]
    Rebalancing after osd.52 marked in (2s) [====================........] (remaining: 1s)
    Rebalancing after osd.177 marked in (1s) [=======================.....]
    Rebalancing after osd.97 marked in (1s) [=======================.....]
    Rebalancing after osd.98 marked in (1s) [======================......]
    Rebalancing after osd.88 marked in (2s) [=====================.......]
    Rebalancing after osd.116 marked in (2s) [========================....]
    Rebalancing after osd.108 marked in (2s) [======================......]
    Rebalancing after osd.17 marked in (1s) [=====================.......]
    Rebalancing after osd.129 marked in (2s) [====================........]
    Rebalancing after osd.167 marked in (2s) [======================......]
    Rebalancing after osd.152 marked in (2s) [=======================.....]
    Rebalancing after osd.77 marked in (2s) [=======================.....]
    Rebalancing after osd.5 marked in (2s) [========....................] (remaining: 5s)
    Rebalancing after osd.121 marked in (1s) [======================......]
    Rebalancing after osd.26 marked in (2s) [==========================..]
    Rebalancing after osd.91 marked in (2s) [=======================.....]
    Rebalancing after osd.81 marked in (2s) [========================....]
    Rebalancing after osd.48 marked in (2s) [=====.......................] (remaining: 9s)
    Rebalancing after osd.32 marked in (2s) [=====================.......]
    Rebalancing after osd.125 marked in (2s) [========================....]
    Rebalancing after osd.111 marked in (2s) [======================......]
    Rebalancing after osd.151 marked in (2s) [======================......]
    Rebalancing after osd.39 marked in (2s) [============................] (remaining: 2s)
    Rebalancing after osd.136 marked in (2s) [========================....]
    Rebalancing after osd.112 marked in (1s) [=========================...]
    Rebalancing after osd.154 marked in (1s) [=========================...]
    Rebalancing after osd.64 marked in (2s) [===================.........]
    Rebalancing after osd.34 marked in (2s) [............................] (remaining: 90s)
    Rebalancing after osd.161 marked in (1s) [========================....]
    Rebalancing after osd.160 marked in (2s) [=======================.....]
    Rebalancing after osd.142 marked in (2s) [=======================.....]

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Frank Schilder <frans@xxxxxx>
Sent: Wednesday, July 12, 2023 9:53 AM
To: ceph-users@xxxxxxx
Subject: Cluster down after network outage

Hi all,

we had a network outage tonight (power loss) and restored network in the morning. All OSDs were running during this period. After restoring the network, peering hell broke loose and the cluster has a hard time coming back up again. OSDs get marked down all the time and come back later. Peering never stops.

Below is the current status. I had all OSDs shown as up for a while, but many were not responsive. Are there some flags that help bring things up in a sequence that causes less overload on the system?

[root@gnosis ~]# ceph status
  cluster:
    id:     XXX
    health: HEALTH_WARN
            2 clients failing to respond to capability release
            6 MDSs report slow metadata IOs
            3 MDSs report slow requests
            nodown,noout,nobackfill,norecover flag(s) set
            176 osds down
            Slow OSD heartbeats on back (longest 551718.679ms)
            Slow OSD heartbeats on front (longest 549598.330ms)
            Reduced data availability: 8069 pgs inactive, 3786 pgs down, 3161 pgs peering, 1341 pgs stale
            Degraded data redundancy: 1187354920/16402772667 objects degraded (7.239%), 6222 pgs degraded, 6231 pgs undersized
            1 pools nearfull
            17386 slow ops, oldest one blocked for 1811 sec, daemons [osd.1128,osd.1152,osd.1154,osd.12,osd.1227,osd.1244,osd.328,osd.354,osd.381,osd.4]... have slow ops.

  services:
    mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 (age 28m)
    mgr: ceph-25(active, since 30m), standbys: ceph-26, ceph-01, ceph-02, ceph-03
    mds: con-fs2:8 4 up:standby 8 up:active
    osd: 1260 osds: 1082 up (since 6m), 1258 in (since 18m); 266 remapped pgs
         flags nodown,noout,nobackfill,norecover

  data:
    pools:   14 pools, 25065 pgs
    objects: 1.91G objects, 3.4 PiB
    usage:   3.1 PiB used, 6.0 PiB / 9.0 PiB avail
    pgs:     0.626% pgs unknown
             31.566% pgs not active
             1187354920/16402772667 objects degraded (7.239%)
             51/16402772667 objects misplaced (0.000%)
             11706 active+clean
             4752  active+undersized+degraded
             3286  down
             2702  peering
             799   undersized+degraded+peered
             464   stale+down
             418   stale+active+undersized+degraded
             214   remapped+peering
             157   unknown
             128   stale+peering
             117   stale+remapped+peering
             101   stale+undersized+degraded+peered
             57    stale+active+undersized+degraded+remapped+backfilling
             35    down+remapped
             26    stale+undersized+degraded+remapped+backfilling+peered
             23    undersized+degraded+remapped+backfilling+peered
             14    active+clean+scrubbing+deep
             9     stale+active+undersized+degraded+remapped+backfill_wait
             7     active+recovering+undersized+degraded
             7     stale+active+recovering+undersized+degraded
             6     active+undersized+degraded+remapped+backfilling
             6     active+undersized
             5     active+undersized+degraded+remapped+backfill_wait
             5     stale+remapped
             4     stale+activating+undersized+degraded
             3     active+undersized+remapped
             3     stale+undersized+degraded+remapped+backfill_wait+peered
             1     activating+undersized+degraded
             1     activating+undersized+degraded+remapped
             1     undersized+degraded+remapped+backfill_wait+peered
             1     stale+active+clean
             1     active+recovering
             1     stale+down+remapped
             1     undersized+peered
             1     active+undersized+degraded+remapped
             1     active+clean+scrubbing
             1     active+clean+remapped
             1     active+recovering+degraded

  io:
    client:   1.8 MiB/s rd, 18 MiB/s wr, 409 op/s rd, 796 op/s wr

Thanks for any hints!
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
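
For anyone who wants to script the flag sequence from the top of this thread (set noout, nodown, norebalance, norecover; wait for the OSDs; unset in two stages), here is a rough sketch. It assumes the `ceph` CLI is on PATH with admin rights and that `ceph osd stat` prints a line of the form "1260 osds: 1258 up ..." like the status output quoted above. The names CEPH, POLL_SECONDS, set_flags, unset_flags, parse_up_count, wait_for_up, start_recovery and finish_recovery are made up for this example and are not part of Ceph. The "wait wait wait" step stays manual.

```shell
#!/bin/sh
# Sketch only, not a tested recovery tool.
# Assumptions: `ceph` is on PATH with admin rights, and `ceph osd stat`
# prints a line like "1260 osds: 1258 up (since 24m), 1258 in (since 45m)".
# CEPH, POLL_SECONDS and every function name below are hypothetical helpers.
CEPH="${CEPH:-ceph}"
POLL_SECONDS="${POLL_SECONDS:-30}"

# Set the given cluster-wide OSD flags, one `ceph osd set` call per flag.
set_flags() {
    for f in "$@"; do "$CEPH" osd set "$f" || return 1; done
}

# Unset the given flags, one `ceph osd unset` call per flag.
unset_flags() {
    for f in "$@"; do "$CEPH" osd unset "$f" || return 1; done
}

# Read status text on stdin and print the number of OSDs reported as up.
parse_up_count() {
    sed -n 's/.*osds: \([0-9][0-9]*\) up.*/\1/p' | head -n 1
}

# Block until at least $1 OSDs are reported up. A target count is used
# instead of "all" because the cluster above settled at 1258 of 1260.
wait_for_up() {
    target="$1"
    while :; do
        up="$("$CEPH" osd stat | parse_up_count)"
        [ -n "$up" ] && [ "$up" -ge "$target" ] && return 0
        sleep "$POLL_SECONDS"
    done
}

# Steps 1-3: freeze state, wait for the OSDs, then allow recovery again.
start_recovery() {
    set_flags noout nodown norebalance norecover || return 1
    wait_for_up "$1"
    unset_flags norebalance norecover
}

# Step 5: call by hand once PGs have become active (step 4 is a manual wait).
finish_recovery() {
    unset_flags nodown noout
}
```

Usage would be to source the file, run `start_recovery 1258`, watch `ceph status` until PGs are active, then run `finish_recovery`. Nothing runs automatically when the file is sourced.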