Re: Cluster down after network outage

Hi all,

I'll try to answer Dan and Stefan in one go.

The OSDs were not booting (in the sense of the ceph-osd daemon starting up); they were running the whole time. When they actually do boot, they come up quite fast even though they are HDDs with everything collocated. The time to be marked up is usually 30-60s after start, which is completely fine with me. We are not affected by pglog-dup accumulation or other issues causing excessive boot times and memory consumption.

On the affected part of the cluster we have 972 OSDs running on 900 disks: 48 SSDs carrying 1 or 4 OSDs each depending on size, and 852 HDDs with 1 OSD each. All OSDs were up and running, but the network to their hosts was down for a few hours. After the network came back, a peering-recovery-remap frenzy broke loose and we observed the death spiral described in this old post: https://narkive.com/KAzvjjPc.4

Specifically, just setting nodown didn't help; I guess our cluster is simply too large. I really had to stop recovery as well (norebalance alone had no effect). I had to do a bit of guessing as to when all OSDs were both marked up and seen up at the same time (not trivial when the nodown flag is set), but after all OSDs had been seen up for a while and the initial peering was over, enabling recovery worked as expected and did not lead to further OSDs being seen as down again.

In this particular situation, with the nodown flag set, it would be extremely helpful to have an extra piece of information in the OSD section of ceph status showing the number of OSDs "seen up". That would make it much easier to confirm that an OSD marked up some time ago is not currently "seen down" and merely not marked down again because of the flag. This was really critical for getting the cluster back.
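In the meantime, something along these lines might serve as a rough proxy: while nodown is set, failure reports still arrive at the MONs even though nothing gets marked down. The log path and the exact message wording below are assumptions from a package-based install and may differ between releases:

# Rough proxy for "seen down" while the nodown flag is set: failure reports
# keep showing up in the active MON's log even though nothing is marked down.
# (Path and message string are assumptions; adjust for your deployment.)
grep "reported failed" /var/log/ceph/ceph-mon.$(hostname -s).log | tail -n 20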

During all this, no part of the hardware was saturated. However, the way Ceph handles such a simultaneous OSD mass-return-to-life event does not seem very clever. As soon as some PGs manage to peer into an active state, they start recovery irrespective of whether they are merely remapped and the missing OSDs are actually there, just waiting to be marked up. This increases the load on the daemons unnecessarily. On top of this impatience-induced load amplification, the operations scheduler is not good at pushing high-priority messages like MON heartbeats through to busy daemons.

As a consequence, one gets constant up-and-down marking of OSDs even though the OSDs are actually running fine and nothing crashed. This implies constant re-peering of all PGs and accumulation of additional pglog history for every successful recovery or remapped write.
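A knob that might take some pressure off during such an event is the heartbeat grace period, so that busy but healthy OSDs are not flapped down quite so quickly. Just a sketch; the value is arbitrary and should be reverted once the cluster has settled:

# Give busy OSDs more time to answer heartbeats during the mass re-peering
# (60s is an arbitrary illustration value; the default is 20s).
ceph config set global osd_heartbeat_grace 60

# Revert once the cluster is stable again:
ceph config rm global osd_heartbeat_grace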

The right strategy in such a situation would be to wait for every single OSD to be marked up, then start peering, and only then check whether anything needs recovery. Trying to do all of this at the same time is just a mess.

In this particular case, setting osd_recovery_delay_start to a high value looks very promising for dealing with such a situation in a somewhat more hands-free way - if this option does anything in Octopus. I can see it in the ceph config options on the command line but not in the docs (it is implemented but not documented). Thanks for bringing this to my attention! Looking at the time it took for all OSDs to come up and peering to complete, 10-30 minutes might do the job for our cluster. 10-30 minutes on an HDD pool after simultaneously marking 900 OSDs up is actually a good time. I assume that recovery for a PG will be postponed whenever this PG (or one of its OSDs) has a peering event?
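For reference, this is roughly how I would inspect and set it; the 1800s value is just my guess based on the timings above:

# Confirm the option exists in this release and check its current value:
ceph config help osd_recovery_delay_start
ceph config get osd osd_recovery_delay_start

# Delay the start of recovery by ~30 minutes after an OSD (re)joins:
ceph config set osd osd_recovery_delay_start 1800

# Back to the default once the cluster is healthy again:
ceph config rm osd osd_recovery_delay_start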

Looking at the sequence of events, I would actually also like an osd_up_peer_delay_start option to postpone peering a bit while several OSDs are starting up, to avoid redundant re-peering. The ideal recovery sequence after a mass-down of OSDs really seems to be to perform each of these as atomic steps (a rough sketch of scripting this with existing flags follows the list):

- get all OSDs up
- peer all PGs
- start recovery if needed
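With the flags that exist today, such a sequence could be scripted roughly like this (a sketch only, assuming all OSD daemons really are running; the sleep intervals are arbitrary):

# 0. freeze the cluster
ceph osd set noout; ceph osd set nodown; ceph osd set norebalance; ceph osd set norecover

# 1. wait until no OSD is marked down any more
while ceph osd tree down | grep -q 'osd\.'; do sleep 30; done

# 2. wait until peering has settled
while ceph pg stat | grep -Eq 'peering|activating'; do sleep 30; done

# 3. only now allow recovery and rebalancing again
ceph osd unset norecover; ceph osd unset norebalance

# 4. finally drop the remaining safety flags
ceph osd unset nodown; ceph osd unset noout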

Being able to delay peering after an OSD comes up would give the "I'm up" messages some room to arrive with priority at the MONs. The PGs then have time to report "active+clean", and only after that should the trash be taken out.

Thanks for replying to my messages and for pointing me to osd_recovery_delay_start.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Dan van der Ster <dan.vanderster@xxxxxxxxx>
Sent: Wednesday, July 12, 2023 6:58 PM
To: Frank Schilder
Cc: ceph-users@xxxxxxx
Subject: Re:  Re: Cluster down after network outage

On Wed, Jul 12, 2023 at 1:26 AM Frank Schilder <frans@xxxxxx> wrote:

Hi all,

One problem solved, another one coming up. For everyone ending up in the same situation: the trick seems to be to get all OSDs marked up and only then allow recovery. Steps to take:

- set noout, nodown, norebalance, norecover
- wait patiently until all OSDs are shown as up
- unset norebalance, norecover
- wait wait wait, PGs will eventually become active as OSDs become responsive
- unset nodown, noout

Nice work bringing the cluster back up.
Looking into an OSD log would give more detail about why they were flapping. Are these HDDs? Are the block.dbs on flash?
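For example, something along these lines might show whether they missed heartbeats or were repeatedly marked down by the MONs (log path, OSD id and message strings are just typical examples and may vary):

# Typical flapping-related messages in an OSD log (osd.123 is a placeholder):
grep -E "wrongly marked me down|heartbeat_check: no reply" \
    /var/log/ceph/ceph-osd.123.log | tail -n 20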

Generally, I've found that on clusters with OSDs that are slow to boot and flap up and down, "nodown" is sufficient to recover from such issues.

Cheers, Dan

______________________________________________________
Clyso GmbH | Ceph Support and Consulting | https://www.clyso.com





Now the new problem: I have an ever-growing list of OSDs listed as rebalancing, but nothing is actually rebalancing. How can I stop this growth and how can I get rid of this list:

[root@gnosis ~]# ceph status
  cluster:
    id:     XXX
    health: HEALTH_WARN
            noout flag(s) set
            Slow OSD heartbeats on back (longest 634775.858ms)
            Slow OSD heartbeats on front (longest 635210.412ms)
            1 pools nearfull

  services:
    mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 (age 6m)
    mgr: ceph-25(active, since 57m), standbys: ceph-26, ceph-01, ceph-02, ceph-03
    mds: con-fs2:8 4 up:standby 8 up:active
    osd: 1260 osds: 1258 up (since 24m), 1258 in (since 45m)
         flags noout

  data:
    pools:   14 pools, 25065 pgs
    objects: 1.97G objects, 3.5 PiB
    usage:   4.4 PiB used, 8.7 PiB / 13 PiB avail
    pgs:     25028 active+clean
             30    active+clean+scrubbing+deep
             7     active+clean+scrubbing

  io:
    client:   1.3 GiB/s rd, 718 MiB/s wr, 7.71k op/s rd, 2.54k op/s wr

  progress:
    Rebalancing after osd.135 marked in (1s)
      [=====================.......]
    Rebalancing after osd.69 marked in (2s)
      [========================....]
    Rebalancing after osd.75 marked in (2s)
      [=======================.....]
    Rebalancing after osd.173 marked in (2s)
      [========================....]
    Rebalancing after osd.42 marked in (1s)
      [=============...............] (remaining: 2s)
    Rebalancing after osd.104 marked in (2s)
      [========================....]
    Rebalancing after osd.82 marked in (2s)
      [========================....]
    Rebalancing after osd.107 marked in (2s)
      [=======================.....]
    Rebalancing after osd.19 marked in (2s)
      [=======================.....]
    Rebalancing after osd.67 marked in (2s)
      [=====================.......]
    Rebalancing after osd.46 marked in (2s)
      [===================.........] (remaining: 1s)
    Rebalancing after osd.123 marked in (2s)
      [=======================.....]
    Rebalancing after osd.66 marked in (2s)
      [====================........]
    Rebalancing after osd.12 marked in (2s)
      [==============..............] (remaining: 2s)
    Rebalancing after osd.95 marked in (2s)
      [=====================.......]
    Rebalancing after osd.134 marked in (2s)
      [=======================.....]
    Rebalancing after osd.14 marked in (1s)
      [===================.........]
    Rebalancing after osd.56 marked in (2s)
      [=====================.......]
    Rebalancing after osd.143 marked in (1s)
      [========================....]
    Rebalancing after osd.118 marked in (2s)
      [=======================.....]
    Rebalancing after osd.96 marked in (2s)
      [========================....]
    Rebalancing after osd.105 marked in (2s)
      [=======================.....]
    Rebalancing after osd.44 marked in (1s)
      [=======.....................] (remaining: 5s)
    Rebalancing after osd.41 marked in (1s)
      [==============..............] (remaining: 1s)
    Rebalancing after osd.9 marked in (2s)
      [=...........................] (remaining: 37s)
    Rebalancing after osd.58 marked in (2s)
      [======......................] (remaining: 8s)
    Rebalancing after osd.140 marked in (1s)
      [=======================.....]
    Rebalancing after osd.132 marked in (2s)
      [========================....]
    Rebalancing after osd.31 marked in (1s)
      [=========================...]
    Rebalancing after osd.110 marked in (2s)
      [========================....]
    Rebalancing after osd.21 marked in (2s)
      [=========================...]
    Rebalancing after osd.114 marked in (2s)
      [=======================.....]
    Rebalancing after osd.83 marked in (2s)
      [=======================.....]
    Rebalancing after osd.23 marked in (1s)
      [=======================.....]
    Rebalancing after osd.25 marked in (1s)
      [==========================..]
    Rebalancing after osd.147 marked in (2s)
      [========================....]
    Rebalancing after osd.62 marked in (1s)
      [======================......]
    Rebalancing after osd.57 marked in (2s)
      [======================......]
    Rebalancing after osd.61 marked in (2s)
      [====================........]
    Rebalancing after osd.71 marked in (2s)
      [===================.........]
    Rebalancing after osd.80 marked in (2s)
      [======================......]
    Rebalancing after osd.92 marked in (2s)
      [=====================.......]
    Rebalancing after osd.171 marked in (2s)
      [========================....]
    Rebalancing after osd.11 marked in (2s)
      [===========.................] (remaining: 2s)
    Rebalancing after osd.90 marked in (2s)
      [====================........]
    Rebalancing after osd.54 marked in (2s)
      [====================........]
    Rebalancing after osd.45 marked in (2s)
      [===================.........] (remaining: 1s)
    Rebalancing after osd.53 marked in (1s)
      [====================........]
    Rebalancing after osd.22 marked in (3s)
      [=======================.....]
    Rebalancing after osd.27 marked in (2s)
      [========================....]
    Rebalancing after osd.37 marked in (2s)
      [===.........................] (remaining: 14s)
    Rebalancing after osd.94 marked in (2s)
      [=======================.....]
    Rebalancing after osd.55 marked in (2s)
      [=====.......................] (remaining: 10s)
    Rebalancing after osd.35 marked in (2s)
      [=...........................] (remaining: 31s)
    Rebalancing after osd.43 marked in (2s)
      [================............] (remaining: 2s)
    Rebalancing after osd.13 marked in (2s)
      [=============...............] (remaining: 2s)
    Rebalancing after osd.79 marked in (2s)
      [=========================...]
    Rebalancing after osd.50 marked in (2s)
      [======......................] (remaining: 7s)
    Rebalancing after osd.33 marked in (1s)
      [............................]
    Rebalancing after osd.20 marked in (1s)
      [=======================.....]
    Rebalancing after osd.59 marked in (2s)
      [=====================.......]
    Rebalancing after osd.101 marked in (2s)
      [======================......]
    Rebalancing after osd.49 marked in (2s)
      [=====.......................] (remaining: 9s)
    Rebalancing after osd.36 marked in (2s)
      [==..........................] (remaining: 20s)
    Rebalancing after osd.133 marked in (2s)
      [=======================.....]
    Rebalancing after osd.29 marked in (2s)
      [======================......]
    Rebalancing after osd.8 marked in (2s)
      [===.........................] (remaining: 14s)
    Rebalancing after osd.16 marked in (2s)
      [========================....]
    Rebalancing after osd.38 marked in (2s)
      [===========.................] (remaining: 2s)
    Rebalancing after osd.68 marked in (2s)
      [=======================.....]
    Rebalancing after osd.130 marked in (2s)
      [======================......]
    Rebalancing after osd.117 marked in (2s)
      [======================......]
    Rebalancing after osd.155 marked in (2s)
      [========================....]
    Rebalancing after osd.10 marked in (2s)
      [==============..............] (remaining: 1s)
    Rebalancing after osd.141 marked in (1s)
      [=======================.....]
    Rebalancing after osd.52 marked in (2s)
      [====================........] (remaining: 1s)
    Rebalancing after osd.177 marked in (1s)
      [=======================.....]
    Rebalancing after osd.97 marked in (1s)
      [=======================.....]
    Rebalancing after osd.98 marked in (1s)
      [======================......]
    Rebalancing after osd.88 marked in (2s)
      [=====================.......]
    Rebalancing after osd.116 marked in (2s)
      [========================....]
    Rebalancing after osd.108 marked in (2s)
      [======================......]
    Rebalancing after osd.17 marked in (1s)
      [=====================.......]
    Rebalancing after osd.129 marked in (2s)
      [====================........]
    Rebalancing after osd.167 marked in (2s)
      [======================......]
    Rebalancing after osd.152 marked in (2s)
      [=======================.....]
    Rebalancing after osd.77 marked in (2s)
      [=======================.....]
    Rebalancing after osd.5 marked in (2s)
      [========....................] (remaining: 5s)
    Rebalancing after osd.121 marked in (1s)
      [======================......]
    Rebalancing after osd.26 marked in (2s)
      [==========================..]
    Rebalancing after osd.91 marked in (2s)
      [=======================.....]
    Rebalancing after osd.81 marked in (2s)
      [========================....]
    Rebalancing after osd.48 marked in (2s)
      [=====.......................] (remaining: 9s)
    Rebalancing after osd.32 marked in (2s)
      [=====================.......]
    Rebalancing after osd.125 marked in (2s)
      [========================....]
    Rebalancing after osd.111 marked in (2s)
      [======================......]
    Rebalancing after osd.151 marked in (2s)
      [======================......]
    Rebalancing after osd.39 marked in (2s)
      [============................] (remaining: 2s)
    Rebalancing after osd.136 marked in (2s)
      [========================....]
    Rebalancing after osd.112 marked in (1s)
      [=========================...]
    Rebalancing after osd.154 marked in (1s)
      [=========================...]
    Rebalancing after osd.64 marked in (2s)
      [===================.........]
    Rebalancing after osd.34 marked in (2s)
      [............................] (remaining: 90s)
    Rebalancing after osd.161 marked in (1s)
      [========================....]
    Rebalancing after osd.160 marked in (2s)
      [=======================.....]
    Rebalancing after osd.142 marked in (2s)
      [=======================.....]

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Frank Schilder <frans@xxxxxx>
Sent: Wednesday, July 12, 2023 9:53 AM
To: ceph-users@xxxxxxx
Subject:  Cluster down after network outage

Hi all,

We had a network outage tonight (power loss) and restored the network in the morning. All OSDs were running during this period. After the network was restored, peering hell broke loose and the cluster is having a hard time coming back up. OSDs get marked down all the time and come back later; peering never stops.

Below is the current status. I had all OSDs shown as up for a while, but many were not responsive. Are there flags that help bring things up in a sequence that causes less overload on the system?

[root@gnosis ~]# ceph status
  cluster:
    id:     XXX
    health: HEALTH_WARN
            2 clients failing to respond to capability release
            6 MDSs report slow metadata IOs
            3 MDSs report slow requests
            nodown,noout,nobackfill,norecover flag(s) set
            176 osds down
            Slow OSD heartbeats on back (longest 551718.679ms)
            Slow OSD heartbeats on front (longest 549598.330ms)
            Reduced data availability: 8069 pgs inactive, 3786 pgs down, 3161 pgs peering, 1341 pgs stale
            Degraded data redundancy: 1187354920/16402772667 objects degraded (7.239%), 6222 pgs degraded, 6231 pgs undersized
            1 pools nearfull
            17386 slow ops, oldest one blocked for 1811 sec, daemons [osd.1128,osd.1152,osd.1154,osd.12,osd.1227,osd.1244,osd.328,osd.354,osd.381,osd.4]... have slow ops.

  services:
    mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 (age 28m)
    mgr: ceph-25(active, since 30m), standbys: ceph-26, ceph-01, ceph-02, ceph-03
    mds: con-fs2:8 4 up:standby 8 up:active
    osd: 1260 osds: 1082 up (since 6m), 1258 in (since 18m); 266 remapped pgs
         flags nodown,noout,nobackfill,norecover

  data:
    pools:   14 pools, 25065 pgs
    objects: 1.91G objects, 3.4 PiB
    usage:   3.1 PiB used, 6.0 PiB / 9.0 PiB avail
    pgs:     0.626% pgs unknown
             31.566% pgs not active
             1187354920/16402772667 objects degraded (7.239%)
             51/16402772667 objects misplaced (0.000%)
             11706 active+clean
             4752  active+undersized+degraded
             3286  down
             2702  peering
             799   undersized+degraded+peered
             464   stale+down
             418   stale+active+undersized+degraded
             214   remapped+peering
             157   unknown
             128   stale+peering
             117   stale+remapped+peering
             101   stale+undersized+degraded+peered
             57    stale+active+undersized+degraded+remapped+backfilling
             35    down+remapped
             26    stale+undersized+degraded+remapped+backfilling+peered
             23    undersized+degraded+remapped+backfilling+peered
             14    active+clean+scrubbing+deep
             9     stale+active+undersized+degraded+remapped+backfill_wait
             7     active+recovering+undersized+degraded
             7     stale+active+recovering+undersized+degraded
             6     active+undersized+degraded+remapped+backfilling
             6     active+undersized
             5     active+undersized+degraded+remapped+backfill_wait
             5     stale+remapped
             4     stale+activating+undersized+degraded
             3     active+undersized+remapped
             3     stale+undersized+degraded+remapped+backfill_wait+peered
             1     activating+undersized+degraded
             1     activating+undersized+degraded+remapped
             1     undersized+degraded+remapped+backfill_wait+peered
             1     stale+active+clean
             1     active+recovering
             1     stale+down+remapped
             1     undersized+peered
             1     active+undersized+degraded+remapped
             1     active+clean+scrubbing
             1     active+clean+remapped
             1     active+recovering+degraded

  io:
    client:   1.8 MiB/s rd, 18 MiB/s wr, 409 op/s rd, 796 op/s wr

Thanks for any hints!
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



