Hi all, we had a network outage tonight (power loss) and restored the network in the morning. All OSDs were running during this period. After the network came back, peering hell broke loose and the cluster is having a hard time coming back up again. OSDs get marked down all the time and come back later. Peering never stops. Below is the current status. I had all OSDs shown as up for a while, but many were not responsive. Are there any flags that help bring things up in a sequence that puts less load on the system?

[root@gnosis ~]# ceph status
  cluster:
    id:     XXX
    health: HEALTH_WARN
            2 clients failing to respond to capability release
            6 MDSs report slow metadata IOs
            3 MDSs report slow requests
            nodown,noout,nobackfill,norecover flag(s) set
            176 osds down
            Slow OSD heartbeats on back (longest 551718.679ms)
            Slow OSD heartbeats on front (longest 549598.330ms)
            Reduced data availability: 8069 pgs inactive, 3786 pgs down, 3161 pgs peering, 1341 pgs stale
            Degraded data redundancy: 1187354920/16402772667 objects degraded (7.239%), 6222 pgs degraded, 6231 pgs undersized
            1 pools nearfull
            17386 slow ops, oldest one blocked for 1811 sec, daemons [osd.1128,osd.1152,osd.1154,osd.12,osd.1227,osd.1244,osd.328,osd.354,osd.381,osd.4]... have slow ops.

  services:
    mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 (age 28m)
    mgr: ceph-25(active, since 30m), standbys: ceph-26, ceph-01, ceph-02, ceph-03
    mds: con-fs2:8 4 up:standby 8 up:active
    osd: 1260 osds: 1082 up (since 6m), 1258 in (since 18m); 266 remapped pgs
         flags nodown,noout,nobackfill,norecover

  data:
    pools:   14 pools, 25065 pgs
    objects: 1.91G objects, 3.4 PiB
    usage:   3.1 PiB used, 6.0 PiB / 9.0 PiB avail
    pgs:     0.626% pgs unknown
             31.566% pgs not active
             1187354920/16402772667 objects degraded (7.239%)
             51/16402772667 objects misplaced (0.000%)
             11706 active+clean
             4752  active+undersized+degraded
             3286  down
             2702  peering
             799   undersized+degraded+peered
             464   stale+down
             418   stale+active+undersized+degraded
             214   remapped+peering
             157   unknown
             128   stale+peering
             117   stale+remapped+peering
             101   stale+undersized+degraded+peered
             57    stale+active+undersized+degraded+remapped+backfilling
             35    down+remapped
             26    stale+undersized+degraded+remapped+backfilling+peered
             23    undersized+degraded+remapped+backfilling+peered
             14    active+clean+scrubbing+deep
             9     stale+active+undersized+degraded+remapped+backfill_wait
             7     active+recovering+undersized+degraded
             7     stale+active+recovering+undersized+degraded
             6     active+undersized+degraded+remapped+backfilling
             6     active+undersized
             5     active+undersized+degraded+remapped+backfill_wait
             5     stale+remapped
             4     stale+activating+undersized+degraded
             3     active+undersized+remapped
             3     stale+undersized+degraded+remapped+backfill_wait+peered
             1     activating+undersized+degraded
             1     activating+undersized+degraded+remapped
             1     undersized+degraded+remapped+backfill_wait+peered
             1     stale+active+clean
             1     active+recovering
             1     stale+down+remapped
             1     undersized+peered
             1     active+undersized+degraded+remapped
             1     active+clean+scrubbing
             1     active+clean+remapped
             1     active+recovering+degraded

  io:
    client: 1.8 MiB/s rd, 18 MiB/s wr, 409 op/s rd, 796 op/s wr

Thanks for any hints!

=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
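PS: For reference, the recovery flags visible in the status above were set with the usual CLI after the flapping started, roughly like this:

[root@gnosis ~]# ceph osd set noout
[root@gnosis ~]# ceph osd set nodown
[root@gnosis ~]# ceph osd set nobackfill
[root@gnosis ~]# ceph osd set norecover

and I would clear them again with "ceph osd unset <flag>" once the cluster settles. The question is whether there are additional flags or a better order for lifting these that avoids overloading the OSDs during peering.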