Hi Frank,

On Tue, May 5, 2020 at 10:43 AM Frank Schilder <frans@xxxxxx> wrote:
>
> Dear Dan,
>
> thank you for your fast response. Please find the log of the first OSD
> that went down and the ceph.log with these links:
>
> https://files.dtu.dk/u/tF1zv5zdc6mmXXO_/ceph.log?l
> https://files.dtu.dk/u/hPb5qax2-b6W9vmp/ceph-osd.2.log?l
>
> I can collect more osd logs if this helps.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Dan van der Ster <dan@xxxxxxxxxxxxxx>
> Sent: 05 May 2020 16:25:31
> To: Frank Schilder
> Cc: ceph-users
> Subject: Re: Ceph meltdown, need help
>
> Hi Frank,
>
> Could you share any ceph-osd logs and also the ceph.log from a mon to
> see why the cluster thinks all those osds are down?
>
> Simply marking them up isn't going to help, I'm afraid.
>
> Cheers, Dan
>
>
> On Tue, May 5, 2020 at 4:12 PM Frank Schilder <frans@xxxxxx> wrote:
> >
> > Hi all,
> >
> > a lot of OSDs crashed in our cluster. Mimic 13.2.8. Current status
> > included below. All daemons are running, no OSD process crashed. Can I
> > start marking OSDs in and up to get them back talking to each other?
> >
> > Please advise on next steps. Thanks!!
> >
> > [root@gnosis ~]# ceph status
> >   cluster:
> >     id:     e4ece518-f2cb-4708-b00f-b6bf511e91d9
> >     health: HEALTH_WARN
> >             2 MDSs report slow metadata IOs
> >             1 MDSs report slow requests
> >             nodown,noout,norecover flag(s) set
> >             125 osds down
> >             3 hosts (48 osds) down
> >             Reduced data availability: 2221 pgs inactive, 1943 pgs down, 190 pgs peering, 13 pgs stale
> >             Degraded data redundancy: 5134396/500993581 objects degraded (1.025%), 296 pgs degraded, 299 pgs undersized
> >             9622 slow ops, oldest one blocked for 2913 sec, daemons [osd.0,osd.100,osd.101,osd.112,osd.118,osd.133,osd.136,osd.142,osd.144,osd.145]... have slow ops.
> >
> >   services:
> >     mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
> >     mgr: ceph-02(active), standbys: ceph-03, ceph-01
> >     mds: con-fs2-1/1/1 up {0=ceph-08=up:active}, 1 up:standby-replay
> >     osd: 288 osds: 90 up, 215 in; 230 remapped pgs
> >          flags nodown,noout,norecover
> >
> >   data:
> >     pools:   10 pools, 2545 pgs
> >     objects: 62.61 M objects, 144 TiB
> >     usage:   219 TiB used, 1.6 PiB / 1.8 PiB avail
> >     pgs:     1.729% pgs unknown
> >              85.540% pgs not active
> >              5134396/500993581 objects degraded (1.025%)
> >              1796 down
> >              226  active+undersized+degraded
> >              147  down+remapped
> >              140  peering
> >              65   active+clean
> >              44   unknown
> >              38   undersized+degraded+peered
> >              38   remapped+peering
> >              17   active+undersized+degraded+remapped+backfill_wait
> >              12   stale+peering
> >              12   active+undersized+degraded+remapped+backfilling
> >              4    active+undersized+remapped
> >              2    remapped
> >              2    undersized+degraded+remapped+peered
> >              1    stale
> >              1    undersized+degraded+remapped+backfilling+peered
> >
> >   io:
> >     client: 26 KiB/s rd, 206 KiB/s wr, 21 op/s rd, 50 op/s wr
> >
> > [root@gnosis ~]# ceph health detail
> > HEALTH_WARN 2 MDSs report slow metadata IOs; 1 MDSs report slow requests; nodown,noout,norecover flag(s) set; 125 osds down; 3 hosts (48 osds) down; Reduced data availability: 2219 pgs inactive, 1943 pgs down, 188 pgs peering, 13 pgs stale; Degraded data redundancy: 5214696/500993589 objects degraded (1.041%), 298 pgs degraded, 299 pgs undersized; 9788 slow ops, oldest one blocked for 2953 sec, daemons [osd.0,osd.100,osd.101,osd.112,osd.118,osd.133,osd.136,osd.142,osd.144,osd.145]... have slow ops.
> > MDS_SLOW_METADATA_IO 2 MDSs report slow metadata IOs
> >     mdsceph-08(mds.0): 100+ slow metadata IOs are blocked > 30 secs, oldest blocked for 2940 secs
> >     mdsceph-12(mds.0): 1 slow metadata IOs are blocked > 30 secs, oldest blocked for 2942 secs
> > MDS_SLOW_REQUEST 1 MDSs report slow requests
> >     mdsceph-08(mds.0): 100 slow requests are blocked > 30 secs
> > OSDMAP_FLAGS nodown,noout,norecover flag(s) set
> > OSD_DOWN 125 osds down
> >     osd.0 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-21) is down
> >     osd.6 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-12) is down
> >     osd.7 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-10) is down
> >     osd.8 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-11) is down
> >     osd.16 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-08) is down
> >     osd.18 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-10) is down
> >     osd.19 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-11) is down
> >     osd.21 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-13) is down
> >     osd.31 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-18) is down
> >     osd.37 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-04) is down
> >     osd.38 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-07) is down
> >     osd.48 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-04) is down
> >     osd.51 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-22) is down
> >     osd.53 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-21) is down
> >     osd.55 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-19) is down
> >     osd.62 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-17) is down
> >     osd.67 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-11) is down
> >     osd.72 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-21) is down
> >     osd.75 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-08) is down
> >     osd.78 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-10) is down
> >     osd.79 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-11) is down
> >     osd.80 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-12) is down
> >     osd.81 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-13) is down
> >     osd.82 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-14) is down
> >     osd.83 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-15) is down
> >     osd.88 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-08) is down
> >     osd.89 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-10) is down
> >     osd.92 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-13) is down
> >     osd.93 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-12) is down
> >     osd.95 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-15) is down
> >     osd.96 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-16) is down
> >     osd.97 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-17) is down
> >     osd.100 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-13) is down
> >     osd.104 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-12) is down
> >     osd.105 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-13) is down
> >     osd.107 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-15) is down
> >     osd.108 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-17) is down
> >     osd.109 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-16) is down
> >     osd.111 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-14) is down
> >     osd.113 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-10) is down
> >     osd.114 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-09) is down
> >     osd.116 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-12) is down
> >     osd.117 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-13) is down
> >     osd.119 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-15) is down
> >     osd.122 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-12) is down
> >     osd.123 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-15) is down
> >     osd.124 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-08) is down
> >     osd.125 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-09) is down
> >     osd.126 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-10) is down
> >     osd.128 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-12) is down
> >     osd.131 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-15) is down
> >     osd.134 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-13) is down
> >     osd.139 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-10) is down
> >     osd.140 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-12) is down
> >     osd.141 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-13) is down
> >     osd.145 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-04) is down
> >     osd.149 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-10) is down
> >     osd.151 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-09) is down
> >     osd.152 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-12) is down
> >     osd.153 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-13) is down
> >     osd.154 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-14) is down
> >     osd.155 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-15) is down
> >     osd.156 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-04) is down
> >     osd.157 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-05) is down
> >     osd.159 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-07) is down
> >     osd.161 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-09) is down
> >     osd.162 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-10) is down
> >     osd.164 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-12) is down
> >     osd.165 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-13) is down
> >     osd.166 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-15) is down
> >     osd.167 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-14) is down
> >     osd.171 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-08) is down
> >     osd.172 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-07) is down
> >     osd.174 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-10) is down
> >     osd.176 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-12) is down
> >     osd.177 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-13) is down
> >     osd.179 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-15) is down
> >     osd.182 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-06) is down
> >     osd.183 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-07) is down
> >     osd.184 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-08) is down
> >     osd.186 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-10) is down
> >     osd.187 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-11) is down
> >     osd.190 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-14) is down
> >     osd.191 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-15) is down
> >     osd.194 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-16) is down
> >     osd.195 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-17) is down
> >     osd.196 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-16) is down
> >     osd.199 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-17) is down
> >     osd.200 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-16) is down
> >     osd.201 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-17) is down
> >     osd.202 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-16) is down
> >     osd.203 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-17) is down
> >     osd.204 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-16) is down
> >     osd.208 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-08) is down
> >     osd.210 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-08) is down
> >     osd.212 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-10) is down
> >     osd.213 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-11) is down
> >     osd.214 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-09) is down
> >     osd.215 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-10) is down
> >     osd.216 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-11) is down
> >     osd.218 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-09) is down
> >     osd.219 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-11) is down
> >     osd.221 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-12) is down
> >     osd.224 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-16) is down
> >     osd.226 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1,host=ceph-17) is down
> >     osd.228 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-20) is down
> >     osd.230 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-20) is down
> >     osd.233 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-19) is down
> >     osd.236 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-19) is down
> >     osd.238 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-18) is down
> >     osd.247 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-21) is down
> >     osd.248 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-18) is down
> >     osd.254 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-04) is down
> >     osd.256 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-04) is down
> >     osd.259 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-18) is down
> >     osd.260 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-20) is down
> >     osd.262 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-19) is down
> >     osd.266 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-18) is down
> >     osd.267 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-18) is down
> >     osd.272 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-20) is down
> >     osd.274 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-21) is down
> >     osd.275 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-19) is down
> >     osd.276 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-22) is down
> >     osd.281 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-22) is down
> >     osd.285 (root=DTU,region=Risoe,datacenter=ServerRoom,room=SR-113,host=ceph-05) is down
> > OSD_HOST_DOWN 3 hosts (48 osds) down
> >     host ceph-11 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1) (16 osds) is down
> >     host ceph-10 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1) (16 osds) is down
> >     host ceph-13 (root=DTU,region=Risoe,datacenter=ContainerSquare,room=CON-161A1) (16 osds) is down
> > PG_AVAILABILITY Reduced data availability: 2219 pgs inactive, 1943 pgs down, 188 pgs peering, 13 pgs stale
> >     pg 14.513 is stuck inactive for 1681.564244, current state down, last acting [2147483647,2147483647,2147483647,2147483647,2147483647,143,2147483647,2147483647,2147483647,2147483647]
> >     pg 14.514 is down, acting [193,2147483647,2147483647,2147483647,2147483647,118,2147483647,2147483647,2147483647,2147483647]
> >     pg 14.515 is down, acting [2147483647,2147483647,2147483647,211,133,135,2147483647,2147483647,2147483647,2147483647]
> >     pg 14.516 is down, acting [2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,205,2147483647]
> >     pg 14.517 is down, acting [2147483647,2147483647,5,2147483647,2147483647,2147483647,2147483647,2147483647,61,112]
> >     pg 14.518 is down, acting [2147483647,198,2147483647,2147483647,2147483647,2147483647,4,185,2147483647,2147483647]
> >     pg 14.519 is down, acting [2147483647,2147483647,68,2147483647,2147483647,2147483647,2147483647,185,2147483647,94]
> >     pg 14.51a is down, acting [2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,101,2147483647]
> >     pg 14.51b is down, acting [2147483647,2147483647,2147483647,2147483647,2147483647,197,2147483647,2147483647,2147483647,2147483647]
> >     pg 14.51c is down, acting [193,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,197]
> >     pg 14.51d is down, acting [2147483647,2147483647,61,2147483647,77,2147483647,2147483647,2147483647,112,2147483647]
> >     pg 14.51e is down, acting [2147483647,2147483647,2147483647,2147483647,112,2147483647,2147483647,193,2147483647,2147483647]
> >     pg 14.51f is down, acting [2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,94,2147483647,2147483647]
> >     pg 14.520 is down, acting [2147483647,2147483647,2147483647,2147483647,2147483647,207,2147483647,101,133,2147483647]
> >     pg 14.521 is down, acting [205,2147483647,133,2147483647,2147483647,2147483647,2147483647,4,2147483647,193]
> >     pg 14.522 is down, acting [101,2147483647,2147483647,11,197,2147483647,136,94,2147483647,2147483647]
> >     pg 14.523 is down, acting [2147483647,2147483647,2147483647,118,2147483647,71,2147483647,2147483647,2147483647,2147483647]
> >     pg 14.524 is down, acting [2147483647,111,2147483647,2147483647,2147483647,8,2147483647,112,2147483647,2147483647]
> >     pg 14.525 is down, acting [2147483647,2147483647,2147483647,142,2147483647,61,2147483647,2147483647,2147483647,2147483647]
> >     pg 14.526 is down, acting [2147483647,2147483647,2147483647,2147483647,2147483647,61,193,2147483647,2147483647,2147483647]
> >     pg 14.527 is down, acting [2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,109,2147483647,2147483647]
> >     pg 14.528 is down, acting [2147483647,133,2147483647,2147483647,2147483647,2147483647,4,2147483647,2147483647,2147483647]
> >     pg 14.529 is down, acting [2147483647,112,2147483647,2147483647,2147483647,2147483647,185,2147483647,118,2147483647]
> >     pg 14.52a is down, acting [2147483647,2147483647,2147483647,2147483647,2147483647,136,2147483647,135,2147483647,2147483647]
> >     pg 14.52b is down, acting [2147483647,2147483647,2147483647,112,142,211,2147483647,2147483647,2147483647,2147483647]
> >     pg 14.52c is down, acting [185,2147483647,198,2147483647,118,2147483647,2147483647,2147483647,2147483647,2147483647]
> >     pg 14.52d is down, acting [2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,5,2147483647,2147483647,2147483647]
> >     pg 14.52e is down, acting [71,101,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,142,2147483647]
> >     pg 14.52f is down, acting [198,2147483647,2147483647,2147483647,2147483647,11,2147483647,2147483647,118,2147483647]
> >     pg 14.530 is down, acting [142,2147483647,2147483647,2147483647,133,2147483647,2147483647,2147483647,2147483647,112]
> >     pg 14.531 is down, acting [2147483647,142,2147483647,2147483647,2147483647,185,2147483647,2147483647,2147483647,2147483647]
> >     pg 14.532 is down, acting [135,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,136,118]
> >     pg 14.533 is down, acting [2147483647,77,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647]
> >     pg 14.534 is down, acting [2147483647,2147483647,2147483647,185,118,2147483647,2147483647,207,2147483647,2147483647]
> >     pg 14.535 is down, acting [2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,136,142,133,2147483647]
> >     pg 14.536 is down, acting [2147483647,11,2147483647,2147483647,136,2147483647,2147483647,2147483647,2147483647,2147483647]
> >     pg 14.537 is down, acting [2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,77,2147483647]
> >     pg 14.538 is down, acting [2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,205,2147483647,2147483647]
> >     pg 14.539 is down, acting [2147483647,2147483647,2147483647,198,2147483647,2147483647,4,2147483647,2147483647,2147483647]
> >     pg 14.53a is down, acting [2147483647,11,136,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647]
> >     pg 14.53b is down, acting [2147483647,2147483647,2147483647,2147483647,112,2147483647,2147483647,2147483647,2147483647,2147483647]
> >     pg 14.53c is down, acting [2147483647,2147483647,2147483647,71,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647]
> >     pg 14.53d is down, acting [2147483647,2147483647,2147483647,185,2147483647,2147483647,2147483647,2147483647,2147483647,136]
> >     pg 14.53e is down, acting [2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,112,185]
> >     pg 14.53f is down, acting [2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,185,2147483647,2147483647,2147483647]
> >     pg 14.540 is down, acting [205,2147483647,2147483647,2147483647,2147483647,2147483647,142,2147483647,112,77]
> >     pg 14.541 is down, acting [2147483647,2147483647,2147483647,2147483647,2147483647,197,211,2147483647,2147483647,2147483647]
> >     pg 14.542 is down, acting [112,2147483647,101,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647]
> >     pg 14.543 is down, acting [111,2147483647,2147483647,2147483647,2147483647,101,2147483647,2147483647,2147483647,2147483647]
> >     pg 14.544 is down, acting [4,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,205]
> >     pg 14.545 is down, acting [2147483647,2147483647,2147483647,2147483647,2147483647,142,5,2147483647,2147483647,2147483647]
> > PG_DEGRADED Degraded data redundancy: 5214696/500993589 objects degraded (1.041%), 298 pgs degraded, 299 pgs undersized
> >     pg 1.29 is stuck undersized for 2075.633328, current state active+undersized+degraded, last acting [253,258]
> >     pg 1.2a is stuck undersized for 1642.864920, current state active+undersized+degraded, last acting [252,255]
> >     pg 1.2b is stuck undersized for 2355.149928, current state active+undersized+degraded+remapped+backfill_wait, last acting [240,268]
> >     pg 1.2c is stuck undersized for 1459.277329, current state active+undersized+degraded, last acting [241,273]
> >     pg 1.2d is stuck undersized for 803.339131, current state undersized+degraded+peered, last acting [282]
> >     pg 2.25 is active+undersized+degraded, acting [253,2147483647,2147483647,258,261,273,277,243]
> >     pg 2.28 is stuck undersized for 803.340163, current state active+undersized+degraded, last acting [282,241,246,2147483647,273,252,2147483647,268]
> >     pg 2.29 is stuck undersized for 803.341160, current state active+undersized+degraded, last acting [240,258,277,264,2147483647,2147483647,271,250]
> >     pg 2.2a is stuck undersized for 1447.684978, current state active+undersized+degraded+remapped+backfilling, last acting [252,270,2147483647,261,2147483647,255,287,264]
> >     pg 2.2e is stuck undersized for 2030.849944, current state active+undersized+degraded, last acting [264,2147483647,251,245,257,286,261,258]
> >     pg 2.51 is stuck undersized for 1459.274671, current state active+undersized+degraded+remapped+backfilling, last acting [270,2147483647,2147483647,265,241,243,240,252]
> >     pg 2.52 is stuck undersized for 2030.850897, current state active+undersized+degraded+remapped+backfilling, last acting [240,2147483647,270,265,269,280,278,2147483647]
> >     pg 2.53 is stuck undersized for 1459.273517, current state active+undersized+degraded, last acting [261,2147483647,280,282,2147483647,245,243,241]
> >     pg 2.61 is stuck undersized for 2075.633140, current state active+undersized+degraded+remapped+backfilling, last acting [269,2147483647,258,286,270,255,2147483647,264]
> >     pg 2.62 is stuck undersized for 803.340577, current state active+undersized+degraded, last acting [2147483647,253,258,2147483647,250,287,264,284]
> >     pg 2.66 is stuck undersized for 803.341231, current state active+undersized+degraded, last acting [264,280,265,255,257,269,2147483647,270]
> >     pg 2.6c is stuck undersized for 963.369539, current state active+undersized+degraded, last acting [286,269,278,251,2147483647,273,2147483647,280]
> >     pg 2.70 is stuck undersized for 873.662725, current state active+undersized+degraded, last acting [2147483647,268,255,273,253,265,278,2147483647]
> >     pg 2.74 is stuck undersized for 2075.632312, current state active+undersized+degraded+remapped+backfilling, last acting [240,242,2147483647,245,243,269,2147483647,265]
> >     pg 3.24 is stuck undersized for 1570.800184, current state active+undersized+degraded, last acting [235,263]
> >     pg 3.25 is stuck undersized for 733.673503, current state undersized+degraded+peered, last acting [232]
> >     pg 3.28 is stuck undersized for 2610.307886, current state active+undersized+degraded, last acting [263,84]
> >     pg 3.2a is stuck undersized for 1214.710839, current state active+undersized+degraded, last acting [181,232]
> >     pg 3.2b is stuck undersized for 2075.630671, current state active+undersized+degraded, last acting [63,144]
> >     pg 3.52 is stuck undersized for 1570.777598, current state active+undersized+degraded, last acting [158,237]
> >     pg 3.54 is stuck undersized for 1350.257189, current state active+undersized+degraded, last acting [239,74]
> >     pg 3.55 is stuck undersized for 2592.642531, current state active+undersized+degraded, last acting [157,233]
> >     pg 3.5a is stuck undersized for 2075.608257, current state undersized+degraded+peered, last acting [168]
> >     pg 3.5c is stuck undersized for 733.674836, current state active+undersized+degraded, last acting [263,234]
> >     pg 3.5d is stuck undersized for 2610.307220, current state active+undersized+degraded, last acting [180,84]
> >     pg 3.5e is stuck undersized for 1710.756037, current state undersized+degraded+peered, last acting [146]
> >     pg 3.61 is stuck undersized for 1080.210021, current state active+undersized+degraded, last acting [168,239]
> >     pg 3.62 is stuck undersized for 831.217622, current state active+undersized+degraded, last acting [84,263]
> >     pg 3.63 is stuck undersized for 733.674204, current state active+undersized+degraded, last acting [263,232]
> >     pg 3.65 is stuck undersized for 1570.790824, current state active+undersized+degraded, last acting [63,84]
> >     pg 3.66 is stuck undersized for 733.682973, current state undersized+degraded+peered, last acting [63]
> >     pg 3.68 is stuck undersized for 1570.624462, current state active+undersized+degraded, last acting [229,148]
> >     pg 3.69 is stuck undersized for 1350.316213, current state undersized+degraded+peered, last acting [235]
> >     pg 3.6b is stuck undersized for 783.813654, current state undersized+degraded+peered, last acting [63]
> >     pg 3.6c is stuck undersized for 783.819083, current state undersized+degraded+peered, last acting [229]
> >     pg 3.6f is stuck undersized for 2610.321349, current state active+undersized+degraded, last acting [232,158]
> >     pg 3.72 is stuck undersized for 1350.358149, current state active+undersized+degraded, last acting [229,74]
> >     pg 3.73 is stuck undersized for 1570.788310, current state undersized+degraded+peered, last acting [234]
> >     pg 11.20 is stuck undersized for 733.682510, current state active+undersized+degraded, last acting [2147483647,239,87,2147483647,158,237,63,76]
> >     pg 11.26 is stuck undersized for 1914.334332, current state active+undersized+degraded, last acting [2147483647,237,2147483647,263,158,148,181,180]
> >     pg 11.2d is stuck undersized for 1350.365988, current state active+undersized+degraded, last acting [2147483647,2147483647,73,229,86,158,169,84]
> >     pg 11.54 is stuck undersized for 1914.398125, current state active+undersized+degraded, last acting [231,169,2147483647,229,84,85,237,63]
> >     pg 11.5b is stuck undersized for 2047.980719, current state active+undersized+degraded, last acting [86,237,168,263,144,1,229,2147483647]
> >     pg 11.5e is stuck undersized for 873.643661, current state active+undersized+degraded, last acting [181,2147483647,229,158,231,1,169,2147483647]
> >     pg 11.62 is stuck undersized for 1144.491696, current state active+undersized+degraded, last acting [2147483647,85,235,74,63,234,181,2147483647]
> >     pg 11.6f is stuck undersized for 873.646628, current state active+undersized+degraded, last acting [234,3,2147483647,158,180,63,2147483647,181]
> > SLOW_OPS 9788 slow ops, oldest one blocked for 2953 sec, daemons [osd.0,osd.100,osd.101,osd.112,osd.118,osd.133,osd.136,osd.142,osd.144,osd.145]... have slow ops.

I don't want to butt in, but I looked at your OSD log and saw these messages:

2020-05-05 15:28:09.593 7f2d9cf29700 0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.2 down, but it is still running
2020-05-05 15:28:09.593 7f2d9cf29700 0 log_channel(cluster) log [DBG] : map e112673 wrongly marked me down at e112634

As far as I know, this happens when an OSD is under stress, whether from IO load or saturated network communications. I typically inject a large recovery sleep value and see if the OSDs come back, like so:

ceph tell osd.* injectargs '--osd-recovery-sleep 1'
ceph tell osd.* injectargs '--osd-max-backfills 1'

Hope this helps.

--
Alex Gorbachev
Intelligent Systems Services Inc.
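For reference, a minimal sketch of applying those settings, verifying they took effect, and reverting them once the cluster settles. This assumes a standard Mimic-era ceph CLI; osd.2 is only the daemon from the log excerpt above, and the revert value is the stock default, not necessarily this cluster's original setting. Note that injectargs only changes the running daemons; a restart falls back to whatever is in ceph.conf:

    # Throttle recovery and backfill cluster-wide, as suggested above:
    ceph tell osd.* injectargs '--osd-recovery-sleep 1'
    ceph tell osd.* injectargs '--osd-max-backfills 1'

    # Confirm the change landed. 'ceph daemon' talks to the local admin
    # socket, so run this on the host where osd.2 actually lives:
    ceph daemon osd.2 config get osd_recovery_sleep
    ceph daemon osd.2 config get osd_max_backfills

    # Once OSDs stay up and PGs have peered, restore the previous value
    # (0 is the assumed default; check your own configuration first):
    ceph tell osd.* injectargs '--osd-recovery-sleep 0'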
> >
> > =================
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx