Reminds me of https://tracker.ceph.com/issues/57007, which wasn't fixed in Pacific until 16.2.11, so this is probably just the result of a cephadm bug, unfortunately.

On Fri, Jun 23, 2023 at 5:16 PM Malte Stroem <malte.stroem@xxxxxxxxx> wrote:
> Hello Eugen,
>
> thanks.
>
> We found the cause.
>
> Somehow all
>
> /var/lib/ceph/fsid/osd.XX/config
>
> files on every host were still filled with expired information about the mons.
>
> So refreshing the files helped to bring the OSDs up again. Damn.
>
> All other configs for the mons, MDSs, RGWs and so on were up to date.
>
> I do not know why the OSD config files did not get refreshed; however, I guess
> something went wrong while draining the nodes we removed from the cluster.
>
> Best regards,
> Malte
>
> Am 21.06.23 um 22:11 schrieb Eugen Block:
> > I still can't really grasp what might have happened here. But could you
> > please clarify which of the down OSDs (or hosts) are supposed to be down
> > and which you're trying to bring back online? Obviously osd.40 is one of
> > your attempts. But what about the hosts cephx01 and cephx08? Are those
> > the ones refusing to start their OSDs? And the remaining up OSDs you
> > haven't touched yet, correct?
> > And regarding debug logs, you should set them with ceph config set, because
> > the local ceph.conf won't have an effect. It could help to have the
> > startup debug logs from one of the OSDs.
> >
> > Zitat von Malte Stroem <malte.stroem@xxxxxxxxx>:
> >
> >> Hello Eugen,
> >>
> >> recovery and rebalancing were finished; however, now all PGs show missing OSDs.
> >>
> >> Everything looks as if the PGs are missing OSDs, although the process finished correctly.
> >>
> >> As if we had shut down the servers immediately.
> >>
> >> But we removed the nodes the way it is described in the documentation.
> >>
> >> We just added new disks and they joined the cluster immediately.
> >>
> >> So the old OSDs removed from the cluster are still available; I restored osd.40,
> >> but it does not want to join the cluster.
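For readers hitting the same symptom, here is a minimal sketch of how checking and refreshing the cephadm-managed per-OSD config (and setting debug options centrally, as Eugen suggests) could look. The fsid, OSD ID and paths below are placeholders, and whether ceph orch reconfig is available depends on your release; treat this as an illustration, not the exact procedure used in this thread:

    # Compare the mon addresses the OSD daemon was handed with the current monmap
    cat /var/lib/ceph/<fsid>/osd.XX/config      # per-daemon config written by cephadm
    ceph mon dump                               # current mon addresses, run from an admin node

    # One possible way to refresh stale per-daemon config files
    ceph config generate-minimal-conf           # prints a minimal ceph.conf with the current mon_host
    ceph orch reconfig osd                      # ask cephadm to rewrite config files for the osd service
    systemctl restart ceph-<fsid>@osd.XX.service  # restart the daemon so it picks up the new mon list

    # Debug logging is set centrally; a local ceph.conf entry is not consulted by cephadm daemons
    ceph config set osd.XX debug_osd 20
    ceph config set osd.XX debug_monc 20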
> >>
> >> Following are the outputs of the mentioned commands:
> >>
> >> ceph -s
> >>
> >>   cluster:
> >>     id:     X
> >>     health: HEALTH_WARN
> >>             1 failed cephadm daemon(s)
> >>             1 filesystem is degraded
> >>             1 MDSs report slow metadata IOs
> >>             19 osds down
> >>             4 hosts (50 osds) down
> >>             Reduced data availability: 1220 pgs inactive
> >>             Degraded data redundancy: 132 pgs undersized
> >>
> >>   services:
> >>     mon: 3 daemons, quorum cephx02,cephx04,cephx06 (age 4m)
> >>     mgr: cephx02.xxxxxx(active, since 92s), standbys: cephx04.yyyyyy, cephx06.zzzzzz
> >>     mds: 2/2 daemons up, 2 standby
> >>     osd: 130 osds: 78 up (since 13m), 97 in (since 35m); 171 remapped pgs
> >>     rgw: 1 daemon active (1 hosts, 1 zones)
> >>
> >>   data:
> >>     volumes: 1/2 healthy, 1 recovering
> >>     pools:   12 pools, 1345 pgs
> >>     objects: 11.02k objects, 1.9 GiB
> >>     usage:   145 TiB used, 669 TiB / 814 TiB avail
> >>     pgs:     86.617% pgs unknown
> >>              4.089% pgs not active
> >>              39053/33069 objects misplaced (118.095%)
> >>              1165 unknown
> >>              77   active+undersized+remapped
> >>              55   undersized+remapped+peered
> >>              38   active+clean+remapped
> >>              10   active+clean
> >>
> >> ceph osd tree
> >>
> >> ID   CLASS  WEIGHT      TYPE NAME            STATUS  REWEIGHT  PRI-AFF
> >> -21         4.36646     root ssds
> >> -61         0.87329       host cephx01-ssd
> >> 186    ssd  0.87329         osd.186            down   1.00000  1.00000
> >> -76         0.87329       host cephx02-ssd
> >> 263    ssd  0.87329         osd.263              up   1.00000  1.00000
> >> -85         0.87329       host cephx04-ssd
> >> 237    ssd  0.87329         osd.237              up   1.00000  1.00000
> >> -88         0.87329       host cephx06-ssd
> >> 236    ssd  0.87329         osd.236              up   1.00000  1.00000
> >> -94         0.87329       host cephx08-ssd
> >> 262    ssd  0.87329         osd.262            down   1.00000  1.00000
> >>  -1       1347.07397     root default
> >> -62        261.93823       host cephx01
> >> 139    hdd  10.91409        osd.139            down         0  1.00000
> >> 140    hdd  10.91409        osd.140            down         0  1.00000
> >> 142    hdd  10.91409        osd.142            down         0  1.00000
> >> 144    hdd  10.91409        osd.144            down         0  1.00000
> >> 146    hdd  10.91409        osd.146            down         0  1.00000
> >> 148    hdd  10.91409        osd.148            down         0  1.00000
> >> 150    hdd  10.91409        osd.150            down         0  1.00000
> >> 152    hdd  10.91409        osd.152            down         0  1.00000
> >> 154    hdd  10.91409        osd.154            down   1.00000  1.00000
> >> 156    hdd  10.91409        osd.156            down   1.00000  1.00000
> >> 158    hdd  10.91409        osd.158            down   1.00000  1.00000
> >> 160    hdd  10.91409        osd.160            down   1.00000  1.00000
> >> 162    hdd  10.91409        osd.162            down   1.00000  1.00000
> >> 164    hdd  10.91409        osd.164            down   1.00000  1.00000
> >> 166    hdd  10.91409        osd.166            down   1.00000  1.00000
> >> 168    hdd  10.91409        osd.168            down   1.00000  1.00000
> >> 170    hdd  10.91409        osd.170            down   1.00000  1.00000
> >> 172    hdd  10.91409        osd.172            down   1.00000  1.00000
> >> 174    hdd  10.91409        osd.174            down   1.00000  1.00000
> >> 176    hdd  10.91409        osd.176            down   1.00000  1.00000
> >> 178    hdd  10.91409        osd.178            down   1.00000  1.00000
> >> 180    hdd  10.91409        osd.180            down   1.00000  1.00000
> >> 182    hdd  10.91409        osd.182            down   1.00000  1.00000
> >> 184    hdd  10.91409        osd.184            down   1.00000  1.00000
> >> -67        261.93823       host cephx02
> >> 138    hdd  10.91409        osd.138              up   1.00000  1.00000
> >> 141    hdd  10.91409        osd.141              up   1.00000  1.00000
> >> 143    hdd  10.91409        osd.143              up   1.00000  1.00000
> >> 145    hdd  10.91409        osd.145              up   1.00000  1.00000
> >> 147    hdd  10.91409        osd.147              up   1.00000  1.00000
> >> 149    hdd  10.91409        osd.149              up   1.00000  1.00000
> >> 151    hdd  10.91409        osd.151              up   1.00000  1.00000
> >> 153    hdd  10.91409        osd.153              up   1.00000  1.00000
> >> 155    hdd  10.91409        osd.155              up   1.00000  1.00000
> >> 157    hdd  10.91409        osd.157              up   1.00000  1.00000
> >> 159    hdd  10.91409        osd.159              up   1.00000  1.00000
> >> 161    hdd  10.91409        osd.161              up   1.00000  1.00000
> >> 163    hdd  10.91409        osd.163              up   1.00000  1.00000
> >> 165    hdd  10.91409        osd.165              up   1.00000  1.00000
> >> 167    hdd  10.91409        osd.167              up   1.00000  1.00000
> >> 169    hdd  10.91409        osd.169              up   1.00000  1.00000
> >> 171    hdd  10.91409        osd.171              up   1.00000  1.00000
> >> 173    hdd  10.91409        osd.173              up   1.00000  1.00000
> >> 175    hdd  10.91409        osd.175              up   1.00000  1.00000
> >> 177    hdd  10.91409        osd.177              up   1.00000  1.00000
> >> 179    hdd  10.91409        osd.179              up   1.00000  1.00000
> >> 181    hdd  10.91409        osd.181              up   1.00000  1.00000
> >> 183    hdd  10.91409        osd.183              up   1.00000  1.00000
> >> 185    hdd  10.91409        osd.185              up   1.00000  1.00000
> >> -82        261.93823       host cephx04
> >> 189    hdd  10.91409        osd.189              up   1.00000  1.00000
> >> 191    hdd  10.91409        osd.191              up   1.00000  1.00000
> >> 193    hdd  10.91409        osd.193              up   1.00000  1.00000
> >> 195    hdd  10.91409        osd.195              up   1.00000  1.00000
> >> 197    hdd  10.91409        osd.197              up   1.00000  1.00000
> >> 199    hdd  10.91409        osd.199              up   1.00000  1.00000
> >> 201    hdd  10.91409        osd.201              up   1.00000  1.00000
> >> 202    hdd  10.91409        osd.202              up   1.00000  1.00000
> >> 204    hdd  10.91409        osd.204              up   1.00000  1.00000
> >> 206    hdd  10.91409        osd.206              up   1.00000  1.00000
> >> 208    hdd  10.91409        osd.208              up   1.00000  1.00000
> >> 210    hdd  10.91409        osd.210              up   1.00000  1.00000
> >> 212    hdd  10.91409        osd.212              up   1.00000  1.00000
> >> 214    hdd  10.91409        osd.214              up   1.00000  1.00000
> >> 217    hdd  10.91409        osd.217              up   1.00000  1.00000
> >> 219    hdd  10.91409        osd.219              up   1.00000  1.00000
> >> 221    hdd  10.91409        osd.221              up   1.00000  1.00000
> >> 223    hdd  10.91409        osd.223              up   1.00000  1.00000
> >> 225    hdd  10.91409        osd.225              up   1.00000  1.00000
> >> 227    hdd  10.91409        osd.227              up   1.00000  1.00000
> >> 229    hdd  10.91409        osd.229              up   1.00000  1.00000
> >> 231    hdd  10.91409        osd.231              up   1.00000  1.00000
> >> 233    hdd  10.91409        osd.233              up   1.00000  1.00000
> >> 235    hdd  10.91409        osd.235              up   1.00000  1.00000
> >> -79        261.93823       host cephx06
> >> 188    hdd  10.91409        osd.188              up   1.00000  1.00000
> >> 190    hdd  10.91409        osd.190              up   1.00000  1.00000
> >> 192    hdd  10.91409        osd.192              up   1.00000  1.00000
> >> 194    hdd  10.91409        osd.194              up   1.00000  1.00000
> >> 196    hdd  10.91409        osd.196              up   1.00000  1.00000
> >> 198    hdd  10.91409        osd.198              up   1.00000  1.00000
> >> 200    hdd  10.91409        osd.200              up   1.00000  1.00000
> >> 203    hdd  10.91409        osd.203              up   1.00000  1.00000
> >> 205    hdd  10.91409        osd.205              up   1.00000  1.00000
> >> 207    hdd  10.91409        osd.207              up   1.00000  1.00000
> >> 209    hdd  10.91409        osd.209              up   1.00000  1.00000
> >> 211    hdd  10.91409        osd.211              up   1.00000  1.00000
> >> 213    hdd  10.91409        osd.213              up   1.00000  1.00000
> >> 215    hdd  10.91409        osd.215              up   1.00000  1.00000
> >> 216    hdd  10.91409        osd.216              up   1.00000  1.00000
> >> 218    hdd  10.91409        osd.218              up   1.00000  1.00000
> >> 220    hdd  10.91409        osd.220              up   1.00000  1.00000
> >> 222    hdd  10.91409        osd.222              up   1.00000  1.00000
> >> 224    hdd  10.91409        osd.224              up   1.00000  1.00000
> >> 226    hdd  10.91409        osd.226              up   1.00000  1.00000
> >> 228    hdd  10.91409        osd.228              up   1.00000  1.00000
> >> 230    hdd  10.91409        osd.230              up   1.00000  1.00000
> >> 232    hdd  10.91409        osd.232              up   1.00000  1.00000
> >> 234    hdd  10.91409        osd.234            down   1.00000  1.00000
> >> -91        261.93823       host cephx08
> >> 238    hdd  10.91409        osd.238            down         0  1.00000
> >> 239    hdd  10.91409        osd.239            down         0  1.00000
> >> 240    hdd  10.91409        osd.240            down         0  1.00000
> >> 241    hdd  10.91409        osd.241            down         0  1.00000
> >> 242    hdd  10.91409        osd.242            down         0  1.00000
> >> 243    hdd  10.91409        osd.243            down         0  1.00000
> >> 244    hdd  10.91409        osd.244            down         0  1.00000
> >> 245    hdd  10.91409        osd.245            down         0  1.00000
> >> 246    hdd  10.91409        osd.246            down         0  1.00000
> >> 247    hdd  10.91409        osd.247            down         0  1.00000
> >> 248    hdd  10.91409        osd.248            down         0  1.00000
> >> 249    hdd  10.91409        osd.249            down         0  1.00000
> >> 250    hdd  10.91409        osd.250            down         0  1.00000
> >> 251    hdd  10.91409        osd.251            down         0  1.00000
> >> 252    hdd  10.91409        osd.252            down         0  1.00000
> >> 253    hdd  10.91409        osd.253            down         0  1.00000
> >> 254    hdd  10.91409        osd.254            down         0  1.00000
> >> 255    hdd  10.91409        osd.255            down         0  1.00000
> >> 256    hdd  10.91409        osd.256            down         0  1.00000
> >> 257    hdd  10.91409        osd.257            down         0  1.00000
> >> 258    hdd  10.91409        osd.258            down         0  1.00000
> >> 259    hdd  10.91409        osd.259            down         0  1.00000
> >> 260    hdd  10.91409        osd.260            down         0  1.00000
> >> 261    hdd  10.91409        osd.261            down         0  1.00000
> >>  -3         37.38275       host ceph06
> >>  40          1.00000        osd.40             down         0  1.00000
> >>   0    hdd   9.09569        osd.0                up   1.00000  1.00000
> >>   1    hdd   9.09569        osd.1                up   1.00000  1.00000
> >>   2    hdd   9.09569        osd.2                up   1.00000  1.00000
> >>   3    hdd   9.09569        osd.3                up   1.00000  1.00000
> >>
> >> ceph health detail
> >>
> >> HEALTH_WARN 1 failed cephadm daemon(s); 1 filesystem is degraded; 1 MDSs report slow metadata IOs; 19 osds down; 4 hosts (50 osds) down; Reduced data availability: 1220 pgs inactive; Degraded data redundancy: 132 pgs undersized
> >> [WRN] CEPHADM_FAILED_DAEMON: 1 failed cephadm daemon(s)
> >>     daemon rgw.cephx06.xxxxxx on cephx06 is in error state
> >> [WRN] FS_DEGRADED: 1 filesystem is degraded
> >>     fs cephfs is degraded
> >> [WRN] MDS_SLOW_METADATA_IO: 1 MDSs report slow metadata IOs
> >>     mds.cephfs.cephx01.yyyyyy(mds.0): 2 slow metadata IOs are blocked > 30 secs, oldest blocked for 1664 secs
> >> [WRN] OSD_DOWN: 19 osds down
> >>     osd.154 (root=default,host=cephx01) is down
> >>     osd.156 (root=default,host=cephx01) is down
> >>     osd.158 (root=default,host=cephx01) is down
> >>     osd.160 (root=default,host=cephx01) is down
> >>     osd.162 (root=default,host=cephx01) is down
> >>     osd.164 (root=default,host=cephx01) is down
> >>     osd.166 (root=default,host=cephx01) is down
> >>     osd.168 (root=default,host=cephx01) is down
> >>     osd.170 (root=default,host=cephx01) is down
> >>     osd.172 (root=default,host=cephx01) is down
> >>     osd.174 (root=default,host=cephx01) is down
> >>     osd.176 (root=default,host=cephx01) is down
> >>     osd.178 (root=default,host=cephx01) is down
> >>     osd.180 (root=default,host=cephx01) is down
> >>     osd.182 (root=default,host=cephx01) is down
> >>     osd.184 (root=default,host=cephx01) is down
> >>     osd.186 (root=ssds,host=cephx01-ssd) is down
> >>     osd.234 (root=default,host=cephx06) is down
> >>     osd.262 (root=ssds,host=cephx08-ssd) is down
> >> [WRN] OSD_HOST_DOWN: 4 hosts (50 osds) down
> >>     host cephx01-ssd (root=ssds) (1 osds) is down
> >>     host cephx01 (root=default) (24 osds) is down
> >>     host cephx08 (root=default) (24 osds) is down
> >>     host cephx08-ssd (root=ssds) (1 osds) is down
> >> [WRN] PG_AVAILABILITY: Reduced data availability: 1220 pgs inactive
> >>     pg 7.3cd is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3ce is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3cf is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3d0 is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3d1 is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3d2 is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3d3 is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3d4 is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3d5 is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3d6 is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3d7 is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3d8 is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3d9 is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3da is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3db is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3dc is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3dd is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3de is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3df is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3e0 is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3e1 is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3e2 is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3e3 is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3e4 is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3e5 is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3e6 is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3e7 is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3e8 is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3e9 is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3ea is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3eb is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3ec is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3ed is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3ee is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3ef is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3f0 is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3f1 is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3f2 is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3f3 is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3f4 is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3f5 is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3f6 is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3f7 is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3f8 is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3f9 is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3fa is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3fb is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3fc is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3fd is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3fe is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3ff is stuck inactive for 13h, current state unknown, last acting []
> >> [WRN] PG_DEGRADED: Degraded data redundancy: 132 pgs undersized
> >>     pg 1.0 is stuck undersized for 13h, current state active+undersized+remapped, last acting [263,236]
> >>     pg 1.1 is stuck undersized for 13h, current state active+undersized+remapped, last acting [263,236]
> >>     pg 1.7 is stuck undersized for 13h, current state active+undersized+remapped, last acting [263,237]
> >>     pg 1.b is stuck undersized for 13h, current state active+undersized+remapped, last acting [236,263]
> >>     pg 1.e is stuck undersized for 13h, current state active+undersized+remapped, last acting [263,236]
> >>     pg 2.3 is stuck undersized for 13h, current state undersized+remapped+peered, last acting [237]
> >>     pg 2.8 is stuck undersized for 13h, current state active+undersized+remapped, last acting [263,236]
> >>     pg 2.9 is stuck undersized for 13h, current state undersized+remapped+peered, last acting [236]
> >>     pg 2.a is stuck undersized for 13h, current state active+undersized+remapped, last acting [236,263]
> >>     pg 2.b is stuck undersized for 13h, current state active+undersized+remapped, last acting [263,237]
> >>     pg 2.d is stuck undersized for 13h, current state undersized+remapped+peered, last acting [237]
> >>     pg 2.e is stuck undersized for 13h, current state active+undersized+remapped, last acting [263,236]
> >>     pg 3.2 is stuck undersized for 13h, current state active+undersized+remapped, last acting [236,263]
> >>     pg 3.4 is stuck undersized for 13h, current state active+undersized+remapped, last acting [237,263]
> >>     pg 3.8 is stuck undersized for 13h, current state undersized+remapped+peered, last acting [236]
> >>     pg 3.a is stuck undersized for 13h, current state active+undersized+remapped, last acting [236,237]
> >>     pg 3.c is stuck undersized for 13h, current state active+undersized+remapped, last acting [263,236]
> >>     pg 3.d is stuck undersized for 13h, current state active+undersized+remapped, last acting [263,237]
> >>     pg 3.f is stuck undersized for 13h, current state undersized+remapped+peered, last acting [236]
> >>     pg 4.3 is stuck undersized for 13h, current state active+undersized+remapped, last acting [263,237]
> >>     pg 4.5 is stuck undersized for 13h, current state undersized+remapped+peered, last acting [237]
> >>     pg 4.b is stuck undersized for 13h, current state active+undersized+remapped, last acting [237,236]
> >>     pg 4.d is stuck undersized for 13h, current state undersized+remapped+peered, last acting [236]
> >>     pg 4.e is stuck undersized for 13h, current state undersized+remapped+peered, last acting [236]
> >>     pg 5.2 is stuck undersized for 13h, current state undersized+remapped+peered, last acting [237]
> >>     pg 5.3 is stuck undersized for 13h, current state undersized+remapped+peered, last acting [236]
> >>     pg 5.5 is stuck undersized for 13h, current state undersized+remapped+peered, last acting [237]
> >>     pg 5.8 is stuck undersized for 13h, current state undersized+remapped+peered, last acting [237]
> >>     pg 5.a is stuck undersized for 13h, current state undersized+remapped+peered, last acting [237]
> >>     pg 5.b is stuck undersized for 13h, current state active+undersized+remapped, last acting [236,263]
> >>     pg 5.c is stuck undersized for 13h, current state undersized+remapped+peered, last acting [237]
> >>     pg 5.e is stuck undersized for 13h, current state undersized+remapped+peered, last acting [237]
> >>     pg 5.f is stuck undersized for 13h, current state undersized+remapped+peered, last acting [237]
> >>     pg 9.0 is stuck undersized for 13h, current state undersized+remapped+peered, last acting [236]
> >>     pg 9.1 is stuck undersized for 13h, current state active+undersized+remapped, last acting [236,237]
> >>     pg 9.2 is stuck undersized for 13h, current state undersized+remapped+peered, last acting [237]
> >>     pg 9.3 is stuck undersized for 13h, current state undersized+remapped+peered, last acting [237]
> >>     pg 9.5 is stuck undersized for 13h, current state active+undersized+remapped, last acting [263,236]
> >>     pg 9.6 is stuck undersized for 13h, current state undersized+remapped+peered, last acting [236]
> >>     pg 9.7 is stuck undersized for 13h, current state undersized+remapped+peered, last acting [236]
> >>     pg 9.9 is stuck undersized for 13h, current state active+undersized+remapped, last acting [263,237]
> >>     pg 10.0 is stuck undersized for 13h, current state active+undersized+remapped, last acting [237,236]
> >>     pg 10.1 is stuck undersized for 13h, current state undersized+remapped+peered, last acting [237]
> >>     pg 10.2 is stuck undersized for 13h, current state undersized+remapped+peered, last acting [237]
> >>     pg 10.3 is stuck undersized for 13h, current state undersized+remapped+peered, last acting [236]
> >>     pg 10.4 is stuck undersized for 13h, current state undersized+remapped+peered, last acting [237]
> >>     pg 10.5 is stuck undersized for 13h, current state undersized+remapped+peered, last acting [236]
> >>     pg 10.6 is stuck undersized for 13h, current state active+undersized+remapped, last acting [263,236]
> >>     pg 10.7 is stuck undersized for 13h, current state undersized+remapped+peered, last acting [236]
> >>     pg 10.a is stuck undersized for 13h, current state undersized+remapped+peered, last acting [236]
> >>     pg 10.c is stuck undersized for 13h, current state active+undersized+remapped, last acting [237,236]
> >>
> >> Best,
> >> Malte
> >>
> >> Am 21.06.23 um 10:31 schrieb Eugen Block:
> >>> Hi,
> >>>
> >>>> Yes, we drained the nodes. It needed two weeks to finish the
> >>>> process, and yes, I think this is the root cause.
> >>>> So we still have the nodes but when I try to restart one of those
> >>>> OSDs it still cannot join:
> >>>
> >>> if the nodes were drained successfully (can you confirm that all PGs
> >>> were active+clean after draining before you removed the nodes?) then
> >>> the disks on the removed nodes wouldn't have any data to bring back.
> >>> The question would be, why do the remaining OSDs still reference
> >>> removed OSDs. Or am I misunderstanding something? I think it would
> >>> help to know the whole story, can you provide more details?
> >>> Also some more general cluster info would be helpful:
> >>> $ ceph -s
> >>> $ ceph osd tree
> >>> $ ceph health detail
> >>>
> >>> Zitat von Malte Stroem <malte.stroem@xxxxxxxxx>:
> >>>
> >>>> Hello Eugen,
> >>>>
> >>>> thank you. Yesterday I thought: Well, Eugen can help!
> >>>>
> >>>> Yes, we drained the nodes. It took two weeks for the process to finish, and yes,
> >>>> I think this is the root cause.
> >>>>
> >>>> So we still have the nodes, but when I try to restart one of those OSDs, it still cannot join:
> >>>>
> >>>> Jun 21 09:46:03 ceph-node bash[2323668]: Running command: /usr/bin/chown -h ceph:ceph /var/lib/ceph/osd/ceph-66/block
> >>>> Jun 21 09:46:03 ceph-node bash[2323668]: Running command: /usr/bin/chown -R ceph:ceph /dev/dm-19
> >>>> Jun 21 09:46:03 ceph-node bash[2323668]: Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-66
> >>>> Jun 21 09:46:03 ceph-node bash[2323668]: --> ceph-volume lvm activate successful for osd ID: 66
> >>>> Jun 21 09:51:04 ceph-node bash[2323668]: debug 2023-06-21T07:51:04.176+0000 7fabef5a1200  0 monclient(hunting): authenticate timed out after 300
> >>>> Jun 21 09:56:04 ceph-node bash[2323668]: debug 2023-06-21T07:56:04.179+0000 7fabef5a1200  0 monclient(hunting): authenticate timed out after 300
> >>>> Jun 21 10:01:04 ceph-node bash[2323668]: debug 2023-06-21T08:01:04.177+0000 7fabef5a1200  0 monclient(hunting): authenticate timed out after 300
> >>>> Jun 21 10:06:04 ceph-node bash[2323668]: debug 2023-06-21T08:06:04.179+0000 7fabef5a1200  0 monclient(hunting): authenticate timed out after 300
> >>>> Jun 21 10:11:04 ceph-node bash[2323668]: debug 2023-06-21T08:11:04.174+0000 7fabef5a1200  0 monclient(hunting): authenticate timed out after 300
> >>>>
> >>>> Same messages on all OSDs.
> >>>>
> >>>> We still have some nodes running and did not restart those OSDs.
> >>>>
> >>>> Best,
> >>>> Malte
> >>>>
> >>>> Am 21.06.23 um 09:50 schrieb Eugen Block:
> >>>>> Hi,
> >>>>> can you share more details about what exactly you did? How did you remove
> >>>>> the nodes? Hopefully, you waited for the draining to finish? But if
> >>>>> the remaining OSDs wait for removed OSDs, it sounds like the
> >>>>> draining was not finished.
> >>>>>
> >>>>> Zitat von Malte Stroem <malte.stroem@xxxxxxxxx>:
> >>>>>
> >>>>>> Hello,
> >>>>>>
> >>>>>> we removed some nodes from our cluster. This worked without problems.
> >>>>>>
> >>>>>> Now, lots of OSDs do not want to join the cluster anymore if we
> >>>>>> reboot one of the still available nodes.
> >>>>>>
> >>>>>> It always runs into timeouts:
> >>>>>>
> >>>>>> --> ceph-volume lvm activate successful for osd ID: XX
> >>>>>> monclient(hunting): authenticate timed out after 300
> >>>>>>
> >>>>>> MONs and MGRs are running fine.
> >>>>>>
> >>>>>> The network is working; netcat shows the MONs' ports are open.
> >>>>>>
> >>>>>> Setting a higher debug level has no effect, even if we add it to
> >>>>>> the ceph.conf file.
> >>>>>>
> >>>>>> The PGs are pretty unhappy, e.g.:
> >>>>>>
> >>>>>> 7.143  87771  0  0  0  0  314744902235  0  0  10081  10081  down  2023-06-20T09:16:03.546158+0000  961275'1395646  961300:9605547  [209,NONE,NONE]  209  [209,NONE,NONE]  209  961231'1395512  2023-06-19T23:46:40.101791+0000  961231'1395512  2023-06-19T23:46:40.101791+0000
> >>>>>>
> >>>>>> PG query wants us to mark an OSD as lost; however, I do not want to do this.
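In hindsight, the repeated "monclient(hunting): authenticate timed out" lines fit the stale mon information found later in the thread. A quick sanity check (a sketch only; <fsid>, osd.XX and <mon-ip> are placeholders) is to compare the mon_host the daemon was started with against the live monmap and to test the mon ports from the affected host:

    # On the affected host: which mons does this OSD try to contact?
    grep mon_host /var/lib/ceph/<fsid>/osd.XX/config

    # On a working admin node: which mons actually exist right now?
    ceph mon dump

    # Basic reachability of the msgr v2/v1 mon ports from the affected host
    nc -zv <mon-ip> 3300
    nc -zv <mon-ip> 6789

If the two mon lists differ, the daemon will keep hunting addresses that no longer host a mon and time out exactly like this.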
> >>>>>>
> >>>>>> OSDs are blocked by OSDs from the removed nodes:
> >>>>>>
> >>>>>> ceph osd blocked-by
> >>>>>> osd  num_blocked
> >>>>>> 152           38
> >>>>>> 244           41
> >>>>>> 144           54
> >>>>>> ...
> >>>>>>
> >>>>>> We added the removed hosts again and tried to start the OSDs on
> >>>>>> this node, and they also failed with the timeout mentioned above.
> >>>>>>
> >>>>>> This is a containerized cluster running version 16.2.10.
> >>>>>>
> >>>>>> Replication is 3; some pools use an erasure-coded profile.
> >>>>>>
> >>>>>> Best regards,
> >>>>>> Malte
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx