Hello Eugen,
recovery and rebalancing finished, but now all PGs show missing OSDs.
Everything looks as if the PGs are missing OSDs even though the process
completed correctly, as if we had shut the servers down abruptly.
But we removed the nodes the way it is described in the documentation.
We just added new disks, and they joined the cluster immediately.
So the old OSDs that were removed from the cluster are still available;
I restored osd.40, but it does not want to join the cluster.
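For context, "restored" here just means re-activating the OSD's still intact LVM volumes and starting its daemon again. Roughly, on a cephadm cluster, that looks like the following sketch (<fsid> is a placeholder; this only illustrates the procedure):

$ cephadm ceph-volume lvm list
# the OSD's logical volume and metadata are still present on the old node

$ systemctl restart ceph-<fsid>@osd.40.service
# the unit's startup runs "ceph-volume lvm activate" before booting the OSD

$ journalctl -fu ceph-<fsid>@osd.40.service
# watch whether it gets past activation and actually authenticates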
Following are the outputs of the mentioned commands:
ceph -s
cluster:
id: X
health: HEALTH_WARN
1 failed cephadm daemon(s)
1 filesystem is degraded
1 MDSs report slow metadata IOs
19 osds down
4 hosts (50 osds) down
Reduced data availability: 1220 pgs inactive
Degraded data redundancy: 132 pgs undersized
services:
mon: 3 daemons, quorum cephx02,cephx04,cephx06 (age 4m)
mgr: cephx02.xxxxxx(active, since 92s), standbys: cephx04.yyyyyy, cephx06.zzzzzz
mds: 2/2 daemons up, 2 standby
osd: 130 osds: 78 up (since 13m), 97 in (since 35m); 171 remapped pgs
rgw: 1 daemon active (1 hosts, 1 zones)
data:
volumes: 1/2 healthy, 1 recovering
pools: 12 pools, 1345 pgs
objects: 11.02k objects, 1.9 GiB
usage: 145 TiB used, 669 TiB / 814 TiB avail
pgs: 86.617% pgs unknown
4.089% pgs not active
39053/33069 objects misplaced (118.095%)
1165 unknown
77 active+undersized+remapped
55 undersized+remapped+peered
38 active+clean+remapped
10 active+clean
ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-21 4.36646 root ssds
-61 0.87329 host cephx01-ssd
186 ssd 0.87329 osd.186 down 1.00000 1.00000
-76 0.87329 host cephx02-ssd
263 ssd 0.87329 osd.263 up 1.00000 1.00000
-85 0.87329 host cephx04-ssd
237 ssd 0.87329 osd.237 up 1.00000 1.00000
-88 0.87329 host cephx06-ssd
236 ssd 0.87329 osd.236 up 1.00000 1.00000
-94 0.87329 host cephx08-ssd
262 ssd 0.87329 osd.262 down 1.00000 1.00000
-1 1347.07397 root default
-62 261.93823 host cephx01
139 hdd 10.91409 osd.139 down 0 1.00000
140 hdd 10.91409 osd.140 down 0 1.00000
142 hdd 10.91409 osd.142 down 0 1.00000
144 hdd 10.91409 osd.144 down 0 1.00000
146 hdd 10.91409 osd.146 down 0 1.00000
148 hdd 10.91409 osd.148 down 0 1.00000
150 hdd 10.91409 osd.150 down 0 1.00000
152 hdd 10.91409 osd.152 down 0 1.00000
154 hdd 10.91409 osd.154 down 1.00000 1.00000
156 hdd 10.91409 osd.156 down 1.00000 1.00000
158 hdd 10.91409 osd.158 down 1.00000 1.00000
160 hdd 10.91409 osd.160 down 1.00000 1.00000
162 hdd 10.91409 osd.162 down 1.00000 1.00000
164 hdd 10.91409 osd.164 down 1.00000 1.00000
166 hdd 10.91409 osd.166 down 1.00000 1.00000
168 hdd 10.91409 osd.168 down 1.00000 1.00000
170 hdd 10.91409 osd.170 down 1.00000 1.00000
172 hdd 10.91409 osd.172 down 1.00000 1.00000
174 hdd 10.91409 osd.174 down 1.00000 1.00000
176 hdd 10.91409 osd.176 down 1.00000 1.00000
178 hdd 10.91409 osd.178 down 1.00000 1.00000
180 hdd 10.91409 osd.180 down 1.00000 1.00000
182 hdd 10.91409 osd.182 down 1.00000 1.00000
184 hdd 10.91409 osd.184 down 1.00000 1.00000
-67 261.93823 host cephx02
138 hdd 10.91409 osd.138 up 1.00000 1.00000
141 hdd 10.91409 osd.141 up 1.00000 1.00000
143 hdd 10.91409 osd.143 up 1.00000 1.00000
145 hdd 10.91409 osd.145 up 1.00000 1.00000
147 hdd 10.91409 osd.147 up 1.00000 1.00000
149 hdd 10.91409 osd.149 up 1.00000 1.00000
151 hdd 10.91409 osd.151 up 1.00000 1.00000
153 hdd 10.91409 osd.153 up 1.00000 1.00000
155 hdd 10.91409 osd.155 up 1.00000 1.00000
157 hdd 10.91409 osd.157 up 1.00000 1.00000
159 hdd 10.91409 osd.159 up 1.00000 1.00000
161 hdd 10.91409 osd.161 up 1.00000 1.00000
163 hdd 10.91409 osd.163 up 1.00000 1.00000
165 hdd 10.91409 osd.165 up 1.00000 1.00000
167 hdd 10.91409 osd.167 up 1.00000 1.00000
169 hdd 10.91409 osd.169 up 1.00000 1.00000
171 hdd 10.91409 osd.171 up 1.00000 1.00000
173 hdd 10.91409 osd.173 up 1.00000 1.00000
175 hdd 10.91409 osd.175 up 1.00000 1.00000
177 hdd 10.91409 osd.177 up 1.00000 1.00000
179 hdd 10.91409 osd.179 up 1.00000 1.00000
181 hdd 10.91409 osd.181 up 1.00000 1.00000
183 hdd 10.91409 osd.183 up 1.00000 1.00000
185 hdd 10.91409 osd.185 up 1.00000 1.00000
-82 261.93823 host cephx04
189 hdd 10.91409 osd.189 up 1.00000 1.00000
191 hdd 10.91409 osd.191 up 1.00000 1.00000
193 hdd 10.91409 osd.193 up 1.00000 1.00000
195 hdd 10.91409 osd.195 up 1.00000 1.00000
197 hdd 10.91409 osd.197 up 1.00000 1.00000
199 hdd 10.91409 osd.199 up 1.00000 1.00000
201 hdd 10.91409 osd.201 up 1.00000 1.00000
202 hdd 10.91409 osd.202 up 1.00000 1.00000
204 hdd 10.91409 osd.204 up 1.00000 1.00000
206 hdd 10.91409 osd.206 up 1.00000 1.00000
208 hdd 10.91409 osd.208 up 1.00000 1.00000
210 hdd 10.91409 osd.210 up 1.00000 1.00000
212 hdd 10.91409 osd.212 up 1.00000 1.00000
214 hdd 10.91409 osd.214 up 1.00000 1.00000
217 hdd 10.91409 osd.217 up 1.00000 1.00000
219 hdd 10.91409 osd.219 up 1.00000 1.00000
221 hdd 10.91409 osd.221 up 1.00000 1.00000
223 hdd 10.91409 osd.223 up 1.00000 1.00000
225 hdd 10.91409 osd.225 up 1.00000 1.00000
227 hdd 10.91409 osd.227 up 1.00000 1.00000
229 hdd 10.91409 osd.229 up 1.00000 1.00000
231 hdd 10.91409 osd.231 up 1.00000 1.00000
233 hdd 10.91409 osd.233 up 1.00000 1.00000
235 hdd 10.91409 osd.235 up 1.00000 1.00000
-79 261.93823 host cephx06
188 hdd 10.91409 osd.188 up 1.00000 1.00000
190 hdd 10.91409 osd.190 up 1.00000 1.00000
192 hdd 10.91409 osd.192 up 1.00000 1.00000
194 hdd 10.91409 osd.194 up 1.00000 1.00000
196 hdd 10.91409 osd.196 up 1.00000 1.00000
198 hdd 10.91409 osd.198 up 1.00000 1.00000
200 hdd 10.91409 osd.200 up 1.00000 1.00000
203 hdd 10.91409 osd.203 up 1.00000 1.00000
205 hdd 10.91409 osd.205 up 1.00000 1.00000
207 hdd 10.91409 osd.207 up 1.00000 1.00000
209 hdd 10.91409 osd.209 up 1.00000 1.00000
211 hdd 10.91409 osd.211 up 1.00000 1.00000
213 hdd 10.91409 osd.213 up 1.00000 1.00000
215 hdd 10.91409 osd.215 up 1.00000 1.00000
216 hdd 10.91409 osd.216 up 1.00000 1.00000
218 hdd 10.91409 osd.218 up 1.00000 1.00000
220 hdd 10.91409 osd.220 up 1.00000 1.00000
222 hdd 10.91409 osd.222 up 1.00000 1.00000
224 hdd 10.91409 osd.224 up 1.00000 1.00000
226 hdd 10.91409 osd.226 up 1.00000 1.00000
228 hdd 10.91409 osd.228 up 1.00000 1.00000
230 hdd 10.91409 osd.230 up 1.00000 1.00000
232 hdd 10.91409 osd.232 up 1.00000 1.00000
234 hdd 10.91409 osd.234 down 1.00000 1.00000
-91 261.93823 host cephx08
238 hdd 10.91409 osd.238 down 0 1.00000
239 hdd 10.91409 osd.239 down 0 1.00000
240 hdd 10.91409 osd.240 down 0 1.00000
241 hdd 10.91409 osd.241 down 0 1.00000
242 hdd 10.91409 osd.242 down 0 1.00000
243 hdd 10.91409 osd.243 down 0 1.00000
244 hdd 10.91409 osd.244 down 0 1.00000
245 hdd 10.91409 osd.245 down 0 1.00000
246 hdd 10.91409 osd.246 down 0 1.00000
247 hdd 10.91409 osd.247 down 0 1.00000
248 hdd 10.91409 osd.248 down 0 1.00000
249 hdd 10.91409 osd.249 down 0 1.00000
250 hdd 10.91409 osd.250 down 0 1.00000
251 hdd 10.91409 osd.251 down 0 1.00000
252 hdd 10.91409 osd.252 down 0 1.00000
253 hdd 10.91409 osd.253 down 0 1.00000
254 hdd 10.91409 osd.254 down 0 1.00000
255 hdd 10.91409 osd.255 down 0 1.00000
256 hdd 10.91409 osd.256 down 0 1.00000
257 hdd 10.91409 osd.257 down 0 1.00000
258 hdd 10.91409 osd.258 down 0 1.00000
259 hdd 10.91409 osd.259 down 0 1.00000
260 hdd 10.91409 osd.260 down 0 1.00000
261 hdd 10.91409 osd.261 down 0 1.00000
-3 37.38275 host ceph06
40 1.00000 osd.40 down 0 1.00000
0 hdd 9.09569 osd.0 up 1.00000 1.00000
1 hdd 9.09569 osd.1 up 1.00000 1.00000
2 hdd 9.09569 osd.2 up 1.00000 1.00000
3 hdd 9.09569 osd.3 up 1.00000 1.00000
ceph health detail
HEALTH_WARN 1 failed cephadm daemon(s); 1 filesystem is degraded; 1
MDSs report slow metadata IOs; 19 osds down; 4 hosts (50 osds) down;
Reduced data availability: 1220 pgs inactive; Degraded data
redundancy: 132 pgs undersized
[WRN] CEPHADM_FAILED_DAEMON: 1 failed cephadm daemon(s)
daemon rgw.cephx06.xxxxxx on cephx06 is in error state
[WRN] FS_DEGRADED: 1 filesystem is degraded
fs cephfs is degraded
[WRN] MDS_SLOW_METADATA_IO: 1 MDSs report slow metadata IOs
mds.cephfs.cephx01.yyyyyy(mds.0): 2 slow metadata IOs are
blocked > 30 secs, oldest blocked for 1664 secs
[WRN] OSD_DOWN: 19 osds down
osd.154 (root=default,host=cephx01) is down
osd.156 (root=default,host=cephx01) is down
osd.158 (root=default,host=cephx01) is down
osd.160 (root=default,host=cephx01) is down
osd.162 (root=default,host=cephx01) is down
osd.164 (root=default,host=cephx01) is down
osd.166 (root=default,host=cephx01) is down
osd.168 (root=default,host=cephx01) is down
osd.170 (root=default,host=cephx01) is down
osd.172 (root=default,host=cephx01) is down
osd.174 (root=default,host=cephx01) is down
osd.176 (root=default,host=cephx01) is down
osd.178 (root=default,host=cephx01) is down
osd.180 (root=default,host=cephx01) is down
osd.182 (root=default,host=cephx01) is down
osd.184 (root=default,host=cephx01) is down
osd.186 (root=ssds,host=cephx01-ssd) is down
osd.234 (root=default,host=cephx06) is down
osd.262 (root=ssds,host=cephx08-ssd) is down
[WRN] OSD_HOST_DOWN: 4 hosts (50 osds) down
host cephx01-ssd (root=ssds) (1 osds) is down
host cephx01 (root=default) (24 osds) is down
host cephx08 (root=default) (24 osds) is down
host cephx08-ssd (root=ssds) (1 osds) is down
[WRN] PG_AVAILABILITY: Reduced data availability: 1220 pgs inactive
pg 7.3cd is stuck inactive for 13h, current state unknown, last acting []
pg 7.3ce is stuck inactive for 13h, current state unknown, last acting []
pg 7.3cf is stuck inactive for 13h, current state unknown, last acting []
pg 7.3d0 is stuck inactive for 13h, current state unknown, last acting []
pg 7.3d1 is stuck inactive for 13h, current state unknown, last acting []
pg 7.3d2 is stuck inactive for 13h, current state unknown, last acting []
pg 7.3d3 is stuck inactive for 13h, current state unknown, last acting []
pg 7.3d4 is stuck inactive for 13h, current state unknown, last acting []
pg 7.3d5 is stuck inactive for 13h, current state unknown, last acting []
pg 7.3d6 is stuck inactive for 13h, current state unknown, last acting []
pg 7.3d7 is stuck inactive for 13h, current state unknown, last acting []
pg 7.3d8 is stuck inactive for 13h, current state unknown, last acting []
pg 7.3d9 is stuck inactive for 13h, current state unknown, last acting []
pg 7.3da is stuck inactive for 13h, current state unknown, last acting []
pg 7.3db is stuck inactive for 13h, current state unknown, last acting []
pg 7.3dc is stuck inactive for 13h, current state unknown, last acting []
pg 7.3dd is stuck inactive for 13h, current state unknown, last acting []
pg 7.3de is stuck inactive for 13h, current state unknown, last acting []
pg 7.3df is stuck inactive for 13h, current state unknown, last acting []
pg 7.3e0 is stuck inactive for 13h, current state unknown, last acting []
pg 7.3e1 is stuck inactive for 13h, current state unknown, last acting []
pg 7.3e2 is stuck inactive for 13h, current state unknown, last acting []
pg 7.3e3 is stuck inactive for 13h, current state unknown, last acting []
pg 7.3e4 is stuck inactive for 13h, current state unknown, last acting []
pg 7.3e5 is stuck inactive for 13h, current state unknown, last acting []
pg 7.3e6 is stuck inactive for 13h, current state unknown, last acting []
pg 7.3e7 is stuck inactive for 13h, current state unknown, last acting []
pg 7.3e8 is stuck inactive for 13h, current state unknown, last acting []
pg 7.3e9 is stuck inactive for 13h, current state unknown, last acting []
pg 7.3ea is stuck inactive for 13h, current state unknown, last acting []
pg 7.3eb is stuck inactive for 13h, current state unknown, last acting []
pg 7.3ec is stuck inactive for 13h, current state unknown, last acting []
pg 7.3ed is stuck inactive for 13h, current state unknown, last acting []
pg 7.3ee is stuck inactive for 13h, current state unknown, last acting []
pg 7.3ef is stuck inactive for 13h, current state unknown, last acting []
pg 7.3f0 is stuck inactive for 13h, current state unknown, last acting []
pg 7.3f1 is stuck inactive for 13h, current state unknown, last acting []
pg 7.3f2 is stuck inactive for 13h, current state unknown, last acting []
pg 7.3f3 is stuck inactive for 13h, current state unknown, last acting []
pg 7.3f4 is stuck inactive for 13h, current state unknown, last acting []
pg 7.3f5 is stuck inactive for 13h, current state unknown, last acting []
pg 7.3f6 is stuck inactive for 13h, current state unknown, last acting []
pg 7.3f7 is stuck inactive for 13h, current state unknown, last acting []
pg 7.3f8 is stuck inactive for 13h, current state unknown, last acting []
pg 7.3f9 is stuck inactive for 13h, current state unknown, last acting []
pg 7.3fa is stuck inactive for 13h, current state unknown, last acting []
pg 7.3fb is stuck inactive for 13h, current state unknown, last acting []
pg 7.3fc is stuck inactive for 13h, current state unknown, last acting []
pg 7.3fd is stuck inactive for 13h, current state unknown, last acting []
pg 7.3fe is stuck inactive for 13h, current state unknown, last acting []
pg 7.3ff is stuck inactive for 13h, current state unknown, last acting []
[WRN] PG_DEGRADED: Degraded data redundancy: 132 pgs undersized
pg 1.0 is stuck undersized for 13h, current state
active+undersized+remapped, last acting [263,236]
pg 1.1 is stuck undersized for 13h, current state
active+undersized+remapped, last acting [263,236]
pg 1.7 is stuck undersized for 13h, current state
active+undersized+remapped, last acting [263,237]
pg 1.b is stuck undersized for 13h, current state
active+undersized+remapped, last acting [236,263]
pg 1.e is stuck undersized for 13h, current state
active+undersized+remapped, last acting [263,236]
pg 2.3 is stuck undersized for 13h, current state
undersized+remapped+peered, last acting [237]
pg 2.8 is stuck undersized for 13h, current state
active+undersized+remapped, last acting [263,236]
pg 2.9 is stuck undersized for 13h, current state
undersized+remapped+peered, last acting [236]
pg 2.a is stuck undersized for 13h, current state
active+undersized+remapped, last acting [236,263]
pg 2.b is stuck undersized for 13h, current state
active+undersized+remapped, last acting [263,237]
pg 2.d is stuck undersized for 13h, current state
undersized+remapped+peered, last acting [237]
pg 2.e is stuck undersized for 13h, current state
active+undersized+remapped, last acting [263,236]
pg 3.2 is stuck undersized for 13h, current state
active+undersized+remapped, last acting [236,263]
pg 3.4 is stuck undersized for 13h, current state
active+undersized+remapped, last acting [237,263]
pg 3.8 is stuck undersized for 13h, current state
undersized+remapped+peered, last acting [236]
pg 3.a is stuck undersized for 13h, current state
active+undersized+remapped, last acting [236,237]
pg 3.c is stuck undersized for 13h, current state
active+undersized+remapped, last acting [263,236]
pg 3.d is stuck undersized for 13h, current state
active+undersized+remapped, last acting [263,237]
pg 3.f is stuck undersized for 13h, current state
undersized+remapped+peered, last acting [236]
pg 4.3 is stuck undersized for 13h, current state
active+undersized+remapped, last acting [263,237]
pg 4.5 is stuck undersized for 13h, current state
undersized+remapped+peered, last acting [237]
pg 4.b is stuck undersized for 13h, current state
active+undersized+remapped, last acting [237,236]
pg 4.d is stuck undersized for 13h, current state
undersized+remapped+peered, last acting [236]
pg 4.e is stuck undersized for 13h, current state
undersized+remapped+peered, last acting [236]
pg 5.2 is stuck undersized for 13h, current state
undersized+remapped+peered, last acting [237]
pg 5.3 is stuck undersized for 13h, current state
undersized+remapped+peered, last acting [236]
pg 5.5 is stuck undersized for 13h, current state
undersized+remapped+peered, last acting [237]
pg 5.8 is stuck undersized for 13h, current state
undersized+remapped+peered, last acting [237]
pg 5.a is stuck undersized for 13h, current state
undersized+remapped+peered, last acting [237]
pg 5.b is stuck undersized for 13h, current state
active+undersized+remapped, last acting [236,263]
pg 5.c is stuck undersized for 13h, current state
undersized+remapped+peered, last acting [237]
pg 5.e is stuck undersized for 13h, current state
undersized+remapped+peered, last acting [237]
pg 5.f is stuck undersized for 13h, current state
undersized+remapped+peered, last acting [237]
pg 9.0 is stuck undersized for 13h, current state
undersized+remapped+peered, last acting [236]
pg 9.1 is stuck undersized for 13h, current state
active+undersized+remapped, last acting [236,237]
pg 9.2 is stuck undersized for 13h, current state
undersized+remapped+peered, last acting [237]
pg 9.3 is stuck undersized for 13h, current state
undersized+remapped+peered, last acting [237]
pg 9.5 is stuck undersized for 13h, current state
active+undersized+remapped, last acting [263,236]
pg 9.6 is stuck undersized for 13h, current state
undersized+remapped+peered, last acting [236]
pg 9.7 is stuck undersized for 13h, current state
undersized+remapped+peered, last acting [236]
pg 9.9 is stuck undersized for 13h, current state
active+undersized+remapped, last acting [263,237]
pg 10.0 is stuck undersized for 13h, current state
active+undersized+remapped, last acting [237,236]
pg 10.1 is stuck undersized for 13h, current state
undersized+remapped+peered, last acting [237]
pg 10.2 is stuck undersized for 13h, current state
undersized+remapped+peered, last acting [237]
pg 10.3 is stuck undersized for 13h, current state
undersized+remapped+peered, last acting [236]
pg 10.4 is stuck undersized for 13h, current state
undersized+remapped+peered, last acting [237]
pg 10.5 is stuck undersized for 13h, current state
undersized+remapped+peered, last acting [236]
pg 10.6 is stuck undersized for 13h, current state
active+undersized+remapped, last acting [263,236]
pg 10.7 is stuck undersized for 13h, current state
undersized+remapped+peered, last acting [236]
pg 10.a is stuck undersized for 13h, current state
undersized+remapped+peered, last acting [236]
pg 10.c is stuck undersized for 13h, current state
active+undersized+remapped, last acting [237,236]
Best,
Malte
On 21.06.23 at 10:31, Eugen Block wrote:
Hi,
Yes, we drained the nodes. It took two weeks for the process to
finish, and yes, I think this is the root cause.
So we still have the nodes, but when I try to restart one of those
OSDs, it still cannot join:
If the nodes were drained successfully (can you confirm that all
PGs were active+clean after draining, before you removed the nodes?),
then the disks on the removed nodes wouldn't have any data to bring
back. The question would be why the remaining OSDs still
reference the removed OSDs. Or am I misunderstanding something? I think
it would help to know the whole story; can you provide more
details? Some more general cluster info would also be helpful:
$ ceph -s
$ ceph osd tree
$ ceph health detail
Quoting Malte Stroem <malte.stroem@xxxxxxxxx>:
Hello Eugen,
thank you. Yesterday I thought: Well, Eugen can help!
Yes, we drained the nodes. It took two weeks for the process to
finish, and yes, I think this is the root cause.
So we still have the nodes, but when I try to restart one of those
OSDs, it still cannot join:
Jun 21 09:46:03 ceph-node bash[2323668]: Running command: /usr/bin/chown -h ceph:ceph /var/lib/ceph/osd/ceph-66/block
Jun 21 09:46:03 ceph-node bash[2323668]: Running command: /usr/bin/chown -R ceph:ceph /dev/dm-19
Jun 21 09:46:03 ceph-node bash[2323668]: Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-66
Jun 21 09:46:03 ceph-node bash[2323668]: --> ceph-volume lvm activate successful for osd ID: 66
Jun 21 09:51:04 ceph-node bash[2323668]: debug 2023-06-21T07:51:04.176+0000 7fabef5a1200 0 monclient(hunting): authenticate timed out after 300
Jun 21 09:56:04 ceph-node bash[2323668]: debug 2023-06-21T07:56:04.179+0000 7fabef5a1200 0 monclient(hunting): authenticate timed out after 300
Jun 21 10:01:04 ceph-node bash[2323668]: debug 2023-06-21T08:01:04.177+0000 7fabef5a1200 0 monclient(hunting): authenticate timed out after 300
Jun 21 10:06:04 ceph-node bash[2323668]: debug 2023-06-21T08:06:04.179+0000 7fabef5a1200 0 monclient(hunting): authenticate timed out after 300
Jun 21 10:11:04 ceph-node bash[2323668]: debug 2023-06-21T08:11:04.174+0000 7fabef5a1200 0 monclient(hunting): authenticate timed out after 300
Same messages on all OSDs.
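Since the daemons fail at authentication rather than at activation, the next thing I would double-check on an affected node is which mons the OSD tries to reach and which key it presents. A sketch (paths follow the usual cephadm layout; <fsid> is a placeholder, osd.66 is just the example from above):

# mon addresses the OSD container actually uses
$ cat /var/lib/ceph/<fsid>/osd.66/config

# local key vs. the key the cluster has stored for this OSD
$ cat /var/lib/ceph/<fsid>/osd.66/keyring
$ ceph auth get osd.66

# raw reachability of the mon ports (msgr2 / msgr1)
$ nc -zv cephx02 3300
$ nc -zv cephx02 6789

If the config still pointed at mon addresses that no longer exist, that would at least explain why the monclient keeps hunting.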
We still have some nodes running and did not restart those OSDs.
Best,
Malte
On 21.06.23 at 09:50, Eugen Block wrote:
Hi,
can you share more details about what exactly you did? How did you
remove the nodes? Hopefully you waited for the draining to
finish? If the remaining OSDs are waiting for removed OSDs, it sounds
like the draining had not finished.
Quoting Malte Stroem <malte.stroem@xxxxxxxxx>:
Hello,
we removed some nodes from our cluster. This worked without problems.
Now, lots of OSDs do not want to join the cluster anymore if we
reboot one of the still available nodes.
It always runs into timeouts:
--> ceph-volume lvm activate successful for osd ID: XX
monclient(hunting): authenticate timed out after 300
MONs and MGRs are running fine.
The network is working; netcat shows the MONs' ports are open.
Setting a higher debug level has no effect, even if we add it to
the ceph.conf file.
The PGs are pretty unhappy, e.g.:
7.143 87771 0 0 0 0 314744902235 0 0 10081 10081 down 2023-06-20T09:16:03.546158+0000 961275'1395646 961300:9605547 [209,NONE,NONE] 209 [209,NONE,NONE] 209 961231'1395512 2023-06-19T23:46:40.101791+0000 961231'1395512 2023-06-19T23:46:40.101791+0000
The PG query suggests marking an OSD as lost; however, I do not want to do that.
OSDs are blocked by OSDs from the removed nodes:
ceph osd blocked-by
osd num_blocked
152 38
244 41
144 54
...
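To see what exactly such a PG is waiting for, it can also be queried directly; a sketch using the PG from the example above (jq is only for readability, and the exact field names may differ between releases):

$ ceph pg 7.143 query | jq '.recovery_state'
# look for entries like "down_osds_we_would_probe" or "peering_blocked_by",
# which should name the removed OSDs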
We added the removed hosts again and tried to start their OSDs,
and they also ran into the timeout mentioned above.
This is a containerized cluster running version 16.2.10.
Replication is 3; some pools use an erasure-coded profile.
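Since erasure-coded PGs go inactive as soon as their acting set drops below min_size (which would fit the [209,NONE,NONE] acting set above), the pool settings are probably relevant here. A sketch for checking them (the profile name is a placeholder):

$ ceph osd pool ls detail
# shows size/min_size per pool and which erasure_code_profile an EC pool uses

$ ceph osd erasure-code-profile get <profile-name>
# shows k, m and the crush-failure-domain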
Best regards,
Malte
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx