Re: OSDs cannot join cluster anymore

Malte Stroem <malte.stroem@xxxxxxxxx> · Wed, 21 Jun 2023 10:22:22 +0200

Hello Eugen,

thank you. Yesterday I thought: Well, Eugen can help!

Yes, we drained the nodes. It took two weeks for the process to finish, 
and yes, I think this is the root cause.

So we still have the nodes, but when I try to restart one of those OSDs, 
it still cannot join:

Jun 21 09:46:03 ceph-node bash[2323668]: Running command: /usr/bin/chown -h ceph:ceph /var/lib/ceph/osd/ceph-66/block
Jun 21 09:46:03 ceph-node bash[2323668]: Running command: /usr/bin/chown -R ceph:ceph /dev/dm-19
Jun 21 09:46:03 ceph-node bash[2323668]: Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-66
Jun 21 09:46:03 ceph-node bash[2323668]: --> ceph-volume lvm activate successful for osd ID: 66

Jun 21 09:51:04 ceph-node bash[2323668]: debug 2023-06-21T07:51:04.176+0000 7fabef5a1200  0 monclient(hunting): authenticate timed out after 300
Jun 21 09:56:04 ceph-node bash[2323668]: debug 2023-06-21T07:56:04.179+0000 7fabef5a1200  0 monclient(hunting): authenticate timed out after 300
Jun 21 10:01:04 ceph-node bash[2323668]: debug 2023-06-21T08:01:04.177+0000 7fabef5a1200  0 monclient(hunting): authenticate timed out after 300
Jun 21 10:06:04 ceph-node bash[2323668]: debug 2023-06-21T08:06:04.179+0000 7fabef5a1200  0 monclient(hunting): authenticate timed out after 300
Jun 21 10:11:04 ceph-node bash[2323668]: debug 2023-06-21T08:11:04.174+0000 7fabef5a1200  0 monclient(hunting): authenticate timed out after 300

Same messages on all OSDs.
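
For anyone reading along, here is a small Python sketch that checks how regularly the timeout recurs. It only parses the Ceph timestamps from the "authenticate timed out" lines quoted above; the names LOG_LINES and timeout_intervals are made up for this illustration and are not part of any Ceph tooling.

```python
from datetime import datetime

# The Ceph-side part of the five log entries quoted above (journald prefix dropped).
LOG_LINES = [
    "2023-06-21T07:51:04.176+0000 7fabef5a1200  0 monclient(hunting): authenticate timed out after 300",
    "2023-06-21T07:56:04.179+0000 7fabef5a1200  0 monclient(hunting): authenticate timed out after 300",
    "2023-06-21T08:01:04.177+0000 7fabef5a1200  0 monclient(hunting): authenticate timed out after 300",
    "2023-06-21T08:06:04.179+0000 7fabef5a1200  0 monclient(hunting): authenticate timed out after 300",
    "2023-06-21T08:11:04.174+0000 7fabef5a1200  0 monclient(hunting): authenticate timed out after 300",
]


def timeout_intervals(lines):
    """Return the rounded seconds between consecutive timeout entries."""
    stamps = []
    for line in lines:
        if "authenticate timed out" not in line:
            continue
        # The first whitespace-separated field is the ISO-like timestamp.
        stamps.append(datetime.strptime(line.split()[0], "%Y-%m-%dT%H:%M:%S.%f%z"))
    return [round((b - a).total_seconds()) for a, b in zip(stamps, stamps[1:])]


print(timeout_intervals(LOG_LINES))  # [300, 300, 300, 300]
```

The gaps match the 300 reported in the message itself, so the OSD appears to retry immediately after each timeout rather than backing off.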

We still have some nodes running and have not restarted the OSDs on them.

Best,
Malte

On 21.06.23 at 09:50, Eugen Block wrote:
Hi,
can you share more details about what exactly you did? How did you remove 
the nodes? Hopefully you waited for the draining to finish? If the 
remaining OSDs are waiting for removed OSDs, it sounds like the draining 
had not finished.

Quoting Malte Stroem <malte.stroem@xxxxxxxxx>:

Hello,

we removed some nodes from our cluster. This worked without problems.

Now, lots of OSDs do not want to join the cluster anymore if we reboot 
one of the still available nodes.

It always runs into timeouts:

--> ceph-volume lvm activate successful for osd ID: XX
monclient(hunting): authenticate timed out after 300

MONs and MGRs are running fine.

The network is working; netcat shows that the MONs' ports are open.
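
The netcat check can be reproduced with a short Python sketch like the one below. It assumes the MONs listen on the default ports 3300 (msgr2) and 6789 (msgr1); the helper names port_open and check_mons are hypothetical, and the host names passed in would be your own. Note that an open port only shows TCP reachability; it does not exercise the authentication exchange that is timing out.

```python
import socket

# Default Ceph monitor ports: 3300 for msgr2, 6789 for msgr1.
# Adjust if the cluster uses non-default ports (assumption).
MON_PORTS = (3300, 6789)


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def check_mons(hosts, ports=MON_PORTS):
    """Map each (host, port) pair to its reachability."""
    return {(host, port): port_open(host, port) for host in hosts for port in ports}
```

Calling check_mons() with a list of MON addresses from an OSD node returns a dictionary that shows at a glance which endpoints accept connections.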

Setting a higher debug level has no effect even if we add it to the 
ceph.conf file.

The PGs are pretty unhappy, e.g.:

7.143  87771  0  0  0  0  314744902235  0  0  10081  10081  down  2023-06-20T09:16:03.546158+0000  961275'1395646  961300:9605547  [209,NONE,NONE]  209  [209,NONE,NONE]  209  961231'1395512  2023-06-19T23:46:40.101791+0000  961231'1395512  2023-06-19T23:46:40.101791+0000
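
To make the row above easier to read, here is a rough Python sketch that pulls out the bracketed OSD lists and counts the NONE placeholders. It assumes the two bracketed lists are the up and acting sets, as in the usual PG dump layout; the function names osd_sets and missing_slots are invented for this example.

```python
import re

# The PG row quoted above, joined onto a single line.
PG_ROW = (
    "7.143 87771 0 0 0 0 314744902235 0 0 10081 10081 down "
    "2023-06-20T09:16:03.546158+0000 961275'1395646 961300:9605547 "
    "[209,NONE,NONE] 209 [209,NONE,NONE] 209 961231'1395512 "
    "2023-06-19T23:46:40.101791+0000 961231'1395512 "
    "2023-06-19T23:46:40.101791+0000"
)


def osd_sets(row):
    """Return every bracketed OSD list in the row as a list of strings."""
    return [match.split(",") for match in re.findall(r"\[([^\]]*)\]", row)]


def missing_slots(osd_set):
    """Count positions that have no OSD assigned (shown as NONE)."""
    return sum(1 for osd in osd_set if osd == "NONE")


for osd_set in osd_sets(PG_ROW):
    print(osd_set, "missing:", missing_slots(osd_set))
```

For this PG, both lists contain only OSD 209 plus two NONE entries, so two of the three positions have no OSD.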

The PG query suggests marking an OSD as lost; however, I do not want to do this.

OSDs are blocked by OSDs from the removed nodes:

ceph osd blocked-by
osd  num_blocked
152           38
244           41
144           54
...
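
With a longer list it helps to rank the blockers. The following Python sketch parses the tabular output shown above and sorts it by num_blocked; only the three rows quoted here are used as sample data, and parse_blocked_by and worst_blockers are hypothetical helper names.

```python
# Sample taken from the output above; the elided rows are not reproduced.
BLOCKED_BY = """\
osd  num_blocked
152           38
244           41
144           54
"""


def parse_blocked_by(text):
    """Parse 'osd num_blocked' rows into {osd_id: num_blocked}."""
    rows = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip the header and anything that is not two integers.
        if len(parts) == 2 and parts[0].isdigit() and parts[1].isdigit():
            rows[int(parts[0])] = int(parts[1])
    return rows


def worst_blockers(rows):
    """Return (osd_id, num_blocked) pairs, most-blocking OSD first."""
    return sorted(rows.items(), key=lambda item: item[1], reverse=True)


print(worst_blockers(parse_blocked_by(BLOCKED_BY)))
# [(144, 54), (244, 41), (152, 38)]
```

The OSD IDs at the top of that list can then be compared against the IDs that lived on the removed nodes.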

We added the removed hosts again and tried to start the OSDs on these 
nodes, and they also ran into the timeout mentioned above.

This is a containerized cluster running version 16.2.10.

Replication is 3; some pools use an erasure-coded profile.

Best regards,
Malte

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
