Re: cephadm failing to add hosts despite a working SSH connection

Michel Jouvin <michel.jouvin@xxxxxxxxxxxxxxx> · Wed, 25 Oct 2023 18:40:52 +0200

Replying to myself... I hesitated to send this email to the list because
the problem didn't seem to be related to Ceph itself but rather to a
configuration problem that Ceph was a victim of. I managed to find the
cause: we are using jumbo frames on all servers, but the VLAN shared by
the servers and the RGWs goes through an intermediate (campus) network
that doesn't seem to support jumbo frames (we were not aware of this).
The problem did not appear when using the intranet address because the
Ceph servers don't use jumbo frames on that network/interface (it is a
1 Gb management network, so there is no point in using jumbo frames). I
cannot think of anything that Ceph could have reported to help diagnose
this.
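
For anyone hitting the same symptom (SSH connection established, then a
login timeout), a path MTU mismatch can be checked with pings that set
the Don't Fragment bit. The sketch below is not what I ran during the
investigation, only an illustration. It assumes IPv4 without IP options,
the Linux iputils ping options (-M do, -s, -c) and a jumbo MTU of 9000
bytes, which is a common value but not one stated in this thread. The
helper names are hypothetical.

```python
import shlex
from typing import List

IPV4_HEADER = 20  # bytes, IPv4 header without options (assumption)
ICMP_HEADER = 8   # bytes, ICMP echo request header


def icmp_payload_for_mtu(mtu: int) -> int:
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    overhead = IPV4_HEADER + ICMP_HEADER
    if mtu <= overhead:
        raise ValueError(f"MTU must be larger than {overhead} bytes")
    return mtu - overhead


def df_ping_command(host: str, mtu: int, count: int = 3) -> List[str]:
    """Build an iputils ping command that forbids fragmentation (-M do)."""
    return [
        "ping", "-M", "do",
        "-c", str(count),
        "-s", str(icmp_payload_for_mtu(mtu)),
        host,
    ]


if __name__ == "__main__":
    # 9000 is an assumed jumbo MTU; 1500 is the standard Ethernet MTU.
    for mtu in (9000, 1500):
        print(shlex.join(df_ping_command("10.81.22.183", mtu)))
```

If the command built for MTU 1500 gets replies while the one for the
jumbo MTU does not, some hop on the path probably cannot carry the
larger frames.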

Best regards,

Michel

On 25/10/2023 at 14:42, Michel Jouvin wrote:
Hi,

I'm struggling to add some hosts to our Quincy cluster with cephadm.
"ceph orch host add host addr" fails with the famous "missing 2
required positional arguments: 'hostname' and 'addr'" because of bug
https://tracker.ceph.com/issues/59081, but looking at the cephadm
messages with "ceph -W cephadm", I can see:

--------

Log: Opening SSH connection to 10.81.22.183, port 22
[conn=736] Connected to SSH server at 10.81.22.183, port 22
[conn=736]   Local address: 10.81.22.151, port 53640
[conn=736]   Peer address: 10.81.22.183, port 22
[conn=736] Login timeout expired
[conn=736] Aborting connection
Traceback (most recent call last): (removed)
cephadm.ssh.HostConnectionError: Failed to connect to jc-rgw3 
(10.81.22.183). Login timeout expired
Log: Opening SSH connection to 10.81.22.183, port 22
[conn=736] Connected to SSH server at 10.81.22.183, port 22
[conn=736]   Local address: 10.81.22.151, port 53640
[conn=736]   Peer address: 10.81.22.183, port 22
[conn=736] Login timeout expired
[conn=736] Aborting connection
--------

This is very strange to me because "ssh -i /tmp/cephadm_identity_xxx
10.81.22.183" works fine when executed in the active mgr container.

The host I'm trying to add is an RGW that has 3 active network
connections: the Ceph public network, our intranet network (used for
managing the server) and the network of the application that will use
the RGW. The problem seems to be somehow related to this network
configuration, as the main cluster servers (MONs, OSDs), which have only
the 2 Ceph networks and the intranet one, don't suffer from it. In
particular, what is strange is that I can successfully add the host if
I use its intranet address rather than the Ceph public network one
(10.81.22.183) in the cephadm command.

I have 3 hosts sharing the same network configuration and having the 
same problem.

Any hint or suggestion to troubleshoot this problem further would be
highly appreciated!

Best regards,

Michel
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx