Re: cephadm failing to add hosts despite a working SSH connection


 



Answering my own question... I hesitated to send this email to the list because the problem turned out not to be in Ceph itself but in a network configuration issue that Ceph was a victim of. I managed to find the cause: we use jumbo frames on all servers, but the VLAN shared by the Ceph servers and the RGWs goes through an intermediate (campus) network that does not seem to support jumbo frames (we were not aware of this). The problem did not appear when using the intranet address because the Ceph servers don't use jumbo frames on that network/interface (it is a 1 Gb management network, so there is no point in using jumbo frames). I cannot think of anything that Ceph could have reported to help diagnose this.
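For anyone hitting the same symptom, here is a minimal sketch of how such a path could be probed. It is not something cephadm does; it simply builds a Linux iputils `ping` command with the Don't Fragment bit set (`-M do`) and a payload sized for a given MTU. The helper names are hypothetical, the 9000-byte jumbo MTU is an assumption (the value used on our servers is not stated above), and it assumes IPv4 without IP options.

```python
import subprocess

IPV4_HEADER = 20  # bytes, assuming no IP options
ICMP_HEADER = 8   # bytes, ICMP echo header


def icmp_payload_for_mtu(mtu):
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    if mtu <= IPV4_HEADER + ICMP_HEADER:
        raise ValueError("MTU too small to carry an ICMP echo request")
    return mtu - IPV4_HEADER - ICMP_HEADER


def ping_command(host, mtu, count=3):
    """Build a Linux iputils ping command that forbids fragmentation."""
    # -M do : set the Don't Fragment bit, so oversized packets are not split
    # -s N  : ICMP payload size in bytes
    # -c N  : number of echo requests to send
    size = icmp_payload_for_mtu(mtu)
    return ["ping", "-M", "do", "-c", str(count), "-s", str(size), host]


def path_supports_mtu(host, mtu):
    """Return True if full-size, unfragmented packets get a reply from host."""
    result = subprocess.run(
        ping_command(host, mtu),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


# Example use from a cluster node (addresses taken from the log below):
#   path_supports_mtu("10.81.22.183", 1500)
#   path_supports_mtu("10.81.22.183", 9000)
```

If the standard-size probe succeeds while the jumbo-size probe fails, something on the path is probably not passing jumbo frames. Small packets getting through while full-size ones are silently dropped is a common signature of an MTU mismatch.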

Best regards,

Michel

On 25/10/2023 at 14:42, Michel Jouvin wrote:
Hi,

I'm struggling to add some hosts to our Quincy cluster with cephadm. "ceph orch host add host addr" fails with the famous "missing 2 required positional arguments: 'hostname' and 'addr'" because of bug https://tracker.ceph.com/issues/59081, but looking at the cephadm messages with "ceph -W cephadm", I can see:

--------

Log: Opening SSH connection to 10.81.22.183, port 22
[conn=736] Connected to SSH server at 10.81.22.183, port 22
[conn=736]   Local address: 10.81.22.151, port 53640
[conn=736]   Peer address: 10.81.22.183, port 22
[conn=736] Login timeout expired
[conn=736] Aborting connection
Traceback (most recent call last): (removed)
cephadm.ssh.HostConnectionError: Failed to connect to jc-rgw3 (10.81.22.183). Login timeout expired
Log: Opening SSH connection to 10.81.22.183, port 22
[conn=736] Connected to SSH server at 10.81.22.183, port 22
[conn=736]   Local address: 10.81.22.151, port 53640
[conn=736]   Peer address: 10.81.22.183, port 22
[conn=736] Login timeout expired
[conn=736] Aborting connection
--------

This seems very strange to me because "ssh -i /tmp/cephadm_identity_xxx 10.81.22.183" works fine when executed in the active mgr container.

The host I'm trying to add is an RGW that has 3 active network connections: the Ceph public network, our intranet network (used for managing the server) and the network of the application that will use the RGW. The problem seems to be related to this network configuration, as the main cluster servers (MONs, OSDs), which have only the 2 Ceph networks and the intranet one, don't suffer from it. In particular, what is strange is that I can successfully add the host if I use its intranet address rather than its Ceph public network one (10.81.22.183) in the cephadm command.

I have 3 hosts sharing the same network configuration and having the same problem.

Any hint or suggestion to troubleshoot this problem further would be highly appreciated!

Best regards,

Michel
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx





