Hi David,

After some investigation, it seems that it affects all OSD nodes: the same node is sometimes
announced through the cluster_network (10.114) and sometimes through the public_network (10.113).
Since each node has 12 OSDs, I suspect it could depend on the specific OSD. This seems to rule out
a network misconfiguration, because OSDs on the same node are announced sometimes through the
cluster network and sometimes through the public one. I will look for further information to verify
this, but right now I am inclined to think this could be a bug introduced in 16.2.6.

>> for n in {1..100}; do strace -f -e trace=network -s 10000 rbd info benchimage-test-c27-35-$n --pool glance-images --name client.glance 2>&1| grep sin_addr | egrep '(10.113.29|10.114.29)'| awk '{print $6}'; done | sort | uniq -c
     15 sin_addr=inet_addr("10.113.29.1")},
     18 sin_addr=inet_addr("10.113.29.10")},
     29 sin_addr=inet_addr("10.113.29.11")},
     12 sin_addr=inet_addr("10.113.29.12")},
     45 sin_addr=inet_addr("10.113.29.13")},
     27 sin_addr=inet_addr("10.113.29.14")},
    310 sin_addr=inet_addr("10.113.29.15")},
     29 sin_addr=inet_addr("10.113.29.16")},
     67 sin_addr=inet_addr("10.113.29.2")},
     15 sin_addr=inet_addr("10.113.29.3")},
     63 sin_addr=inet_addr("10.113.29.4")},
     24 sin_addr=inet_addr("10.113.29.5")},
     18 sin_addr=inet_addr("10.113.29.6")},
     24 sin_addr=inet_addr("10.113.29.7")},
     20 sin_addr=inet_addr("10.113.29.8")},
     39 sin_addr=inet_addr("10.113.29.9")},
      9 sin_addr=inet_addr("10.114.29.1")},
      3 sin_addr=inet_addr("10.114.29.10")},
      3 sin_addr=inet_addr("10.114.29.14")},
    294 sin_addr=inet_addr("10.114.29.2")},
     21 sin_addr=inet_addr("10.114.29.3")},
      3 sin_addr=inet_addr("10.114.29.4")},
     33 sin_addr=inet_addr("10.114.29.5")},
     25 sin_addr=inet_addr("10.114.29.8")},
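To verify this, my plan is to cross-check which addresses each OSD has actually registered in the
cluster map and compare them with what clients are being told. Something along these lines should
do it; it assumes the OSD metadata JSON still exposes the front_addr/back_addr fields, which I have
not double-checked on 16.2.6, so the grep pattern may need adjusting:

# Rough sketch, untested here: print the front (public) and back (cluster)
# addresses each OSD registers, to see whether only some OSDs advertise
# their 10.114.x.x address. Assumes "front_addr"/"back_addr" keys exist
# in the "ceph osd metadata" output on this release.
for id in $(ceph osd ls); do
    echo -n "osd.$id "
    ceph osd metadata "$id" | egrep -o '"(front|back)_addr": "[^"]*"' | tr '\n' ' '
    echo
done

If the 10.114 addresses show up in front_addr for only a subset of OSDs, that would match the
per-OSD pattern suggested by the counts above.
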
On Tue, 28 Sept 2021 at 18:15, David Caro <dcaro@xxxxxxxxxxxxx> wrote:
>
> Just curious, does it always happen with the same OSDs?
>
> On 09/28 16:14, Javier Cacheiro wrote:
> > Interestingly enough, this happens for some pools and not for others.
> >
> > For example, I have just realized that when trying to connect to another
> > pool the client is correctly directed to the OSD public_network address:
> >
> > >> strace -f -e trace=network -s 10000 rbd ls --pool cinder-volumes --name client.cinder 2>&1| grep sin_addr
> > [pid 2363212] connect(15, {sa_family=AF_INET, sin_port=htons(6816), sin_addr=inet_addr("*10.113.29.7*")}, 16) = 0
> >
> > But the same client listing the ephemeral-vms pool is directed to the OSD
> > cluster address:
> >
> > >> strace -f -e trace=network -s 10000 rbd ls --pool ephemeral-vms --name client.cinder 2>&1| grep sin_addr
> > [pid 2363485] connect(14, {sa_family=AF_INET, sin_port=htons(6806), sin_addr=inet_addr("*10.114.29.10*")}, 16) = -1 EINPROGRESS (Operation now in progress)
> >
> > Very weird!
> >
> > On Tue, 28 Sept 2021 at 16:02, Javier Cacheiro <Javier.Cacheiro@xxxxxxxxx> wrote:
> >
> > > Hi all,
> > >
> > > I am trying to understand an issue with ceph directing clients to connect
> > > to OSDs through their cluster_network address instead of their
> > > public_network address.
> > >
> > > I have configured a ceph cluster with a public and a cluster network:
> > >
> > > >> ceph config dump|grep network
> > > global  advanced  cluster_network  10.114.0.0/16
> > > mon     advanced  public_network   10.113.0.0/16
> > >
> > > I upgraded the cluster from 16.2.4 to 16.2.6.
> > >
> > > After that, I am seeing that ceph is directing clients to connect to the
> > > OSD's cluster_network address instead of their public_network address:
> > >
> > > >> strace -f -e trace=network -s 10000 rbd ls --pool ephemeral-vms --name client.cinder
> > > ....
> > > [pid 2353692] connect(14, {sa_family=AF_INET, sin_port=htons(6806), sin_addr=inet_addr("*10.114.29.10*")}, 16) = -1 EINPROGRESS (Operation now in progress)
> > >
> > > In this case the client hangs because it is not able to reach the
> > > address, since it is an internal address.
> > >
> > > This appeared after upgrading to 16.2.6, but I am not sure whether it was
> > > due to the upgrade or a hidden issue that surfaced when the nodes were
> > > rebooted.
> > >
> > > It could also be that I am missing something in the config, but this
> > > config was generated by the cephadm bootstrap command, not created by
> > > hand, and it worked before the upgrade/reboot, so I am pretty confident
> > > in it.
> > >
> > > What do you think: can this be a bug, or is it more likely a
> > > misconfiguration on my side?
> > >
> > > Thanks,
> > > Javier
> >
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@xxxxxxx
> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
> --
> David Caro
> SRE - Cloud Services
> Wikimedia Foundation <https://wikimediafoundation.org/>
> PGP Signature: 7180 83A2 AC8B 314F B4CE 1171 4071 C7E1 D262 69C3
>
> "Imagine a world in which every single human being can freely share in the
> sum of all knowledge. That's our commitment."

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx