Hi David,

After some investigation, it seems that it affects all OSD nodes: the same node is sometimes
announced through the cluster_network (10.114) and sometimes through the public_network (10.113).
Since each node has 12 OSDs, I suspect it could depend on the specific OSD. This seems to rule out
a network misconfiguration, because OSDs on the same node are announced sometimes through the
cluster network and sometimes through the public one. I will look for further information to verify
this, but right now I am inclined to think this could be a bug introduced in 16.2.6.

>> for n in {1..100}; do strace -f -e trace=network -s 10000 rbd info benchimage-test-c27-35-$n --pool glance-images --name client.glance 2>&1| grep sin_addr | egrep '(10.113.29|10.114.29)'| awk '{print $6}'; done | sort | uniq -c
     15 sin_addr=inet_addr("10.113.29.1")},
     18 sin_addr=inet_addr("10.113.29.10")},
     29 sin_addr=inet_addr("10.113.29.11")},
     12 sin_addr=inet_addr("10.113.29.12")},
     45 sin_addr=inet_addr("10.113.29.13")},
     27 sin_addr=inet_addr("10.113.29.14")},
    310 sin_addr=inet_addr("10.113.29.15")},
     29 sin_addr=inet_addr("10.113.29.16")},
     67 sin_addr=inet_addr("10.113.29.2")},
     15 sin_addr=inet_addr("10.113.29.3")},
     63 sin_addr=inet_addr("10.113.29.4")},
     24 sin_addr=inet_addr("10.113.29.5")},
     18 sin_addr=inet_addr("10.113.29.6")},
     24 sin_addr=inet_addr("10.113.29.7")},
     20 sin_addr=inet_addr("10.113.29.8")},
     39 sin_addr=inet_addr("10.113.29.9")},
      9 sin_addr=inet_addr("10.114.29.1")},
      3 sin_addr=inet_addr("10.114.29.10")},
      3 sin_addr=inet_addr("10.114.29.14")},
    294 sin_addr=inet_addr("10.114.29.2")},
     21 sin_addr=inet_addr("10.114.29.3")},
      3 sin_addr=inet_addr("10.114.29.4")},
     33 sin_addr=inet_addr("10.114.29.5")},
     25 sin_addr=inet_addr("10.114.29.8")},
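To verify this, my plan is to cross-check which addresses each OSD has actually registered in the
cluster map and compare them with what clients are being told. Something along these lines should
do it; it assumes the OSD metadata JSON still exposes the front_addr/back_addr fields, which I have
not double-checked on 16.2.6, so the grep pattern may need adjusting:

# Rough sketch, untested here: print the front (public) and back (cluster)
# addresses each OSD registers, to see whether only some OSDs advertise
# their 10.114.x.x address. Assumes "front_addr"/"back_addr" keys exist
# in the "ceph osd metadata" output on this release.
for id in $(ceph osd ls); do
    echo -n "osd.$id "
    ceph osd metadata "$id" | egrep -o '"(front|back)_addr": "[^"]*"' | tr '\n' ' '
    echo
done

If the 10.114 addresses show up in front_addr for only a subset of OSDs, that would match the
per-OSD pattern suggested by the counts above.
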
On Tue, 28 Sept 2021 at 18:15, David Caro <dcaro@xxxxxxxxxxxxx> wrote:
>
> Just curious, does it always happen with the same OSDs?
>
> On 09/28 16:14, Javier Cacheiro wrote:
> > Interestingly enough, this happens for some pools and not for others.
> >
> > For example, I have just realized that when trying to connect to another
> > pool the client is correctly directed to the OSD public_network address:
> >
> > >> strace -f -e trace=network -s 10000 rbd ls --pool cinder-volumes --name client.cinder 2>&1| grep sin_addr
> > [pid 2363212] connect(15, {sa_family=AF_INET, sin_port=htons(6816), sin_addr=inet_addr("*10.113.29.7*")}, 16) = 0
> >
> > But the same client listing the ephemeral-vms pool is directed to the OSD
> > cluster address:
> >
> > >> strace -f -e trace=network -s 10000 rbd ls --pool ephemeral-vms --name client.cinder 2>&1| grep sin_addr
> > [pid 2363485] connect(14, {sa_family=AF_INET, sin_port=htons(6806), sin_addr=inet_addr("*10.114.29.10*")}, 16) = -1 EINPROGRESS (Operation now in progress)
> >
> > Very weird!
> >
> > On Tue, 28 Sept 2021 at 16:02, Javier Cacheiro <Javier.Cacheiro@xxxxxxxxx> wrote:
> >
> > > Hi all,
> > >
> > > I am trying to understand an issue with ceph directing clients to connect
> > > to OSDs through their cluster_network address instead of their
> > > public_network address.
> > >
> > > I have configured a ceph cluster with a public and a cluster network:
> > >
> > > >> ceph config dump|grep network
> > > global  advanced  cluster_network  10.114.0.0/16
> > > mon     advanced  public_network   10.113.0.0/16
> > >
> > > I upgraded the cluster from 16.2.4 to 16.2.6.
> > >
> > > After that, I am seeing that ceph is directing clients to connect to the
> > > OSD's cluster_network address instead of their public_network address:
> > >
> > > >> strace -f -e trace=network -s 10000 rbd ls --pool ephemeral-vms --name client.cinder
> > > ....
> > > [pid 2353692] connect(14, {sa_family=AF_INET, sin_port=htons(6806), sin_addr=inet_addr("*10.114.29.10*")}, 16) = -1 EINPROGRESS (Operation now in progress)
> > >
> > > In this case the client hangs because it is not able to reach the
> > > address, since it is an internal address.
> > >
> > > This appeared after upgrading to 16.2.6, but I am not sure whether it was
> > > due to the upgrade or a hidden issue that surfaced when the nodes were
> > > rebooted.
> > >
> > > It could also be that I am missing something in the config, but this
> > > config was generated by the cephadm bootstrap command, not created by
> > > hand, and it worked before the upgrade/reboot, so I am pretty confident
> > > in it.
> > >
> > > What do you think: can this be a bug, or is it more likely a
> > > misconfiguration on my side?
> > >
> > > Thanks,
> > > Javier
> >
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@xxxxxxx
> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
> --
> David Caro
> SRE - Cloud Services
> Wikimedia Foundation <https://wikimediafoundation.org/>
> PGP Signature: 7180 83A2 AC8B 314F B4CE 1171 4071 C7E1 D262 69C3
>
> "Imagine a world in which every single human being can freely share in the
> sum of all knowledge. That's our commitment."

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx