Re: ceph ignoring cluster/public_network when initiating TCP connections

Liviu,

With all due respect, the settings I suggested should cause the kernel to always pick the correct source IP for a given destination IP, even when both NICs are connected to the same physical subnet - except perhaps if you have a default route on your private interface. You should have only one default route, assigned to your public interface.

Would you be willing to post the output of 'ip route' for one of your nodes, and maybe for one of your clients?
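For anyone following along, the thing to look for in that output is more than one default route, or a route toward the cluster network via the wrong interface. A hypothetical example (interface names and the sample output are made up for illustration):

```shell
# Show the routing table; there should be exactly one 'default' line,
# and it should point out the public interface.
ip route

# Hypothetical problem output: two default routes, so the kernel may
# pick either interface (and that interface's source IP) for outbound
# connections:
#   default via 10.2.0.254 dev eth0
#   default via 10.2.1.254 dev eth1
#   10.2.0.0/24 dev eth0 proto kernel scope link src 10.2.0.2
#   10.2.1.0/24 dev eth1 proto kernel scope link src 10.2.1.1

# 'ip route get' shows which route and source IP the kernel would
# actually use for a given destination:
ip route get 10.2.1.2
```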

Another note: the last time I used NAT on a server with a lot of TCP connections, I ran into performance problems due to the conntrack table. While that was many kernels ago, the principle is that once any NAT rule is added, the kernel has to create an entry in the conntrack table for *every* TCP connection and then perform a lookup for every packet. At that time the conntrack table size was fixed and needed to be expanded, but then the lookups got even slower.
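On recent kernels the limit is tunable at runtime, so it is easy to sanity-check table usage on a box that already has NAT rules loaded (the /proc paths only exist once the conntrack module is in use; the value below is illustrative):

```shell
# Current number of tracked connections vs. the configured maximum.
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max

# Raise the maximum if the count is approaching it (takes effect
# immediately; persist via /etc/sysctl.d/ to survive a reboot):
sysctl -w net.netfilter.nf_conntrack_max=262144
```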

-Dave

Dave Hall
Binghamton University
kdhall@xxxxxxxxxxxxxx

On 3/23/2020 12:21 AM, Liviu Sas wrote:
Hi Dave,

Thank you for the answer.

Unfortunately, the issue is that ceph uses the wrong source IP address and sends the traffic out the wrong interface anyway. It would be good if ceph could explicitly set the source IP address to the cluster/public IP when initiating a TCP connection.

I managed to come up with a workaround by source-NATing the ceph traffic to the desired IP address in the POSTROUTING chain of the nat table.

eg: node1:
iptables -t nat -A POSTROUTING -s 10.2.0.2 -d 10.2.1.0/24 -j SNAT --to 10.2.1.1
iptables -t nat -A POSTROUTING -s 10.2.0.5 -d 10.2.1.0/24 -j SNAT --to 10.2.1.1

node2:
iptables -t nat -A POSTROUTING -s 10.2.0.6 -d 10.2.1.0/24 -j SNAT --to 10.2.1.2
iptables -t nat -A POSTROUTING -s 10.2.0.9 -d 10.2.1.0/24 -j SNAT --to 10.2.1.2

node3:
iptables -t nat -A POSTROUTING -s 10.2.0.10 -d 10.2.1.0/24 -j SNAT --to 10.2.1.3
iptables -t nat -A POSTROUTING -s 10.2.0.1 -d 10.2.1.0/24 -j SNAT --to 10.2.1.3

Where 10.2.0.x is the IP address of the interfaces that should not be used.
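As a possible NAT-free alternative (untested here, interface name hypothetical), the kernel's source-address choice can also be steered with a per-route src hint, which avoids the conntrack overhead mentioned earlier:

```shell
# On node1: tell the kernel that traffic routed toward 10.2.1.0/24
# should use 10.2.1.1 as its source address. 'eth1' is a placeholder
# for whichever interface actually carries the 10.2.1.0/24 network.
ip route replace 10.2.1.0/24 dev eth1 src 10.2.1.1
```

Note that the src hint only applies when the application does not bind an explicit source address before connecting - which matches the ceph behavior described in this thread - and only to traffic that actually matches this route ('ip route get 10.2.1.2' will show whether it does).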

I still need to test it thoroughly, though.


On Mon, Mar 23, 2020 at 4:59 PM Dave Hall <kdhall@xxxxxxxxxxxxxx> wrote:

    Liviu,

    I've found that for Linux systems with multiple NICs, the default
    kernel settings allow the behavior you're seeing. To prevent this I
    always add the following to my /etc/sysctl settings, usually in
    /etc/sysctl.d/rp_filter.conf:

        net.ipv4.conf.default.rp_filter=1
        net.ipv4.conf.all.rp_filter=1

        net.ipv4.conf.all.arp_ignore=1
        net.ipv4.conf.all.arp_announce=2

    The rp_filter lines enable reverse-path filtering, which keeps
    packets going in and out of the interface that matches the IP. The
    two ARP lines make sure that only the correct interface responds to
    ARP requests.
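    (For reference: after creating that file, the settings can be
    applied without a reboot, assuming a distribution with a standard
    sysctl utility:)

```shell
# Load every file under /etc/sysctl.d/, including rp_filter.conf:
sysctl --system

# Or apply just the one file:
sysctl -p /etc/sysctl.d/rp_filter.conf
```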

    -Dave

    Dave Hall
    Binghamton University
    kdhall@xxxxxxxxxxxxxx

    On 3/22/2020 8:03 PM, Liviu Sas wrote:
    > Hello,
    >
    > While testing our ceph cluster setup, I noticed a possible issue
    > with the cluster/public network configuration being ignored for
    > TCP session initiation.
    >
    > Looks like the daemons (mon/mgr/mds/osd) are all listening on the
    > right IP address but are initiating TCP sessions from the wrong
    > interfaces.
    > Would it be possible to force ceph daemons to use the
    > cluster/public IP addresses to initiate new TCP connections,
    > instead of letting the kernel choose?
    >
    > Some details below:
    >
    > We set everything up to use our "10.2.1.0/24" network:
    > 10.2.1.x (x = node number 1, 2, 3)
    > But we can see TCP sessions being initiated from the "10.2.0.0/24"
    > network.
    >
    > So the daemons are listening to the right IP addresses.
    > root@nbs-vp-01:~# lsof -nPK i | grep ceph | grep LISTE
    > ceph-mds  1541648  ceph  16u  IPv4  8169344  0t0  TCP 10.2.1.1:6800 (LISTEN)
    > ceph-mds  1541648  ceph  17u  IPv4  8169346  0t0  TCP 10.2.1.1:6801 (LISTEN)
    > ceph-mgr  1541654  ceph  25u  IPv4  8163039  0t0  TCP 10.2.1.1:6810 (LISTEN)
    > ceph-mgr  1541654  ceph  27u  IPv4  8163051  0t0  TCP 10.2.1.1:6811 (LISTEN)
    > ceph-mon  1541703  ceph  27u  IPv4  8170914  0t0  TCP 10.2.1.1:3300 (LISTEN)
    > ceph-mon  1541703  ceph  28u  IPv4  8170915  0t0  TCP 10.2.1.1:6789 (LISTEN)
    > ceph-osd  1541711  ceph  16u  IPv4  8169353  0t0  TCP 10.2.1.1:6802 (LISTEN)
    > ceph-osd  1541711  ceph  17u  IPv4  8169357  0t0  TCP 10.2.1.1:6803 (LISTEN)
    > ceph-osd  1541711  ceph  18u  IPv4  8169362  0t0  TCP 10.2.1.1:6804 (LISTEN)
    > ceph-osd  1541711  ceph  19u  IPv4  8169368  0t0  TCP 10.2.1.1:6805 (LISTEN)
    > ceph-osd  1541711  ceph  20u  IPv4  8169375  0t0  TCP 10.2.1.1:6806 (LISTEN)
    > ceph-osd  1541711  ceph  21u  IPv4  8169383  0t0  TCP 10.2.1.1:6807 (LISTEN)
    > ceph-osd  1541711  ceph  22u  IPv4  8169392  0t0  TCP 10.2.1.1:6808 (LISTEN)
    > ceph-osd  1541711  ceph  23u  IPv4  8169402  0t0  TCP 10.2.1.1:6809 (LISTEN)
    >
    > Sessions to the other nodes use the wrong IP address:
    >
    > root@nbs-vp-01:~# lsof -nPK i | grep ceph | grep 10.2.1.2
    > ceph-mds  1541648  ceph  28u  IPv4  8279520  0t0  TCP 10.2.0.2:44180->10.2.1.2:6800 (ESTABLISHED)
    > ceph-mgr  1541654  ceph  41u  IPv4  8289842  0t0  TCP 10.2.0.2:44146->10.2.1.2:6800 (ESTABLISHED)
    > ceph-mon  1541703  ceph  40u  IPv4  8174827  0t0  TCP 10.2.0.2:40864->10.2.1.2:3300 (ESTABLISHED)
    > ceph-osd  1541711  ceph  65u  IPv4  8171035  0t0  TCP 10.2.0.2:58716->10.2.1.2:6804 (ESTABLISHED)
    > ceph-osd  1541711  ceph  66u  IPv4  8172960  0t0  TCP 10.2.0.2:54586->10.2.1.2:6806 (ESTABLISHED)
    > root@nbs-vp-01:~# lsof -nPK i | grep ceph | grep 10.2.1.3
    > ceph-mds  1541648  ceph  30u  IPv4  8292421  0t0  TCP 10.2.0.2:45710->10.2.1.3:6802 (ESTABLISHED)
    > ceph-mon  1541703  ceph  46u  IPv4  8173025  0t0  TCP 10.2.0.2:40164->10.2.1.3:3300 (ESTABLISHED)
    > ceph-osd  1541711  ceph  67u  IPv4  8173043  0t0  TCP 10.2.0.2:56920->10.2.1.3:6804 (ESTABLISHED)
    > ceph-osd  1541711  ceph  68u  IPv4  8171063  0t0  TCP 10.2.0.2:41952->10.2.1.3:6806 (ESTABLISHED)
    > ceph-osd  1541711  ceph  69u  IPv4  8178891  0t0  TCP 10.2.0.2:57890->10.2.1.3:6808 (ESTABLISHED)
    >
    >
    > See below our cluster config:
    >
    > [global]
    >           auth_client_required = cephx
    >           auth_cluster_required = cephx
    >           auth_service_required = cephx
    >           cluster_network = 10.2.1.0/24
    >           fsid = 0f19b6ff-0432-4c3f-b0cb-730e8302dc2c
    >           mon_allow_pool_delete = true
    >           mon_host = 10.2.1.1 10.2.1.2 10.2.1.3
    >           osd_pool_default_min_size = 2
    >           osd_pool_default_size = 3
    >           public_network = 10.2.1.0/24
    >
    > [client]
    >           keyring = /etc/pve/priv/$cluster.$name.keyring
    >
    > [mds]
    >           keyring = /var/lib/ceph/mds/ceph-$id/keyring
    >
    > [mds.nbs-vp-01]
    >           host = nbs-vp-01
    >           mds_standby_for_name = pve
    >
    > [mds.nbs-vp-03]
    >           host = nbs-vp-03
    >           mds_standby_for_name = pve
    >
    > [osd.0]
    >          public addr = 10.2.1.1
    >          cluster addr = 10.2.1.1
    >
    > [osd.1]
    >          public addr = 10.2.1.2
    >          cluster addr = 10.2.1.2
    >
    > [osd.2]
    >          public addr = 10.2.1.3
    >          cluster addr = 10.2.1.3
    >
    > [mgr.nbs-vp-01]
    >          public addr = 10.2.1.1
    >
    > [mgr.nbs-vp-02]
    >          public addr = 10.2.1.2
    >
    > [mgr.nbs-vp-03]
    >          public addr = 10.2.1.3
    >
    > [mon.nbs-vp-01]
    >          public addr = 10.2.1.1
    >
    > [mon.nbs-vp-02]
    >          public addr = 10.2.1.2
    >
    > [mon.nbs-vp-03]
    >          public addr = 10.2.1.3
    >
    > Cheers,
    > Liviu

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
