Re: GRE-NAT broken

Linux Advanced Routing and Traffic Control

Hello Grant,

thanks for your reply. I'll respond to your questions inline.

On 25.01.2018 at 01:34, Grant Taylor wrote:
> On 01/24/2018 12:54 PM, Matthias Walther wrote:
>> Hello,
>
> Hi,
>
>> I used to nat GRE-tunnels into a kvm machine. That used to work
>> perfectly, till it stopped working in early January.
>
> Okay.  :-/
>
> Can I get a high level overview of your network topology?  You've
> mentioned bridges, eth0, and VMs.  -  I figure asking is better than
> speculating.
We run gateways for an open WiFi project here in Germany called
Freifunk (freifunk.net); it is non-commercial. We connect those
gateways to our AS exit routers via GRE tunnels (GRE over IPv4).

To save money and resources, we virtualize the hardware with KVM.
Usually we have a separate IPv4 address for each virtual machine. In
two experimental cases I tried to save that IPv4 address and NAT the
GRE tunnels through the hypervisor's public IP address, giving the
virtual machine only a private IP address (192.168.x.x). It is
standard destination NAT with the iptables rule quoted below.
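
For context, this is roughly the rule pair the setup relies on (the
DNAT rule is the one quoted below; the outgoing direction is left to
libvirt's catch-all MASQUERADE rule, since GRE has no ports that the
port-limited TCP/UDP MASQUERADE rules could match):

# inbound: GRE arriving on the public interface is handed to the VM
iptables -t nat -A PREROUTING -i eth0 -p gre -j DNAT --to-destination 192.168.10.62
# outbound: the VM's private source address is rewritten to the hypervisor's public one
iptables -t nat -A POSTROUTING -s 192.168.10.0/24 ! -d 192.168.10.0/24 -j MASQUERADE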

The bridges are created with brctl, and the topology in this
particular case looks as follows:

root@unimatrixzero ~ # brctl show
bridge name    bridge id        STP enabled    interfaces
br0        8000.fe540028664d    no        vnet2
                            vnet3
                            vnet5
                            vnet6
virbr1        8000.5254007bec03    yes        virbr1-nic
                            vnet4

The hoster is Hetzner, a German budget hosting company. They do not
block GRE tunnels; GRE to public IP addresses works just fine. As this
hypervisor hosts both virtual machines with public IP addresses and
ones with private (192.168.x.x) addresses, we have two bridges:
depending on the configuration, the virtual machines with public IP
addresses are in br0 and the ones with private addresses in virbr1.
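
(The vnetX entries are the per-guest tap interfaces KVM creates; when
a guest starts, libvirt in effect does something like the following,
with the interface names taken from the brctl output above:)

brctl addif virbr1 vnet4   # NATted guest goes into virbr1
brctl addif br0 vnet2      # guest with a public address goes into br0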


>> I'm not really sure, what caused this malfunction. I tried different
>> kernel versions, 4.4.113, 4.10.0-35, 4.10.0-37, 4.14. All on ubuntu
>> 16.04.3.
>
> Do you know specifically when things stopped working as desired?  Have
> you tried the kernel that you were running before that?  Are you aware
> of anything that changed on the system about that time?  I.e. updates?
> Kernel versions?
Unfortunately not. We're running unattended upgrades on the machines.
It's a spare-time project and we don't have the manpower to update all
our hosts manually. I'm not even sure whether the kernel was updated
or not. I tried the oldest kernel still available on the machine and a
much older 4.4 kernel; Ubuntu automatically removes unneeded older
kernels. Maybe a security patch that was applied to 4.4 as well as
4.10, 4.13 and 4.14 broke this setup. Maybe I should try an older 4.4
kernel, not revision 113.

But I can say for sure that we had two experimental machines running
this configuration with NATted GRE tunnels, and both stopped working
around the same time, after the setup had worked stably for several
months.
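
If it helps, the package logs might show when the kernel (or anything
else) actually changed; on Ubuntu something like this should work:

zgrep -h " install linux-image" /var/log/dpkg.log* 2>/dev/null | sort
zless /var/log/apt/history.log.1.gz
ls /var/log/unattended-upgrades/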
>
>> Normal destination based nat rules, like ssh tcp 22 e. g., work
>> perfectly. That gre nat rule is in place:
>>
>> -A PREROUTING -i eth0 -p gre -j DNAT --to-destination 192.168.10.62
>>
>> And the needed kernel modules are loaded:
>>
>> root# lsmod|grep gre
>> 61:nf_conntrack_proto_gre    16384  0
>> 62:nf_nat_proto_gre       16384  0
>> 63:nf_nat                 24576  4
>> nf_nat_proto_gre,nf_nat_ipv4,xt_nat,nf_nat_masquerade_ipv4
>> 64:nf_conntrack          106496  6
>> nf_conntrack_proto_gre,nf_nat,nf_nat_ipv4,xt_conntrack,nf_nat_masquerade_ipv4,nf_conntrack_ipv4
>>
>> Still some packes are just not correctly natted. The configuration
>> should be correct, as it used to work like this.
>
> Please provide a high level packet flow as you think that it should
> be. I.e. GRE encaped comes in eth0 … does something … gets DNATed to
> $IP … goes out somewhere.
I was pinging from inside the VM into the GRE tunnel, so the packet
flow is as follows:

The ICMP packet goes into the virtual GRE interface within the virtual
machine. There it is encapsulated, with the private IP address as
source, and sent out through eth0 of the virtual machine.

The packet is now in the network stack of the hypervisor: it comes in
through vnet4 and passes the virbr1 bridge. Then it should be NATted,
i.e. the private source address of the GRE packet should be replaced
by the public IP address of the hypervisor, and the NATted packet
should be sent out to the other end of the GRE tunnel somewhere on the
Internet. The last step, the NAT and the sending through the physical
interface, is what doesn't happen.
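
One thing that might narrow this down (assuming conntrack-tools is
installed; /proc/net/nf_conntrack shows the same data if present):
check whether the kernel actually creates a GRE conntrack entry for
this flow and what it maps it to:

conntrack -L -p gre
grep -w gre /proc/net/nf_conntrack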
>> One or two tunnels usually work. For the others, the gre packages are
>> just not natted but dropped. First example, which shows the expected
>> behavior:
>
> Are you saying that one or two tunnels at a time work?  As if it may
> be a load / state cache related problem?  Or that some specific
> tunnels seem to work.
>
> Do the tunnels that seem to work do so all the time?
Funnily enough, after each reboot a different tunnel seemed to work.
All tunnels do the same thing; they just go to different backbone
upstream servers for redundancy.

That's why we're not sure when the problem first occurred. Because one
working tunnel is enough, everything seemed to work fine and the
problem wasn't discovered immediately. Now it has stopped working
completely.
>
>> root# tcpdump -ni any host 185.66.195.1 and \( host 176.9.38.150 or
>> host 192.168.10.62 \) and proto 47 and ip[33]=0x01 and \(
>> ip[36:4]==0x644007BA or ip[40:4]==0x644007BA \)
>> tcpdump: verbose output suppressed, use -v or -vv for full protocol
>> decode
>> listening on any, link-type LINUX_SLL (Linux cooked), capture size
>> 262144 bytes
>> 04:06:41.322914 IP 192.168.10.62 > 185.66.195.1: GREv0, length 88: IP
>> 185.66.194.49 > 100.64.7.186: ICMP echo request, id 26639, seq 1,
>> length 64
>> 04:06:41.322922 IP 192.168.10.62 > 185.66.195.1: GREv0, length 88: IP
>> 185.66.194.49 > 100.64.7.186: ICMP echo request, id 26639, seq 1,
>> length 64
>> 04:06:41.322928 IP 176.9.38.150 > 185.66.195.1: GREv0, length 88: IP
>> 185.66.194.49 > 100.64.7.186: ICMP echo request, id 26639, seq 1,
>> length 64
>> 04:06:41.341906 IP 185.66.195.1 > 176.9.38.150: GREv0, length 88: IP
>> 100.64.7.186 > 185.66.194.49: ICMP echo reply, id 26639, seq 1,
>> length 64
>> 04:06:41.341915 IP 185.66.195.1 > 192.168.10.62: GREv0, length 88: IP
>> 100.64.7.186 > 185.66.194.49: ICMP echo reply, id 26639, seq 1,
>> length 64
>> 04:06:41.341918 IP 185.66.195.1 > 192.168.10.62: GREv0, length 88: IP
>> 100.64.7.186 > 185.66.194.49: ICMP echo reply, id 26639, seq 1,
>> length 64
>
> Would you please re-capture, both working and non-working, but
> specific to one interface?  I.e. -i eth0 and -i $outGoingInterface as
> separate captures?  (Or if there is a way to get tcpdump to show the
> interface in the textual output.)
Unfortunately, I can't provide a working capture: since I tested all
those different kernel versions, nothing works anymore. Not a single
tunnel, even though I went back to 4.13.0-31, with which I had
captured the packets yesterday.

(As I rebooted again, vnet4 is now vnet0.) Here are the three steps
captured separately:

root@unimatrixzero ~ #  tcpdump -ni vnet0 host 185.66.195.1 and \( host
176.9.38.150 or host 192.168.10.62 \) and proto 47 and ip[33]=0x01 and
\( ip[36:4]==0x644007BA or ip[40:4]==0x644007BA \)
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on vnet0, link-type EN10MB (Ethernet), capture size 262144 bytes
08:29:15.127873 IP 192.168.10.62 > 185.66.195.1: GREv0, length 88: IP
185.66.194.49 > 100.64.7.186: ICMP echo request, id 18763, seq 59, length 64
08:29:16.151856 IP 192.168.10.62 > 185.66.195.1: GREv0, length 88: IP
185.66.194.49 > 100.64.7.186: ICMP echo request, id 18763, seq 60, length 64
08:29:17.175800 IP 192.168.10.62 > 185.66.195.1: GREv0, length 88: IP
185.66.194.49 > 100.64.7.186: ICMP echo request, id 18763, seq 61, length 64
08:29:18.199780 IP 192.168.10.62 > 185.66.195.1: GREv0, length 88: IP
185.66.194.49 > 100.64.7.186: ICMP echo request, id 18763, seq 62, length 64
^C
4 packets captured
4 packets received by filter
0 packets dropped by kernel
root@unimatrixzero ~ #  tcpdump -ni virbr1 host 185.66.195.1 and \( host
176.9.38.150 or host 192.168.10.62 \) and proto 47 and ip[33]=0x01 and
\( ip[36:4]==0x644007BA or ip[40:4]==0x644007BA \)
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on virbr1, link-type EN10MB (Ethernet), capture size 262144 bytes
08:29:33.495592 IP 192.168.10.62 > 185.66.195.1: GREv0, length 88: IP
185.66.194.49 > 100.64.7.186: ICMP echo request, id 18763, seq 77, length 64
08:29:34.519567 IP 192.168.10.62 > 185.66.195.1: GREv0, length 88: IP
185.66.194.49 > 100.64.7.186: ICMP echo request, id 18763, seq 78, length 64
08:29:35.543572 IP 192.168.10.62 > 185.66.195.1: GREv0, length 88: IP
185.66.194.49 > 100.64.7.186: ICMP echo request, id 18763, seq 79, length 64
^C
3 packets captured
3 packets received by filter
0 packets dropped by kernel
root@unimatrixzero ~ #  tcpdump -ni eth0 host 185.66.195.1 and \( host
176.9.38.150 or host 192.168.10.62 \) and proto 47 and ip[33]=0x01 and
\( ip[36:4]==0x644007BA or ip[40:4]==0x644007BA \)
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
^C
0 packets captured
10 packets received by filter
0 packets dropped by kernel

The GRE packets go through the vnet interface and through the bridge,
but they are never NATted and never sent out through the physical
interface (eth0) of the hypervisor. All those tcpdumps were taken on
the hypervisor.

In the first example, where the NAT worked, we saw those three steps
as well, and the packets went out through eth0. The ICMP reply took
the reverse path back to its destination, the virtual machine, where
the GRE was decapsulated and ping got its reply.

I made sure that the nf_nat_proto_gre and nf_conntrack_proto_gre
modules are loaded; lsmod shows them.
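
Since the raw table already contains TRACE rules for TCP (see the
iptables dump further down), a similar rule for GRE might show exactly
where these packets are lost. This is only a sketch, using the same
interface and address as above; the trace output should end up in the
kernel log (it may need the nf_log_ipv4 logger backend):

iptables -t raw -A PREROUTING -i virbr1 -p gre -s 192.168.10.62 -j TRACE
sysctl net.netfilter.nf_log.2=nf_log_ipv4
dmesg -w | grep TRACE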


>> This^^ works as it should. The packet goes through the bridge
>> interface, then the bridge though which all natted vms are connected,
>> then it is translated and then through the eth0 interface of the
>> hypervisor. And the reply packages follows in reverse direction. The
>> nat works, the address is translated. Not so in the second case:
>
> What type of bridge are you using?  Standard Linux bridging, ala brctl
> and or ip?  Or are you using Open vSwitch, or something else?
Standard Linux bridging via brctl, as virsh and virt-manager create
them.
>
> Can we see a config dump of the bridge?
Virsh creates the bridge based on this XML definition:
virsh # net-dumpxml ipv4-nat
<network>
  <name>ipv4-nat</name>
  <uuid>2c0daba2-1e17-4d0d-9b9e-2acf09435da6</uuid>
  <forward mode='nat'>
    <nat>
      <port start='1024' end='65535'/>
    </nat>
  </forward>
  <bridge name='virbr1' stp='on' delay='0'/>
  <mac address='52:54:00:7b:ec:03'/>
  <ip address='192.168.10.1' netmask='255.255.255.0'>
    <dhcp>
      <range start='192.168.10.2' end='192.168.10.254'/>
    </dhcp>
  </ip>
</network>

>
> I wonder if a sysctl (/proc) setting got changed and now IPTables is
> trying to filter bridged traffic.  I think it's
> /proc/sys/net/bridge/bridge-nf-call-iptables.  (At least that's what
> I'm seeing with a quick Google search.)
This entry doesn't exist here.

root@unimatrixzero ~ # cat /proc/sys/net/
core/             ipv6/             nf_conntrack_max 
ipv4/             netfilter/        unix/            

There is no bridge or virbr1 entry under ipv4/ either, nor did I find
anything similar under netfilter/.
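
For what it's worth, on these kernels the bridge/ directory only shows
up under /proc/sys/net/ once the br_netfilter module is loaded, so its
absence just means the module isn't loaded:

lsmod | grep br_netfilter
modprobe br_netfilter
sysctl net.bridge.bridge-nf-call-iptables

Note that loading it changes behaviour, because bridged frames then
start traversing the iptables chains.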

>
> Can we see the output of iptables-save?
root@unimatrixzero ~ # cat /etc/iptables/rules.v4
# Generated by iptables-save v1.6.0 on Fri Oct 27 23:36:29 2017
*raw
:PREROUTING ACCEPT [4134062347:2804377965525]
:OUTPUT ACCEPT [45794:9989552]
-A PREROUTING -d 192.168.0.0/24 -p tcp -m tcp --dport 80 -j TRACE
-A PREROUTING -d 192.168.10.0/24 -p tcp -m tcp --dport 222 -j TRACE
-A OUTPUT -d 192.168.0.0/24 -p tcp -m tcp --dport 80 -j TRACE
-A OUTPUT -d 192.168.10.0/24 -p tcp -m tcp --dport 222 -j TRACE
COMMIT
# Completed on Fri Oct 27 23:36:29 2017
# Generated by iptables-save v1.6.0 on Fri Oct 27 23:36:29 2017
*mangle
:PREROUTING ACCEPT [4134063569:2804378696201]
:INPUT ACCEPT [48005:5510967]
:FORWARD ACCEPT [4133838276:2804349602217]
:OUTPUT ACCEPT [45797:9990176]
:POSTROUTING ACCEPT [4133884073:2804359592393]
-A POSTROUTING -o virbr1 -p udp -m udp --dport 68 -j CHECKSUM
--checksum-fill
COMMIT
# Completed on Fri Oct 27 23:36:29 2017
# Generated by iptables-save v1.6.0 on Fri Oct 27 23:36:29 2017
*nat
:PREROUTING ACCEPT [86097:5109916]
:INPUT ACCEPT [7557:460113]
:OUTPUT ACCEPT [162:11119]
:POSTROUTING ACCEPT [78890:4669843]
-A PREROUTING -d 176.9.38.150/32 -p tcp -m tcp --dport 222 -j DNAT
--to-destination 192.168.10.62:22
-A PREROUTING -d 176.9.38.150/32 -p tcp -m tcp --dport 80 -j DNAT
--to-destination 192.168.10.248:80
-A PREROUTING -d 176.9.38.150/32 -p tcp -m tcp --dport 443 -j DNAT
--to-destination 192.168.10.248:443
-A PREROUTING -d 176.9.38.150/32 -p tcp -m tcp --dport 223 -j DNAT
--to-destination 192.168.10.248:22
-A PREROUTING -i eth0 -p gre -j DNAT --to-destination 192.168.10.62
-A PREROUTING -d 176.9.38.150/32 -p udp -m udp --dport 20000:20100 -j
DNAT --to-destination 192.168.10.62:20000-20100
-A POSTROUTING -s 192.168.10.0/24 -d 224.0.0.0/24 -j RETURN
-A POSTROUTING -s 192.168.10.0/24 -d 255.255.255.255/32 -j RETURN
-A POSTROUTING -s 192.168.10.0/24 ! -d 192.168.10.0/24 -p tcp -j
MASQUERADE --to-ports 1024-65535
-A POSTROUTING -s 192.168.10.0/24 ! -d 192.168.10.0/24 -p udp -j
MASQUERADE --to-ports 1024-65535
-A POSTROUTING -s 192.168.10.0/24 ! -d 192.168.10.0/24 -j MASQUERADE
COMMIT
# Completed on Fri Oct 27 23:36:29 2017
# Generated by iptables-save v1.6.0 on Fri Oct 27 23:36:29 2017
*filter
:INPUT ACCEPT [47667:5451204]
:FORWARD ACCEPT [4133512236:2804145422827]
:OUTPUT ACCEPT [45662:9946618]
-A INPUT -i virbr1 -p udp -m udp --dport 53 -j ACCEPT
-A INPUT -i virbr1 -p tcp -m tcp --dport 53 -j ACCEPT
-A INPUT -i virbr1 -p udp -m udp --dport 67 -j ACCEPT
-A INPUT -i virbr1 -p tcp -m tcp --dport 67 -j ACCEPT
-A FORWARD -d 192.168.10.0/24 -o virbr1 -m conntrack --ctstate
RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -s 192.168.10.0/24 -i virbr1 -j ACCEPT
-A FORWARD -i virbr1 -o virbr1 -j ACCEPT
-A FORWARD -d 192.168.10.0/24 -m state --state NEW,RELATED,ESTABLISHED
-j ACCEPT
-A FORWARD -d 192.168.10.0/24 -i eth0 -o virbr1 -m state --state
RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -s 192.168.10.0/24 -i virbr1 -o eth0 -j ACCEPT
-A FORWARD -i virbr1 -o virbr1 -j ACCEPT
-A OUTPUT -o virbr1 -p udp -m udp --dport 68 -j ACCEPT
COMMIT
# Completed on Fri Oct 27 23:36:29 2017
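
Note that /etc/iptables/rules.v4 is the saved rule set; the live
rules, with hit counters that show whether the GRE DNAT and MASQUERADE
rules match anything at all, can be dumped with:

iptables-save -c
iptables -t nat -vnL PREROUTING
iptables -t nat -vnL POSTROUTING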

>
>> root@# tcpdump -ni any host 185.66.195.0 and \( host 176.9.38.150 or
>> host 192.168.10.62 \) and proto 47 and ip[33]=0x01 and \(
>> ip[36:4]==0x644007B4 or ip[40:4]==0x644007B4 \)
>> tcpdump: verbose output suppressed, use -v or -vv for full protocol
>> decode
>> listening on any, link-type LINUX_SLL (Linux cooked), capture size
>> 262144 bytes
>> 03:58:01.972551 IP 192.168.10.62 > 185.66.195.0: GREv0, length 88: IP
>> 185.66.194.49 > 100.64.7.180: ICMP echo request, id 25043, seq 1,
>> length 64
>> 03:58:01.972554 IP 192.168.10.62 > 185.66.195.0: GREv0, length 88: IP
>> 185.66.194.49 > 100.64.7.180: ICMP echo request, id 25043, seq 1,
>> length 64
>> 03:58:03.001013 IP 192.168.10.62 > 185.66.195.0: GREv0, length 88: IP
>> 185.66.194.49 > 100.64.7.180: ICMP echo request, id 25043, seq 2,
>> length 64
>> 03:58:03.001021 IP 192.168.10.62 > 185.66.195.0: GREv0, length 88: IP
>> 185.66.194.49 > 100.64.7.180: ICMP echo request, id 25043, seq 2,
>> length 64
>>
>> tcpdump catches the outgoing package. But instead of being
>> translated, it's dropped.
>
> We can't tell from the above output if it's traffic coming into the
> outside interface (eth0?) or traffic leaving the inside interface
> (connected to the bridge?).
>
> What hypervisor are you using?  KVM, VirtualBox, something else?  How
> do the VMs connect to the bridge?
KVM. KVM creates the interface on the hypervisor and puts it into the
bridge.
>
> Also, if you're bridging, why are you DNATing packets?  -  Or is your
> bridge internal only and you're DNATing between the outside (eth0) and
> the internal (only) bridge where the VMs are connected?
The bridge is a NATted /24 subnet created by KVM. All VMs that don't
have a public address are connected to this bridge, which NATs the
outgoing connections just like a standard home router would.

A bridge isn't strictly necessary here; it just makes things easier.
You could route each virtual machine separately. It's simply KVM's
approach to make things smoother.
>
> It sort of looks like you may have a one to one mapping of outside IPs
> to inside IPs.  -  Which makes me ask the question why you're DNATing
> in the first place.  Or rather why you aren't bridging the VMs to the
> outside and running the globally routed IP directly in the VMs.
Our standard configuration is to have a separate global IPv4 address
for each virtual machine. We experimented with NATting those GRE
tunnels to save one IP address per hypervisor, which had worked
perfectly so far.

Freifunk is not just a WiFi network. It's about getting to know
networking topics such as mesh networks or software-defined networks
based on GRE tunnels. My main reason to participate is to understand
the technology behind all of that.

As I wrote in my other email, I looked into the source code. As far as
I understand it, NAT for GREv0 has never been properly implemented. I
don't understand how this ever worked.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/plain/net/ipv4/netfilter/nf_nat_proto_gre.c?id=HEAD

But NATting GRE is possible; even my Internet provider's 50-euro
router can do it.

Thanks for your help!

Regards,
Matthias

>
>> Any ideas, how I could analyse this? All tested kernels showed the
>> exact same behavior. It's as if only one gre nat connection was
>> possible.
>
> I need more details to be able to start poking further.
>
>
>
