Hello Grant, thanks for your reply. I'll respond to your questions inline.

On 25.01.2018 at 01:34, Grant Taylor wrote:
> On 01/24/2018 12:54 PM, Matthias Walther wrote:
>> Hello,
>
> Hi,
>
>> I used to NAT GRE tunnels into a KVM machine. That used to work
>> perfectly, till it stopped working in early January.
>
> Okay. :-/
>
> Can I get a high level overview of your network topology? You've
> mentioned bridges, eth0, and VMs. - I figure asking is better than
> speculating.

We're running gateways for an open wifi project here in Germany called Freifunk (freifunk.net); it's non-commercial. We connect those gateways to our AS exit routers via GRE tunnels, GRE over IPv4.

To save money and resources, we virtualize the hardware with KVM. Usually we have an extra IPv4 address for each virtual machine. In two experimental cases I tried to save the IPv4 address and NAT the GRE tunnels from the hypervisor's public IP address, giving the virtual machine only a private IP address (192.168....). Standard destination NAT with the iptables rule as mentioned.

The bridges are created with brctl, and the topology in this particular case looks as follows:

root@unimatrixzero ~ # brctl show
bridge name     bridge id               STP enabled     interfaces
br0             8000.fe540028664d       no              vnet2
                                                        vnet3
                                                        vnet5
                                                        vnet6
virbr1          8000.5254007bec03       yes             virbr1-nic
                                                        vnet4

The hoster is Hetzner, a German budget hosting company. They do not block GRE tunnels; GRE to public IP addresses works just fine.

As this hypervisor hosts virtual machines with both public and private (192.168...) IP addresses, we have two bridges. Depending on the configuration, the virtual machines with public IP addresses are in br0 and the ones with private addresses in virbr1.

>> I'm not really sure, what caused this malfunction. I tried different
>> kernel versions, 4.4.113, 4.10.0-35, 4.10.0-37, 4.14. All on Ubuntu
>> 16.04.3.
>
> Do you know specifically when things stopped working as desired?
> Have you tried the kernel that you were running before that? Are you
> aware of anything that changed on the system about that time? I.e.
> updates? Kernel versions?

Unfortunately not. We're running unattended upgrades on the machines. It's a free-time project and we don't have the manpower to update all our hosts manually. I'm not even sure whether the kernel was updated or not. I tried the oldest kernel still available on the machine and a much older 4.4 kernel; Ubuntu automatically uninstalls unneeded, older kernels. Maybe a security patch that was applied to 4.4 as well as 4.10, 4.13 and 4.14 broke this use case. Maybe I should try an older 4.4 kernel, not revision 113.

But I can say for sure that we had two experimental machines running this configuration with NATted GRE tunnels, and both stopped working around the same time after this had worked stably for several months.

>
>> Normal destination based NAT rules, like ssh tcp 22 e.g., work
>> perfectly. That GRE NAT rule is in place:
>>
>> -A PREROUTING -i eth0 -p gre -j DNAT --to-destination 192.168.10.62
>>
>> And the needed kernel modules are loaded:
>>
>> root# lsmod|grep gre
>> nf_conntrack_proto_gre 16384 0
>> nf_nat_proto_gre       16384 0
>> nf_nat                 24576 4 nf_nat_proto_gre,nf_nat_ipv4,xt_nat,nf_nat_masquerade_ipv4
>> nf_conntrack          106496 6 nf_conntrack_proto_gre,nf_nat,nf_nat_ipv4,xt_conntrack,nf_nat_masquerade_ipv4,nf_conntrack_ipv4
>>
>> Still some packets are just not correctly NATted. The configuration
>> should be correct, as it used to work like this.
>
> Please provide a high level packet flow as you think that it should
> be. I.e. GRE encaped comes in eth0 … does something … gets DNATed to
> $IP … goes out somewhere.

I was pinging from inside the VM into the GRE tunnel. So the packet flow is as follows: the ICMP packet goes into the virtual GRE interface within the virtual machine.
Then it is encapsulated with the private IP address as source and sent out through eth0 of the virtual machine. The packet is now in the network stack of the hypervisor, coming in through vnet4 and going through the virbr1 bridge. Then it should be NATted, so that the private source address of the GRE packet is replaced by the public IP address of the hypervisor. Then the NATted packet is sent out to the other end of the GRE tunnel somewhere on the internet.

The last two steps, the NAT and the sending through the physical interface, are what doesn't happen.

>> One or two tunnels usually work. For the others, the GRE packets are
>> just not NATted but dropped. First example, which shows the expected
>> behavior:
>
> Are you saying that one or two tunnels at a time work? As if it may
> be a load / state cache related problem? Or that some specific
> tunnels seem to work.
>
> Do the tunnels that seem to work do so all the time?

Funnily, after each reboot a different tunnel seemed to work. All tunnels do the same thing; they just go to different backbone upstream servers for redundancy. That's why we're not sure when the problem first occurred. Because everything seemed to work fine (one working tunnel is enough), the problem wasn't discovered directly. Now it has stopped working completely.
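For reference, the pieces this flow depends on can be condensed into a short shell sketch. This is a minimal sketch, assuming the addresses used in this thread; it is not the exact configuration of the machine, just the rules and modules the described flow relies on (run as root):

```shell
#!/bin/sh
# Minimal sketch of the GRE-over-NAT setup described above.
# Addresses (192.168.10.0/24, 192.168.10.62) are the ones from this thread.

# GRE (IP protocol 47) has no ports, so NAT needs the GRE
# conntrack/NAT helper modules:
modprobe nf_conntrack_proto_gre
modprobe nf_nat_proto_gre

# The hypervisor must route between virbr1 and eth0:
sysctl -w net.ipv4.ip_forward=1

# Outbound: rewrite the VM's private source address to the
# hypervisor's public address.
iptables -t nat -A POSTROUTING -s 192.168.10.0/24 ! -d 192.168.10.0/24 \
    -j MASQUERADE

# Inbound: hand all GRE arriving on eth0 to the VM.
iptables -t nat -A PREROUTING -i eth0 -p gre \
    -j DNAT --to-destination 192.168.10.62
```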
>
>> root# tcpdump -ni any host 185.66.195.1 and \( host 176.9.38.150 or
>> host 192.168.10.62 \) and proto 47 and ip[33]=0x01 and \(
>> ip[36:4]==0x644007BA or ip[40:4]==0x644007BA \)
>> tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
>> listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
>> 04:06:41.322914 IP 192.168.10.62 > 185.66.195.1: GREv0, length 88: IP 185.66.194.49 > 100.64.7.186: ICMP echo request, id 26639, seq 1, length 64
>> 04:06:41.322922 IP 192.168.10.62 > 185.66.195.1: GREv0, length 88: IP 185.66.194.49 > 100.64.7.186: ICMP echo request, id 26639, seq 1, length 64
>> 04:06:41.322928 IP 176.9.38.150 > 185.66.195.1: GREv0, length 88: IP 185.66.194.49 > 100.64.7.186: ICMP echo request, id 26639, seq 1, length 64
>> 04:06:41.341906 IP 185.66.195.1 > 176.9.38.150: GREv0, length 88: IP 100.64.7.186 > 185.66.194.49: ICMP echo reply, id 26639, seq 1, length 64
>> 04:06:41.341915 IP 185.66.195.1 > 192.168.10.62: GREv0, length 88: IP 100.64.7.186 > 185.66.194.49: ICMP echo reply, id 26639, seq 1, length 64
>> 04:06:41.341918 IP 185.66.195.1 > 192.168.10.62: GREv0, length 88: IP 100.64.7.186 > 185.66.194.49: ICMP echo reply, id 26639, seq 1, length 64
>
> Would you please re-capture, both working and non-working, but
> specific to one interface? I.e. -i eth0 and -i $outGoingInterface as
> separate captures? (Or if there is a way to get tcpdump to show the
> interface in the textual output.)

Unfortunately, I can't provide a working example: since I tested all those different kernel versions, nothing works anymore. Not a single tunnel, even though I went back to 4.13.0-31, with which I had captured the packets yesterday. (As I rebooted again, vnet4 is now vnet0.)
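As an aside, the tcpdump filter used in these captures matches the inner, GRE-encapsulated packet by raw byte offsets in the outer IP packet. The sketch below spells out the offsets and decodes the hex address constant; it assumes a plain 4-byte GRE header without key, checksum, or sequence options:

```shell
# Byte layout of a plain GREv0 packet, as tcpdump's ip[] indexing sees it:
#   bytes  0..19 : outer IPv4 header (20 bytes, no options)
#   bytes 20..23 : GRE header (4 bytes, no key/checksum/sequence)
#   bytes 24..43 : inner IPv4 header
# So ip[33] is the inner protocol field (0x01 = ICMP), and
# ip[36:4] / ip[40:4] are the inner source / destination addresses.
# The constant 0x644007BA written out as a dotted quad:
printf '%d.%d.%d.%d\n' 0x64 0x40 0x07 0xBA    # 100.64.7.186
```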
Here are the three steps separately:

root@unimatrixzero ~ # tcpdump -ni vnet0 host 185.66.195.1 and \( host 176.9.38.150 or host 192.168.10.62 \) and proto 47 and ip[33]=0x01 and \( ip[36:4]==0x644007BA or ip[40:4]==0x644007BA \)
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on vnet0, link-type EN10MB (Ethernet), capture size 262144 bytes
08:29:15.127873 IP 192.168.10.62 > 185.66.195.1: GREv0, length 88: IP 185.66.194.49 > 100.64.7.186: ICMP echo request, id 18763, seq 59, length 64
08:29:16.151856 IP 192.168.10.62 > 185.66.195.1: GREv0, length 88: IP 185.66.194.49 > 100.64.7.186: ICMP echo request, id 18763, seq 60, length 64
08:29:17.175800 IP 192.168.10.62 > 185.66.195.1: GREv0, length 88: IP 185.66.194.49 > 100.64.7.186: ICMP echo request, id 18763, seq 61, length 64
08:29:18.199780 IP 192.168.10.62 > 185.66.195.1: GREv0, length 88: IP 185.66.194.49 > 100.64.7.186: ICMP echo request, id 18763, seq 62, length 64
^C
4 packets captured
4 packets received by filter
0 packets dropped by kernel

root@unimatrixzero ~ # tcpdump -ni virbr1 host 185.66.195.1 and \( host 176.9.38.150 or host 192.168.10.62 \) and proto 47 and ip[33]=0x01 and \( ip[36:4]==0x644007BA or ip[40:4]==0x644007BA \)
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on virbr1, link-type EN10MB (Ethernet), capture size 262144 bytes
08:29:33.495592 IP 192.168.10.62 > 185.66.195.1: GREv0, length 88: IP 185.66.194.49 > 100.64.7.186: ICMP echo request, id 18763, seq 77, length 64
08:29:34.519567 IP 192.168.10.62 > 185.66.195.1: GREv0, length 88: IP 185.66.194.49 > 100.64.7.186: ICMP echo request, id 18763, seq 78, length 64
08:29:35.543572 IP 192.168.10.62 > 185.66.195.1: GREv0, length 88: IP 185.66.194.49 > 100.64.7.186: ICMP echo request, id 18763, seq 79, length 64
^C
3 packets captured
3 packets received by filter
0 packets dropped by kernel

root@unimatrixzero ~ # tcpdump -ni eth0 host 185.66.195.1 and \( host 176.9.38.150 or
host 192.168.10.62 \) and proto 47 and ip[33]=0x01 and \( ip[36:4]==0x644007BA or ip[40:4]==0x644007BA \)
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
^C
0 packets captured
10 packets received by filter
0 packets dropped by kernel

The GRE packets go through the vnet interface and through the bridge, but they are never NATted and never sent out through the physical interface (eth0) on the hypervisor. All those tcpdumps were made on the hypervisor.

In the first example, where the NAT worked, we saw those same three steps as well, and the packets went out through eth0, got an ICMP reply which took the reverse path back to its destination, the virtual machine, where the GRE was decapsulated and ping got its reply packet.

I made sure that the nf_nat_proto_gre and nf_conntrack_proto_gre modules are loaded; lsmod shows them.

>> This^^ works as it should. The packet goes through the bridge
>> interface, then the bridge through which all NATted VMs are connected,
>> then it is translated and sent through the eth0 interface of the
>> hypervisor. And the reply packets follow in the reverse direction. The
>> NAT works, the address is translated. Not so in the second case:
>
> What type of bridge are you using? Standard Linux bridging, ala brctl
> and or ip? Or are you using Open vSwitch, or something else?

Standard Linux bridges, as virsh and virt-manager create them with brctl.

>
> Can we see a config dump of the bridge?
Virsh creates the bridge based on this XML file:

virsh # net-dumpxml ipv4-nat
<network>
  <name>ipv4-nat</name>
  <uuid>2c0daba2-1e17-4d0d-9b9e-2acf09435da6</uuid>
  <forward mode='nat'>
    <nat>
      <port start='1024' end='65535'/>
    </nat>
  </forward>
  <bridge name='virbr1' stp='on' delay='0'/>
  <mac address='52:54:00:7b:ec:03'/>
  <ip address='192.168.10.1' netmask='255.255.255.0'>
    <dhcp>
      <range start='192.168.10.2' end='192.168.10.254'/>
    </dhcp>
  </ip>
</network>

>
> I wonder if a sysctl (/proc) setting got changed and now IPTables is
> trying to filter bridged traffic. I think it's
> /proc/sys/net/bridge/bridge-nf-call-iptables. (At least that's what
> I'm seeing with a quick Google search.)

This entry doesn't exist here (perhaps because the br_netfilter module isn't loaded on this machine):

root@unimatrixzero ~ # cat /proc/sys/net/
core/             ipv6/             nf_conntrack_max
ipv4/             netfilter/        unix/

There is no bridge or virbr1 entry in ipv4/ either, nor did I find anything similar in netfilter/.

>
> Can we see the output of iptables-save?

root@unimatrixzero ~ # cat /etc/iptables/rules.v4
# Generated by iptables-save v1.6.0 on Fri Oct 27 23:36:29 2017
*raw
:PREROUTING ACCEPT [4134062347:2804377965525]
:OUTPUT ACCEPT [45794:9989552]
-A PREROUTING -d 192.168.0.0/24 -p tcp -m tcp --dport 80 -j TRACE
-A PREROUTING -d 192.168.10.0/24 -p tcp -m tcp --dport 222 -j TRACE
-A OUTPUT -d 192.168.0.0/24 -p tcp -m tcp --dport 80 -j TRACE
-A OUTPUT -d 192.168.10.0/24 -p tcp -m tcp --dport 222 -j TRACE
COMMIT
# Completed on Fri Oct 27 23:36:29 2017
# Generated by iptables-save v1.6.0 on Fri Oct 27 23:36:29 2017
*mangle
:PREROUTING ACCEPT [4134063569:2804378696201]
:INPUT ACCEPT [48005:5510967]
:FORWARD ACCEPT [4133838276:2804349602217]
:OUTPUT ACCEPT [45797:9990176]
:POSTROUTING ACCEPT [4133884073:2804359592393]
-A POSTROUTING -o virbr1 -p udp -m udp --dport 68 -j CHECKSUM --checksum-fill
COMMIT
# Completed on Fri Oct 27 23:36:29 2017
# Generated by iptables-save v1.6.0 on Fri Oct 27 23:36:29 2017
*nat
:PREROUTING ACCEPT [86097:5109916]
:INPUT ACCEPT [7557:460113]
:OUTPUT ACCEPT
[162:11119]
:POSTROUTING ACCEPT [78890:4669843]
-A PREROUTING -d 176.9.38.150/32 -p tcp -m tcp --dport 222 -j DNAT --to-destination 192.168.10.62:22
-A PREROUTING -d 176.9.38.150/32 -p tcp -m tcp --dport 80 -j DNAT --to-destination 192.168.10.248:80
-A PREROUTING -d 176.9.38.150/32 -p tcp -m tcp --dport 443 -j DNAT --to-destination 192.168.10.248:443
-A PREROUTING -d 176.9.38.150/32 -p tcp -m tcp --dport 223 -j DNAT --to-destination 192.168.10.248:22
-A PREROUTING -i eth0 -p gre -j DNAT --to-destination 192.168.10.62
-A PREROUTING -d 176.9.38.150/32 -p udp -m udp --dport 20000:20100 -j DNAT --to-destination 192.168.10.62:20000-20100
-A POSTROUTING -s 192.168.10.0/24 -d 224.0.0.0/24 -j RETURN
-A POSTROUTING -s 192.168.10.0/24 -d 255.255.255.255/32 -j RETURN
-A POSTROUTING -s 192.168.10.0/24 ! -d 192.168.10.0/24 -p tcp -j MASQUERADE --to-ports 1024-65535
-A POSTROUTING -s 192.168.10.0/24 ! -d 192.168.10.0/24 -p udp -j MASQUERADE --to-ports 1024-65535
-A POSTROUTING -s 192.168.10.0/24 ! -d 192.168.10.0/24 -j MASQUERADE
COMMIT
# Completed on Fri Oct 27 23:36:29 2017
# Generated by iptables-save v1.6.0 on Fri Oct 27 23:36:29 2017
*filter
:INPUT ACCEPT [47667:5451204]
:FORWARD ACCEPT [4133512236:2804145422827]
:OUTPUT ACCEPT [45662:9946618]
-A INPUT -i virbr1 -p udp -m udp --dport 53 -j ACCEPT
-A INPUT -i virbr1 -p tcp -m tcp --dport 53 -j ACCEPT
-A INPUT -i virbr1 -p udp -m udp --dport 67 -j ACCEPT
-A INPUT -i virbr1 -p tcp -m tcp --dport 67 -j ACCEPT
-A FORWARD -d 192.168.10.0/24 -o virbr1 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -s 192.168.10.0/24 -i virbr1 -j ACCEPT
-A FORWARD -i virbr1 -o virbr1 -j ACCEPT
-A FORWARD -d 192.168.10.0/24 -m state --state NEW,RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -d 192.168.10.0/24 -i eth0 -o virbr1 -m state --state RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -s 192.168.10.0/24 -i virbr1 -o eth0 -j ACCEPT
-A FORWARD -i virbr1 -o virbr1 -j ACCEPT
-A OUTPUT -o virbr1 -p udp -m udp --dport 68 -j ACCEPT
COMMIT
#
Completed on Fri Oct 27 23:36:29 2017

>
>> root@# tcpdump -ni any host 185.66.195.0 and \( host 176.9.38.150 or
>> host 192.168.10.62 \) and proto 47 and ip[33]=0x01 and \(
>> ip[36:4]==0x644007B4 or ip[40:4]==0x644007B4 \)
>> tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
>> listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
>> 03:58:01.972551 IP 192.168.10.62 > 185.66.195.0: GREv0, length 88: IP 185.66.194.49 > 100.64.7.180: ICMP echo request, id 25043, seq 1, length 64
>> 03:58:01.972554 IP 192.168.10.62 > 185.66.195.0: GREv0, length 88: IP 185.66.194.49 > 100.64.7.180: ICMP echo request, id 25043, seq 1, length 64
>> 03:58:03.001013 IP 192.168.10.62 > 185.66.195.0: GREv0, length 88: IP 185.66.194.49 > 100.64.7.180: ICMP echo request, id 25043, seq 2, length 64
>> 03:58:03.001021 IP 192.168.10.62 > 185.66.195.0: GREv0, length 88: IP 185.66.194.49 > 100.64.7.180: ICMP echo request, id 25043, seq 2, length 64
>>
>> tcpdump catches the outgoing packet. But instead of being
>> translated, it's dropped.
>
> We can't tell from the above output if it's traffic coming into the
> outside interface (eth0?) or traffic leaving the inside interface
> (connected to the bridge?).
>
> What hypervisor are you using? KVM, VirtualBox, something else? How
> do the VMs connect to the bridge?

KVM. KVM creates the vnet interface on the hypervisor and puts it into the bridge.

>
> Also, if you're bridging, why are you DNATing packets? - Or is your
> bridge internal only and you're DNATing between the outside (eth0) and
> the internal (only) bridge where the VMs are connected?

The bridge is a NATted /24 subnet created by KVM. All VMs that don't have a public address are connected to that bridge, which NATs the outgoing connections just like a standard home router would. A bridge isn't strictly necessary here; it just makes things easier. You could route each virtual machine separately.
It's just KVM's approach to make things smoother.

>
> It sort of looks like you may have a one to one mapping of outside IPs
> to inside IPs. - Which makes me ask the question why you're DNATing
> in the first place. Or rather why you aren't bridging the VMs to the
> outside and running the globally routed IP directly in the VMs.

Our standard configuration is to have a separate global IPv4 address for each virtual machine. We experimented with NATting those GRE tunnels to save one IP address per hypervisor, and that had worked perfectly so far.

Freifunk is not just a wifi network. It's about getting to know network technology, like mesh networks or software-defined networks based on GRE tunnels. My reasons to participate are mostly to understand the technology behind all that.

As I wrote in my other email, I looked into the source code. As far as I understand it, GREv0 NAT has never been properly implemented. I don't understand how this ever worked.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/plain/net/ipv4/netfilter/nf_nat_proto_gre.c?id=HEAD

But GRE NATting is possible. Even my internet provider's 50 Euro router can do it.

Thanks for your help!

Regards,
Matthias

>
>> Any ideas, how I could analyse this? All tested kernels showed the
>> exact same behavior. It's as if only one GRE NAT connection was
>> possible.
>
> I need more details to be able to start poking further.

--
To unsubscribe from this list: send the line "unsubscribe lartc" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html