Hi there;

I'm currently facing some weird issues using multipath routing, and I'm
getting desperate to solve them. :-(

Overview:
---------
We have two distinct datacenters, linked to our office network across
VTUND VPNs. In our office, one Linux server has two VTUN tunnels connected
to our DCs (one tunnel per DC). The DCs are also connected to each other
by a VTUN tunnel.

So, basically, it looks something like:

                Office
                  |
               Firewall
                  |
               VTUN_box
                |    |
     ----------------------------------INTERNET
                |    |
               DC1-----DC2

In this situation, everything is working just fine. However, for
redundancy / load balancing reasons, we want to build the following setup.

                     Office
                       |
                    Firewall
                       |
                ---------------
                |             |
            Vtun_Box1-------Vtun_Box2
              |  |           |  |
     ----------------------------------INTERNET
              |  |           |  |
            DC1----DC2     DC1----DC2

In such a setup, if Vtun_Box1 crashes, all traffic going to our DCs would
be redirected by the firewall through Vtun_Box2, and vice-versa.

On top of this, if one or both tunnels on one Vtun box stop working while
the Vtun box itself is still alive, all traffic will automatically be
redirected through the other Vtun box.

Note that both Vtun boxes are on the same network segment; that is, they
have the same network address / broadcast / netmask. Only their IP
addresses are different. Thus, both Vtun boxes are reached by the firewall
through the same device (eth1, here).

Also, the Firewall doesn't NAT traffic going to the DCs, since each Vtun
box will already NAT everything going out through the tunnels.
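To make the firewall side of this setup concrete, here is a minimal sketch
of how equal-weight multipath routes towards two gateways on the same
device could be installed with iproute2. It is an illustration under
assumptions, not the exact commands used on the firewall: every address
below is a made-up placeholder, and the helper names (run, add_multipath)
are hypothetical. By default the script only prints the commands; it
executes them only when APPLY=1 is set.

```shell
#!/bin/sh
# Sketch only: all addresses are hypothetical placeholders, not real values.
VTUN_BOX1=192.168.10.1     # assumed IP of Vtun_Box1 on the eth1 segment
VTUN_BOX2=192.168.10.2     # assumed IP of Vtun_Box2 on the eth1 segment
DC1_NET=10.1.0.0/24        # assumed DC1 network
DC2_NET=10.2.0.0/24        # assumed DC2 network
TABLE=prod-vpn             # a named table must be listed in /etc/iproute2/rt_tables

# Dry run by default: print each command. With APPLY=1 (as root), execute it.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "$@"
    fi
}

# Install one multipath route for the given network: two next hops,
# both reached through eth1, with equal weight.
add_multipath() {
    run ip route add "$1" table "$TABLE" proto static \
        nexthop via "$VTUN_BOX1" dev eth1 weight 1 \
        nexthop via "$VTUN_BOX2" dev eth1 weight 1
}

# Have all traffic consult that table (priority 101 in this example).
run ip rule add from all table "$TABLE" priority 101

add_multipath "$DC1_NET"
add_multipath "$DC2_NET"
```

Running the script without APPLY=1 is a safe way to review the generated
"ip" commands before touching the routing tables.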

Now, regarding the servers' settings:

Firewall:
---------
System: Linux, stock kernel 2.2.22 with Julian's patches applied

Routing policy:

    0:      from all lookup local
    50:     from all lookup main
    101:    from all lookup prod-vpn   # Traffic going to both DCs
    200:    from all lookup uunet      # Default route
    32766:  from all lookup main
    32767:  from all lookup default

Where "ip route list table prod-vpn" gives:

    DC1_NET/24  proto static
            nexthop via Vtun_Box1  dev eth1 weight 1
            nexthop via Vtun_Box2  dev eth1 weight 1
    DC2_NET/24  proto static
            nexthop via Vtun_Box1  dev eth1 weight 1
            nexthop via Vtun_Box2  dev eth1 weight 1

Vtun_Box1:
----------
System: Linux, stock kernel 2.2.19

NAT:

    MASQ  all  ------  anywhere  anywhere  n/a

On this box, we have
    172.1.1.1 as the local ip of the tunnel to DC1
    172.1.2.1 as the local ip of the tunnel to DC2

Vtun_Box2:
----------
System: Linux, stock kernel 2.4.19

NAT:

    SNAT  all  --  any  tun2  anywhere  anywhere  to:172.1.1.3
    SNAT  all  --  any  tun3  anywhere  anywhere  to:172.1.2.3

Where 172.1.1.3 is the local ip of the tunnel to DC1
Where 172.1.2.3 is the local ip of the tunnel to DC2

Now, the problem :-)

We mostly do SSH to our DCs. In the simple setup, where we don't do
multipath routing (i.e., having only one Vtun box), everything is working
fine. We can ssh into any machine in any DC without problems. SSH sessions
are stable, and stop working only when the NAT ttl has expired.

However, when we activate multipath routing, everything goes wrong. For
instance:

    [root@leonard /root]# ssh -l root lime.hosting.kelkoo.net
    root@lime.hosting.kelkoo.net's password:
    Read from remote host lime.hosting.kelkoo.net: Connection reset by peer
    Connection to lime.hosting.kelkoo.net closed.

SSH simply doesn't work anymore, and it's neither a netfilter issue nor a
TCP wrapper ACL issue. We've checked every firewall rule and every TCP
Wrapper ACL. Everything is ok.

What is weird is that much simpler protocols seem to work fine; e.g.,
doing a telnet to the same host instead of SSH will work.
Same thing if we telnet to the SMTP port, for instance, and start
simulating an SMTP dialog; it'll work just fine.

I also noticed that ping & traceroute ICMP packets work just fine,
whatever path they use to reach a DC.

However, I think that we are likely to have the same problems with simple
protocols as well, if we look a bit deeper and start heavy testing.

Right now, we can't use SSH or FTP with our DCs; all sessions crash just
after authentication. Rarely, we can SSH into the machines successfully,
but the session crashes a few minutes later.

I'm a bit worried about this situation, because it seems that packets
don't use the proper reverse path to come back, although we are NAT'ing
everything going out of the tunnels!

Maybe the problem comes from the fact that both Vtun box gateways are
reachable through the same firewall device, but in that case, I'd like to
be sure before throwing everything out. :)

I don't get what's going on there; any help would be greatly appreciated.

Thanks in advance.
Best regards,
Vincent Jaussaud

--
Vincent Jaussaud
Kelkoo.com Security Manager
email: tatooin@kelkoo.com

"The UNIX philosophy is to design small tools that do one thing, and do it
well."

_______________________________________________
LARTC mailing list / LARTC@mailman.ds9a.nl
http://mailman.ds9a.nl/mailman/listinfo/lartc HOWTO: http://lartc.org/