Intermittent Connections Through ip & iptables (RTF)

I have reached the limits of my knowledge of TCP/IP.  I have tried to
understand RedHat 7.3 with iproute2-ss010824 and iptables 1.2.5.  What
I have is an intermittent condition where, sometimes, external hosts
can't reach my internal hosts.  I've simplified the problem down to a
repeatable case.

The data below shows a configuration that is not yet secure; I don't
dare "tighten it down" until I have it working reliably.  So, I've
obscured all the IP address ranges in this post.

If you go to http://www.alertra.com and do a "Spot Check" on
http://www.deepwoods.com you'll probably see a report that shows
some of Alertra's servers can reach my site, but others can't.
For example, I recently did a test using http://bb.bb.bb.27 (a
special IP address for that server that I use only for testing;
it has no associated domain name) and it showed the following results:

http://bb.bb.bb.27
Time (US/Pacific)   Checked From     Result Bytes Seconds
04/17/2003 11:32:27 Detroit USA         OK  26394  1.641
04/17/2003 11:32:56 Frankfurt GERMANY ERROR  N/A  30.023
04/17/2003 11:32:29 London UK           OK  26394  2.556
04/17/2003 11:32:56 Los Angeles USA   ERROR  N/A  30.028
04/17/2003 11:32:56 Montreal CANADA   ERROR  N/A  30.026
04/17/2003 11:32:28 Oklahoma City USA   OK  26394  2.127
Average Response Time: 16.067 Seconds

So, three of their servers get the web page, but three don't
(and, if you check a public site, like google.com, all servers
will report "OK").

Here is my router/firewall configuration.  I've followed Bert
Hubert's excellent LARTC "4.2.1 Split access" recommendations for
a dual-DSL environment (and I hope I've done so correctly).

aa.aa.aa.0/29  is WAN1 (DSL Service #1) on eth1
bb.bb.bb.24/29 is WAN2 (DSL Service #2) on eth2
cc.cc.cc.0/24  is LAN  (internal address space) on eth0
cc.cc.cc.12/31 is an SMTP and Lotus Notes server (port 1352)
cc.cc.cc.54/31 is an IIS5 server
cc.cc.cc.64/26 is the LAN's DHCP range (i.e., workstations)
cc.cc.cc.11    is the Router/Firewall
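
For readers less used to CIDR shorthand, the sketch below expands the
three LAN blocks listed above.  It is only an illustration: the real
prefix is obscured as cc.cc.cc in this post, so 192.168.1 is an assumed
placeholder, and expand() and BLOCKS are hypothetical names.

```python
import ipaddress

# Placeholder prefix: the post obscures the real LAN range as cc.cc.cc.0/24,
# so 192.168.1. is an assumption used only to make the arithmetic runnable.
LAN_PREFIX = "192.168.1."

def expand(last_octet_cidr):
    """Return (first, last, count) for a block such as '12/31' inside the LAN."""
    net = ipaddress.ip_network(LAN_PREFIX + last_octet_cidr, strict=True)
    return str(net[0]), str(net[-1]), net.num_addresses

# Blocks copied from the address plan above (last octet and prefix length only).
BLOCKS = {
    "SMTP / Lotus Notes": "12/31",
    "IIS5": "54/31",
    "DHCP workstations": "64/26",
}

for name, block in BLOCKS.items():
    first, last, count = expand(block)
    print(f"{name:20s} {first} - {last} ({count} addresses)")
```

Each /31 covers exactly two addresses (.12-.13 and .54-.55), so the DNAT
targets cc.cc.cc.13 and cc.cc.cc.55 in the NAT rules fall inside the two
server blocks, and the /26 covers .64 through .127.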

~~~~~~~~~Certain key kernel values~~~~~~~~~~
/proc/sys/net/ipv4/ip_forward = 1
/proc/sys/net/ipv4/conf/all/rp_filter = 0
/proc/sys/net/ipv4/conf/default/rp_filter = 0
/proc/sys/net/ipv4/conf/eth0/rp_filter = 0
/proc/sys/net/ipv4/conf/eth1/rp_filter = 0
/proc/sys/net/ipv4/conf/eth2/rp_filter = 0
/proc/sys/net/ipv4/conf/lo/rp_filter = 0
/proc/sys/net/ipv4/route/gc_timeout = 60
/proc/sys/net/ipv4/route/gc_interval = 60

~~~~~~~~~~~~~~Device Addresses~~~~~~~~~~~~~ip addr show:
   1: lo: <LOOPBACK,UP> mtu 16436 qdisc noqueue
       inet 127.0.0.1/8 brd 127.255.255.255 scope host lo
   2: eth0: <BROADCAST,MULTICAST,PROMISC,UP> mtu 1500 qdisc pfifo_fast qlen 100
       inet cc.cc.cc.11/24 brd cc.cc.cc.255 scope global eth0
   3: eth1: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast qlen 100
       inet aa.aa.aa.0/29 brd aa.aa.aa.7 scope global eth1
       inet aa.aa.aa.2/32 scope global eth1
       inet aa.aa.aa.3/32 scope global eth1
       inet aa.aa.aa.4/32 scope global eth1
   4: eth2: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast qlen 100
       inet bb.bb.bb.24/29 brd bb.bb.bb.31 scope global eth2
       inet bb.bb.bb.26/32 scope global eth2
       inet bb.bb.bb.27/32 scope global eth2
       inet bb.bb.bb.28/32 scope global eth2

~~~~~~~~~~~~~~~~~~Routes~~~~~~~~~~~~~~~~~~~ip route list:
   aa.aa.aa.0/29 dev eth1  scope link  src aa.aa.aa.0
   bb.bb.bb.24/29 dev eth2  scope link  src bb.bb.bb.24
   cc.cc.cc.0/24 dev eth0  scope link
   127.0.0.0/8 dev lo  scope link
#per LARTC HOWTO 4.2.2 load balancing
   default
      nexthop via aa.aa.aa.1  dev eth1 weight 1
      nexthop via bb.bb.bb.25  dev eth2 weight 1

~~~~~~~~~~~~~~~Routing Rules~~~~~~~~~~~~~~~ip rule list:
       0:   from all lookup local
   32764:   from bb.bb.bb.24/29 lookup WAN2
   32765:   from aa.aa.aa.0/29 lookup WAN1
   32766:   from all lookup main
   32767:   from all lookup default

~~~~~~~~~~~~~~~~Routing Tables~~~~~~~~~~~~~ip route list table WAN*
#per LARTC HOWTO 4.2.1 Split access
  table WAN1:
   aa.aa.aa.0/29 dev eth1  scope link  src aa.aa.aa.0
   cc.cc.cc.0/24 dev eth0  scope link
   127.0.0.0/8 dev lo  scope link
  table WAN2:
   bb.bb.bb.24/29 dev eth2  scope link  src bb.bb.bb.24
   cc.cc.cc.0/24 dev eth0  scope link
   127.0.0.0/8 dev lo  scope link
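
One way to reason about the rule list and tables above is to replay the
lookup by hand.  The sketch below is a simplified model, not the
kernel's implementation: it walks the rules in priority order, skips
rules whose source selector does not match, and falls through to the
next rule when the selected table has no route to the destination.  The
addresses are assumed placeholders for the obscured ranges, and
lookup(), RULES and TABLES are hypothetical names.

```python
import ipaddress

# Placeholder prefixes (assumptions; the post obscures the real ranges):
#   aa.aa.aa -> 198.51.100, bb.bb.bb -> 203.0.113, cc.cc.cc -> 192.168.1
TABLES = {
    "WAN1": ["198.51.100.0/29", "192.168.1.0/24", "127.0.0.0/8"],
    "WAN2": ["203.0.113.24/29", "192.168.1.0/24", "127.0.0.0/8"],
    # 0.0.0.0/0 stands in for the multipath default route in table main.
    "main": ["198.51.100.0/29", "203.0.113.24/29", "192.168.1.0/24",
             "127.0.0.0/8", "0.0.0.0/0"],
}

# (priority, source selector, table), mirroring "ip rule list" above
# without the local and default tables.
RULES = [
    (32764, "203.0.113.24/29", "WAN2"),
    (32765, "198.51.100.0/29", "WAN1"),
    (32766, "0.0.0.0/0", "main"),
]

def lookup(src, dst):
    """Return (table, prefix) for the first matching rule whose table routes dst."""
    src_ip = ipaddress.ip_address(src)
    dst_ip = ipaddress.ip_address(dst)
    for _prio, selector, table in sorted(RULES):
        if src_ip not in ipaddress.ip_network(selector):
            continue  # rule does not apply to this source address
        matches = [ipaddress.ip_network(p) for p in TABLES[table]
                   if dst_ip in ipaddress.ip_network(p)]
        if matches:
            # Longest-prefix match inside the selected table.
            best = max(matches, key=lambda n: n.prefixlen)
            return table, str(best)
        # No route in this table: fall through to the next rule.
    return None, None

# 192.0.2.10 is an assumed stand-in for an outside host such as Alertra's.
print(lookup("203.0.113.27", "192.0.2.10"))
print(lookup("192.168.1.55", "192.0.2.10"))
```

In this model a packet sourced from bb.bb.bb.27 and bound for an outside
host matches the WAN2 rule, finds no route in table WAN2 as listed, and
ends up on the multipath default in main.  A reply that still carries the
internal source cc.cc.cc.55 when it is routed would likewise match only
the main rule.  LARTC 4.2.1 also adds a per-table default route
("ip route add default via $P1 table T1"); if the listing above is
complete, that route may be worth checking.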

~~~~~~~~~~~~~~~NAT Rules~~~~~~~~~~~~~~~~~~~iptables -t nat -L -n:
   Chain PREROUTING (policy ACCEPT)
   target prot opt source     destination
# Map all external addresses to internal servers
   DNAT   tcp  --  0.0.0.0/0  aa.aa.aa.2   multiport dports 25,1352 to:cc.cc.cc.12
   DNAT   tcp  --  0.0.0.0/0  bb.bb.bb.26  multiport dports 25,1352 to:cc.cc.cc.13
   DNAT   tcp  --  0.0.0.0/0  aa.aa.aa.3   multiport dports 80,443 to:cc.cc.cc.54
   DNAT   tcp  --  0.0.0.0/0  bb.bb.bb.27  multiport dports 80,443 to:cc.cc.cc.55
# Map any non-tcp stuff to the router/firewall (for testing; allow ping, etc.)
   DNAT  !tcp  --  0.0.0.0/0  aa.aa.aa.4   to:cc.cc.cc.11
   DNAT  !tcp  --  0.0.0.0/0  bb.bb.bb.28  to:cc.cc.cc.11

   Chain POSTROUTING (policy ACCEPT)
   target prot opt source     destination
   SNAT   all  --  0.0.0.0/0  0.0.0.0/0    to:aa.aa.aa.6
   SNAT   all  --  0.0.0.0/0  0.0.0.0/0    to:bb.bb.bb.30
# Allow internal users to access IIS website through SNAT/DNAT
   SNAT   tcp  --  cc.cc.cc.64/26  cc.cc.cc.54/31  multiport dports 80,443 to:cc.cc.cc.11

~~~~~~~~~~~~~Firewall Rules~~~~~~~~~~~~~~~~~iptables -L -n | colrm:
   Chain INPUT (policy DROP)
   target prot opt source     destination
   ACCEPT all  --  0.0.0.0/0  0.0.0.0/0
   DROP   all  --  0.0.0.0/0  0.0.0.0/0     state INVALID
   ACCEPT all  -f  0.0.0.0/0  0.0.0.0/0
   ACCEPT all  --  0.0.0.0/0  0.0.0.0/0     state NEW
   ACCEPT all  --  0.0.0.0/0  0.0.0.0/0     state RELATED,ESTABLISHED
# Gate traffic from external addresses to Apache server
   ACCEPT tcp  --  0.0.0.0/0  aa.aa.aa.4    multiport dports 80,443
   ACCEPT tcp  --  0.0.0.0/0  bb.bb.bb.28   multiport dports 80,443
# Allow ping tests (for now)
   ACCEPT icmp --  0.0.0.0/0  cc.cc.cc.11   icmp type 0
# Allow internal admins to connect to router/firewall via SSH
   ACCEPT tcp  --  cc.cc.cc.64/26 0.0.0.0/0 tcp dpt:22

   Chain FORWARD (policy DROP)
   target prot opt source     destination
   DROP   all  --  0.0.0.0/0  0.0.0.0/0       state INVALID
   ACCEPT all  -f  0.0.0.0/0  0.0.0.0/0
   ACCEPT all  --  0.0.0.0/0  0.0.0.0/0       state NEW
   ACCEPT all  --  0.0.0.0/0  0.0.0.0/0       state RELATED,ESTABLISHED
# Allow SMTP and Lotus Notes
   ACCEPT tcp  --  0.0.0.0/0  cc.cc.cc.12/31  multiport dports 25,1352
# Allow web site visitors
   ACCEPT tcp  --  0.0.0.0/0  cc.cc.cc.54/31  multiport dports 80,443

   Chain OUTPUT (policy ACCEPT)
   target prot opt source     destination
   ACCEPT all  --  0.0.0.0/0  0.0.0.0/0

To diagnose the problem, I set up a tcpdump monitor on all Ethernet
ports of my router, and then I did a SpotCheck on http://bb.bb.bb.27
(the unpublished IP address for http://www.deepwoods.com, so it's
easier to trace in logs).  Alertra's SpotCheck gave me a summary
(above) with three "OK" and three "ERROR" reports.

Then I took the tcpdump output and sorted it into six different packet
threads, based on Alertra's servers' addresses (too much to include here).

Let me show just one example of an "ERROR" packet sequence.  I have
interspersed my own comments with the tcpdump data:
#  Time       From                       To       (Packet info)
1      #An Alertra host initiates a SYN to our DSL Router...
   26.455171  g95-120.citenet.net.4545   w027.dsl.(myDSL).http
                     (S 847849283:847849283(0) win 5840
                       <mss 1460,sackOK,timestamp 543018938 0,nop,wscale 0>)
2      #...and that packet is NAT'd, to an internal web server
   26.455354  g95-120.citenet.net.4545   cc.cc.cc.55.http
                     (S 847849283:847849283(0) win 5840
                       <mss 1460,sackOK,timestamp 543018938 0,nop,wscale 0>)
3      #Our web server sends back SYN/ACK...
   26.455622  cc.cc.cc.55.http           g95-120.citenet.net.4545
                     (S 3655927565:3655927565(0) ack 847849284
                       win 64240 <mss 1460,nop,wscale 2,nop,nop,sackOK>)
4      #...and that packet is de-NAT'd on the way back out
   26.455723  w027.dsl.(myDSL).http     g95-120.citenet.net.4545
                     (S 3655927565:3655927565(0) ack 847849284 win 64240
                       <mss 1460,nop,wscale 2,nop,nop,sackOK>)
5      #Three seconds later, without responding to our SYN/ACK, the remote
       #host initiates another SYN sequence...
   29.449171  g95-120.citenet.net.4545  w027.dsl.(myDSL).http
                     (S 847849283:847849283(0) win 5840
                       <mss 1460,sackOK,timestamp 543019238 0,nop,wscale 0>)
6      #...that is NAT'd to our webserver
   29.449295  g95-120.citenet.net.4545  cc.cc.cc.55.http
                     (S 847849283:847849283(0) win 5840
                       <mss 1460,sackOK,timestamp 543019238 0,nop,wscale 0>)
7      #Our server responds ACK... (??? is this the source of the error ???)
   29.449547  cc.cc.cc.55.http          g95-120.citenet.net.4545
                     (. ack 1 win 64240)
8      #...which is de-NAT'd back out to Alertra's host
   29.449619  w027.dsl.(myDSL).http     g95-120.citenet.net.4545
                     (. ack 1 win 64240)
9      #Our web server responds to the renewed initial SYN at #5, above...
   29.455217  cc.cc.cc.55.http          g95-120.citenet.net.4545
                     (S 3655927565:3655927565(0) ack 847849284
                       win 64240 <mss 1460,nop,wscale 2,nop,nop,sackOK>)
10     #...which is de-NAT'd back out to Alertra's host
  29.455299  w027.dsl.(myDSL).http      g95-120.citenet.net.4545
                     (S 3655927565:3655927565(0) ack 847849284
                       win 64240 <mss 1460,nop,wscale 2,nop,nop,sackOK>)
11     #And Alertra's host, again, attempts to initiate a new connection
       #with a SYN
  35.44921   g95-120.citenet.net.4545   w027.dsl.(myDSL).http
                     (S 847849283:847849283(0) win 5840
                       <mss 1460,sackOK,timestamp 543019838 0,nop,wscale 0>)

---------|---------|---------|---------|---------|---------|---------|
This exact same pattern appears in the other two "ERROR" connections
(as reported in the spreadsheet).  The same "lost SYN/ACK".  I cannot
determine whether (perhaps) that SYN/ACK is not being sent out on the
DSL line, is not being accepted because of some error at the Alertra
host, or is getting "lost in the cloud."  However, I suspect it's some
obvious, glaring error on my part due to a void in my understanding.
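
The three inbound SYNs in the trace (steps 1, 5 and 11) carry the same
source port and the same sequence number, so they can be read as
retransmissions of one SYN by a client that never received the SYN/ACK.
The small sketch below only does the subtraction on the timestamps shown
above; SYNS and describe() are hypothetical names, and the values are
copied from the trace.

```python
# Inbound SYN packets from the trace above: (step, seconds, sequence number).
SYNS = [
    (1, 26.455171, 847849283),
    (5, 29.449171, 847849283),
    (11, 35.44921, 847849283),
]

def describe(syns):
    """Return (same_sequence, gaps); gaps are seconds between successive SYNs."""
    same_sequence = len({seq for _step, _t, seq in syns}) == 1
    times = [t for _step, t, _seq in syns]
    gaps = [round(later - earlier, 3) for earlier, later in zip(times, times[1:])]
    return same_sequence, gaps

same_sequence, gaps = describe(SYNS)
print("same sequence number on every SYN:", same_sequence)
print("gaps between SYNs (s):", gaps)
```

The gaps come out at roughly 3 and 6 seconds.  A doubling interval like
that is consistent with ordinary TCP retransmission backoff on the
remote side, which would fit the "lost SYN/ACK" reading, though it does
not by itself show where the SYN/ACK was lost.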

I have virtually ruled out Alertra's hosts as the problem (testing
with, say, http://www.google.com, works fine), and of the six hosts
they use, some report "OK" and some report "ERROR" every time...and
the hosts reporting "ERROR" change with each test.  It seems, sometimes,
as if my router will accept the first three and then fail the rest...
all on this apparent inability to reliably get the SYN/ACK back to a
connection originator.

I'm hoping that one of you has the experience to see a clear and
evident error or omission in my configuration (or perhaps knows of a
bug in one of the product versions I'm using that would warrant
updating) and can guide me to a solution to this plaguing problem.

Any help would be most appreciated.

--Carol Anne



