Re: [LARTC] Multiple Internet Connection, established connection dropping issues

Julian Anastasov <ja@xxxxxx> · Fri, 24 May 2002 15:47:56 +0000 (GMT)



	Hello,

On 23 May 2002, William L. Thomson Jr. wrote:

> I do not believe at the moment they apply to the problem I am having. If
> they are I will go through the process of applying the patch. They

	But the problems you have are solved only by these patches, read below...

> mostly have to due with what happens when a connection dies. At the
> moment both of my connections are live, and I am having other problems.

> At the moment you can establish a connection using either line without a
> problem. After a little while the connection will be dropped as the
> linux router attempts to use a different interface. The connection will

	With a plain kernel this can be true only if you are talking
about masqueraded connections.

> only be re-connected after the new route has expired in the route's
> cache. Then the connection can continue on the same interface it was
> originally established under for a period of time until the whole thing
> repeats itself.

	Not strictly on the same interface. Note that:

- the routing functionality is dumb: you request a route with
specific keys and the result is a resolved path (outdev, defsrc, gw...)
that is cached for further use.

- in the plain kernel only the sockets correctly provide the same
keys when resolving a route, for example:

	- after connect the socket is "connected", i.e. saddr and
	daddr are known and they are always provided when resolving
	the route

	- the user can optionally bind the socket to an interface;
	this is the only thing that makes a connection use only one
	interface (in your words)

	- after routing cache expiration (of any kind: caused by the
	user, by timer expiration or by the automatic flush on route
	change) the sockets resolve the route again, providing
	the keys saved for each connection.

	So, the connected socket should not experience any outage
	when the route is resolved after cache entry expiration,
	assuming the routing rules do not change.

	The result is that if the routing rules allow using many
	"internet connections" for the particular "keys", your
	connections can use many links/interfaces and the remote
	end will not notice it. This is allowed by the routing
	rules the user defines. So, don't be surprised by the
	routing rules in Christoph's nano howto. You must define
	correct routing rules for all networks you are using. Some
	of them use multipath routes because the specified subnet
	is allowed to use multiple gateways (a trick used mostly
	for NAT, because you can use "internet connections" with
	distinct IP ranges in the same multipath route). Note that
	if some ISPs allow spoofing from your side, then you can
	add all such ISPs as alternatives in the routes for the
	other IP ranges that do not have this luck.

- what the plain kernel does not guarantee is that the masquerading
software uses the same or a similar way of resolving routes. I.e. the
masq software assumes the route subsystem will always keep the
cache valid and will never forget the cached routes during the
life of the masq connection. If you reduce the gc_* values you can
notice that this assumption fails very badly with low values; only
the default ones (5 minutes or so) let the masq software pass as a
"correct user" of the (multipath) routing, because the conns usually
don't last that long. This is where the patches help: they fix this
behavior by making the masq software aware of the "routing cache
expiration" problem and by letting it provide the correct routing
keys, at least the interesting ones. The masq conns are not strictly
bound to an interface, which is the same default behavior as for the
connected socket. Why not use a different interface when the routing
rules allow it?
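
The points above can be illustrated with a toy model. This is only a
minimal Python sketch, not kernel code: the class ToyRouter, the
round-robin choice between gateways, the gateway names and the example
addresses are all assumptions made for the illustration. It shows a
socket-like user that always supplies its saved keys, next to a
masq-like user that looks up the route with keys that match only the
multipath route.

```python
from itertools import cycle


class ToyRouter:
    """Toy model of keyed route resolution with a cache (hypothetical)."""

    def __init__(self, src_rules, multipath):
        self.src_rules = dict(src_rules)    # public source IP -> gateway
        self._next_hop = cycle(multipath)   # assumption: plain round-robin
        self.cache = {}                     # (saddr, daddr) -> resolved gateway

    def resolve(self, saddr, daddr):
        key = (saddr, daddr)
        if key not in self.cache:
            if saddr in self.src_rules:
                # A source-based rule matches: the result is always the same.
                self.cache[key] = self.src_rules[saddr]
            else:
                # Only the multipath route matches: any alternative may be picked.
                self.cache[key] = next(self._next_hop)
        return self.cache[key]

    def flush(self):
        """Forget all cached paths (timer expiration, user flush, route change)."""
        self.cache.clear()


router = ToyRouter(
    src_rules={"192.0.2.10": "gw_isp1", "198.51.100.10": "gw_isp2"},
    multipath=["gw_isp1", "gw_isp2"],
)
remote = "203.0.113.5"

# Socket-like user: saddr and daddr are saved and supplied on every lookup.
socket_before = router.resolve("192.0.2.10", remote)
# Masq-like user: looks up with the internal source, so only multipath matches.
masq_before = router.resolve("10.0.0.7", remote)

router.flush()

socket_after = router.resolve("192.0.2.10", remote)   # same gateway as before
masq_after = router.resolve("10.0.0.7", remote)       # may be another gateway

print(socket_before, socket_after)
print(masq_before, masq_after)
```

In this model the socket-like user gets the same gateway before and
after the flush, while the masq-like user can end up on the other
gateway although its connection is already masqueraded with the
address of the first one.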

> What I would like to happen is if a connection from the outside is
> established, the linux router will continue to keep the connection on
> the same interface until the user drops the connection. At which time if

	Then use the patches. You can't control the input interface
or, more exactly, the input "internet connection" (at least not
without using DNS). What eventually breaks in an unpatched kernel is
the in->out traffic of these connections established from remote
hosts (I mean the NAT ones).

> the connection is re-established by the user the linux router then could
> use a different interface but it should only use the interface that the
> connection was established under.

	That is a problem in the routing rules or in the NAT code of the
unpatched kernel.

> I have made some progress playing around with the gc_interval gc_timeout
> and gc_elasticity.

	I assume you made progress with the cache expiration, but
that should make things worse for the stability of the NAT conns
in the unpatched kernel.

> I also started looking at the code behind the routing and gc logic in
> the route.c file in the linux kernel source. It appears that the gc does
> not compare which routes to keep or which routes to drop. It just does
> sort of a first come first out sweep.

	Forget the routing cache; at most, tune it for fast expiration
according to your needs for fast failover, but nothing more. The NAT
code and the sockets should not care about the routing cache and its
expiration. They can keep a pointer to the cache entry and check
that the route is still "valid". Or they can simply resolve the route
again, whenever needed, by supplying the correct keys.

> So aside from the route caching issue I really want to know why in the
> middle of a connection the router will stop using one interface and
> begin using another.

	Maybe because the route is not resolved with the right keys.
Is that a NAT conn?

> So each interface has a table showing which gateway to use. Those tables
> have a higher priority and should be checked/used before the multipath
> table. As per examples in both
>
> http://www.linuxvirtualserver.org/~julian/nano.txt
> http://www.linuxvirtualserver.org/~julian/dgd-usage.txt
>
> Each of those examples do things a little differently so that alone is
> fairly confusing.

	Hm, we still don't know your routes, maybe they are a
"little different" from the expected ones that should work :)
What is so different in your setup? These routing rules look
valid, they work. Placing the routing rules for directly
connected networks before all routes via gateways is
a recommendation that works in 90% of the cases. It allows
the setup to rely on rp_filter, etc. You should be able to define
the routing settings by relying on the following rules:

- "each interface has a table showing which gateway to use" does
not mean anything. You can attach different IPs from distinct ISPs'
IP ranges to the same interface.

- build your routing rules on the "subnet->subnet" assumption,
i.e. how traffic from one subnet to another one should be routed.
Set rp_filter to keep the traffic symmetric (sometimes other
requirements do not allow using rp_filter).

- provide "from all to subnet" routes to allow the source address
autoselection mechanism to work when an unbound socket talks to
the remote subnet

- always define the preferred source IP in the routes and don't rely
too much on the src IP autoselection mechanism to always select
the right local IP address. The multipath routes used from NAT-ed
subnets usually ignore the preferred source IP (usually in the
patched kernel) because there is a 2nd stage where the src IP
is selected after using the gateway from step 1.
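
As an illustration of these rules, here is a minimal Python sketch
that only builds iproute2 command strings and never runs them. The
function name build_policy_routing, the table numbers, the device
names and the documentation-range addresses are assumptions for the
example, not values taken from your setup. It also assumes that the
default route has been removed from the main table, so that main
holds only the directly connected networks.

```python
def build_policy_routing(uplinks, multipath_table=222):
    """Return a list of 'ip' commands for a multi-uplink setup (hypothetical)."""
    cmds = []
    # Directly connected networks first: consult main before any gateway table.
    cmds.append("ip rule add priority 50 table main")
    hops = []
    for up in uplinks:
        table = up["table"]
        # Traffic that already carries this uplink's address uses its gateway.
        cmds.append(
            f"ip rule add priority {table} from {up['net']} table {table}"
        )
        # Always state the preferred source IP explicitly.
        cmds.append(
            f"ip route add default via {up['gw']} dev {up['dev']} "
            f"src {up['ip']} table {table}"
        )
        hops.append(f"nexthop via {up['gw']} dev {up['dev']} weight 1")
    # Everything else (e.g. NAT-ed subnets) may use any of the gateways.
    cmds.append(f"ip rule add priority {multipath_table} table {multipath_table}")
    cmds.append(f"ip route add default table {multipath_table} " + " ".join(hops))
    return cmds


example_uplinks = [
    {"dev": "eth1", "net": "192.0.2.0/24", "ip": "192.0.2.10",
     "gw": "192.0.2.1", "table": 201},
    {"dev": "eth2", "net": "198.51.100.0/24", "ip": "198.51.100.10",
     "gw": "198.51.100.1", "table": 202},
]
commands = build_policy_routing(example_uplinks)
for cmd in commands:
    print(cmd)
```

The order of the priorities mirrors the recommendation above: connected
networks, then one table per source range, then the multipath table.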

> Bottom line I am stuck at the moment with it dropping and
> re-establishing connections no matter what I do. I have been able to
> decrease the time it takes to re-establish, but I believe that also
> effects the time the connection will last for before needing to be
> re-established.

	The reduced gc timeouts lead to more reroutings.

> On the other hand, if a request is sent from inside to the outer world
> without having a connection from the outside world previously
> established it then could use the multipath route.

	Yes, the multipath route is used if the route with the
supplied keys is not already cached.

> I would assume the routing tables, rules, and their priorities would
> effect this but it does not seem to do the job.

	With the patches applied you can notice that the multipath
route is used only when establishing the NAT conn. After that it is
not used, because the conn is already bound to a public IP and we no
longer have the right to use the multipath route. I.e. the multipath
route is used only for outdev/gateway selection, not for selecting
the src IP. The right src IP for this gateway is selected later. Note
that if NAT is implemented this way, we can optimize the path to use
only one route lookup for the SNAT-ed traffic, because the
established NAT conn knows that it is bound to a specific src IP.

> I also tried to turn off rp_filter, just to see if it could come in one
> interface and go out another. Once again no noticeable difference. I did
> not try setting the rp_filter to 2, for increased filtering.

	2 is not different from 1 in current kernels.

> Has anybody else experienced this or know of a solution. My gut is
> telling me that the logic in the route cache gc might need to be
> improved, but I am not sure that is really it or not. Knowing my luck
> the solution is right under my nose and I am missing it.

	The problems are complex, maybe some of them can be solved
with parameter tuning.

> FYI,
> 	I am currently using kernel 2.2.19 that comes as part of the old Linux
> Router Project version 2.9.8. I have made attempts to compile my own
> kernel, but have not had any luck booting off of it due to the use of a
> ram disc as the root file system. Basically I wanted to replace the
> Linux Router Kernel with mine, but seem to run into problems mounting
> the root file system. Errors and problems for another list. If the
> version of my kernel is the problem or if I need Julian's patches that

	I can only recommend that you try them. Of the listed problems,
the one with the inaccurate rerouting can be solved. The problems
with faster failover are independent. You can try another approach:
when your healthchecking script detects an "internet connection"
failure you can just flush the route cache and be happy. Then the
established connections that can select another alive path will use
it. The connections that have only one alternative (for example,
because there is only one valid path for the selected IP range) will
block or will be dropped by your router through some IP stack
mechanism (ICMP Dest unreachable or so). The forced cache flush
prevents the new conns from using cached paths between the local
and the remote hosts, so all the new conns will select an alive path.
With the help of the socket and the NAT code these paths will
be used correctly and no new/unexpected "interfaces" will be
selected.
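
Such a healthchecking script could look roughly like the following
Python sketch. The function names and the choice to ping the gateway
are assumptions for the example; a ping to the gateway does not prove
that the path behind it works, so a real check may need another probe.
The only routing command used is "ip route flush cache".

```python
import subprocess


def gateway_alive(gw):
    """Send one ICMP echo with a 2 second timeout (Linux iputils ping)."""
    result = subprocess.call(
        ["ping", "-c", "1", "-W", "2", gw],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result == 0


def flush_route_cache():
    """Drop all cached paths so that they are resolved again."""
    subprocess.call(["ip", "route", "flush", "cache"])


def check_once(gateways, last_state, probe=gateway_alive, flush=flush_route_cache):
    """Probe every gateway once; flush the cache if one of them went down."""
    state = {gw: probe(gw) for gw in gateways}
    # A gateway that was never seen before is assumed to have been alive.
    went_down = [gw for gw in gateways if last_state.get(gw, True) and not state[gw]]
    if went_down:
        flush()
    return state
```

The caller would run check_once() periodically (for example every few
seconds) and pass the returned state to the next call, so the cache is
flushed once per detected failure and not on every round.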

> is the route I will take. Also if it matters I am using a 75mhz pentium
> with 24mb ram. Using a 16 mb ram disk, with 8mb left over to use for
> memory. It seems to screem, but my gc may be delayed by the processor
> not being able to keep up?

	Good, but what does the traffic look like? KBits, maybe MBits?
How many entries do you see in the routing cache?
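
To count the entries, a small Python sketch like the following may
help. It assumes that the kernel exposes the cache as a text file at
/proc/net/rt_cache with one header line followed by one line per
entry, as the 2.2 and 2.4 kernels do; the function names are made up
for the example.

```python
def count_rt_cache_entries(text):
    """Count cache entries in rt_cache-style text (first line is the header)."""
    lines = [line for line in text.splitlines() if line.strip()]
    return max(len(lines) - 1, 0)


def read_rt_cache_count(path="/proc/net/rt_cache"):
    """Read the file and count its entries; the path is an assumption."""
    with open(path) as cache_file:
        return count_rt_cache_entries(cache_file.read())
```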

Regards

--
Julian Anastasov <ja@xxxxxx>

Linux Advanced Routing and Traffic Control