Hi, Eliezer,

On Tue, Sep 10, 2013 at 1:49 AM, Eliezer Croitoru <eliezer@xxxxxxxxxxxx> wrote:
> Hey Nickolai,
>
> I would try to make sense of what you have seen.
> The tproxy is a very complex feature which by the kernel cannot bind
> double src(ip:port) + dst(ip:port)..
> like let say for example the 10.100.1.100 client tries to connect
> 2.3.4.5 at port 80.
> the client tries once for:
> 10.100.1.100:5455 to 2.3.4.5:80
> then let say the client doesn't have the right route and there is a
> network problem then the client tries again from:
> 10.100.1.100:5456 to 2.3.4.5:80
> the above client have an issue with the network and the proxy knows that..
> the proxy is transparent and needs to re-intercept the same request
> twice.. and when the first connection was timedout from the kernel level
> then application can drop the connection and do not continue parsing the
> request.

The problem I'm facing is not related to the user-to-proxy connection at all.
With a proper network setup that part works flawlessly. It's the
proxy-to-server connection, where squid binds to an IP without specifying a
port and thus leaves the kernel to choose one.

> the kernel can bind the ip:port of the src to the dst if it knows that
> all 80 port traffic is using only the traffic as a route.
> in a case this is not the case the client will have troubles and hence a
> binding of ip:port to ip:port from the network layer will be a disaster
> for couple layers..

Yeah! ip:port pairs have to be unique :-)

> SO the kernel manages what the bind will be like..
> I dont see how a tproxy enabled system for more then 10,000 cilents can
> reach a critical level of commbind unless the cpu and all the lower
> levels of the kernel will not be able to handle this level of traffic.

It's not about the number of users, but the number of simultaneous live
connections from the cache server. Keep in mind that "idle" HTTP connections
are still "live" TCP streams.
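
To make "bind to an IP without specifying a port" concrete, here is a minimal Python sketch. The helper name kernel_chosen_port and the loopback default are my own assumptions for illustration; squid's actual code is not shown here. Binding with port 0 asks the kernel to pick a local port, which can then be read back with getsockname():

```python
import socket

def kernel_chosen_port(ip="127.0.0.1"):
    """Bind to `ip` with port 0 and return the port the kernel selected.

    Hypothetical helper for illustration only.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Port 0 means "let the kernel auto-select a local port".
        sock.bind((ip, 0))
        return sock.getsockname()[1]
    finally:
        sock.close()
```

The port returned comes from the kernel's auto-selection, which on Linux normally draws from ip_local_port_range.
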
> if it's the range thing from the kernel it can be reproduced in a matter
> of seconds by lowering it..

Exactly. Try something like

    echo 32768 32867 > /proc/sys/net/ipv4/ip_local_port_range

and you'll start getting EADDRINUSE on the 101st parallel outbound
connection of squid.

> This limit is not a rule for the application but it limits the kernel to
> what local-ip:port bind when the source machine is the local machine.
> this doesn't force the kernel to handle lower amount of connections but
> allows the kernel to do less lookup when trying to find a free ip:port
> socket to bind to the new connection.
>
> it seems to me like you are using connection tracking on a tproxy system
> that doesn't need to do connection tracking at all in this kind of scale..
> There is no reason for a tproxy system to keep track on connections of
> the client for more then 5-10 minutes tops..
>
> try to look more into the connection tracking rather then the basic
> kernel lands..

Nope. The problem has nothing to do with TPROXY or connection tracking. It's
in the kernel's port auto-selection algorithm, which limits the number of
live auto-selected ports to the size of the range, i.e.
ip_local_port_range.max - ip_local_port_range.min + 1, in total across all
local IPs rather than per IP. Here's some pseudocode to reproduce it, even
with local addresses assigned to the host:

===[cut]===
$broken = true; // true: ask the kernel to select the port (bind to port 0)

$port_min = ip_local_port_range.min;
$port_max = ip_local_port_range.max;

$ips_to_test_with = ['aaa.aaa.aaa.aaa', 'bbb.bbb.bbb.bbb'];

function socket_setup($ip, $port) {
    $socket = new socket(AF_INET, SOCK_STREAM, SOL_TCP);
    $socket.set_option(SOL_SOCKET, SO_REUSEADDR, 1);
    // needed only if $ips_to_test_with are not assigned to the host
    $socket.set_option(SOL_IP, IP_TRANSPARENT, 1);
    $socket.bind($ip, $port);
    // listen() is easier and faster for testing; we just have to keep
    // this socket occupied in the kernel somehow. In real life it
    // would be a $socket.connect().
    $socket.listen();
    return $socket;
}

for ($port = $port_min; $port <= $port_max; $port++) {
    foreach ($ips_to_test_with as $ip) {
        if ($broken) {
            // will produce an exception when
            // $port = $port_min + floor(($port_max - $port_min + 1) / count($ips_to_test_with))
            socket_setup($ip, 0);
        } else {
            // will assign all the ports
            socket_setup($ip, $port);
        }
    }
}
===[cut]===

That's it. Do

    echo 32768 32867 > /proc/sys/net/ipv4/ip_local_port_range

and try it, once with $broken = true and then again with $broken = false.

With $broken = true you'll get EADDRINUSE on the 51st port assignment for IP
address aaa.aaa.aaa.aaa. With $broken = false you'll get both
aaa.aaa.aaa.aaa and bbb.bbb.bbb.bbb listening on 100 ports each and no error.

Hope it's clearer this time.

Best,
Niki
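
For anyone who would rather run the test than translate the pseudocode, here is a rough Python sketch of the same experiment. The function name hold_ports and its parameters are hypothetical, chosen for illustration. It assumes the test IPs are already assigned to the host, so IP_TRANSPARENT is left out (setting it needs extra privileges). Whether and when the auto-selected run fails depends on the kernel in use; the sketch only reports what happens.

```python
import socket

def hold_ports(ips, attempts, auto=True, first_port=32768):
    """Bind and listen on `attempts` ports for every IP in `ips`.

    auto=True  -> bind to port 0 and let the kernel pick the port
    auto=False -> bind to explicit ports first_port, first_port + 1, ...

    Returns (held_sockets, error); error is None if every bind worked,
    otherwise the OSError (e.g. EADDRINUSE) that stopped the run.
    Hypothetical helper; IP_TRANSPARENT is intentionally not set.
    """
    held = []
    for i in range(attempts):
        for ip in ips:
            sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            try:
                sock.bind((ip, 0 if auto else first_port + i))
                # listen() keeps the port occupied, standing in for connect()
                sock.listen(1)
            except OSError as exc:
                sock.close()
                return held, exc
            held.append(sock)
    return held, None
```

To repeat the experiment from this mail, narrow ip_local_port_range as shown above, then call hold_ports(ips, 100, auto=True) and hold_ports(ips, 100, auto=False) in separate runs with two addresses assigned to the host, and compare how many sockets each run manages to hold and which error, if any, comes back.
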