Re: nfs-utils - TCP ephemeral port exhaustion results in mount failures

On Sep 3, 2014, at 7:00 AM, Jeff Layton <jeff.layton@xxxxxxxxxxxxxxx> wrote:

> On Tue, 2 Sep 2014 12:51:06 -0400
> Chris Perl <chris.perl@xxxxxxxxx> wrote:
> 
>> I've noticed that mount.nfs calls bind (in `nfs_bind' in
>> support/nfs/rpc_socket.c) before ultimately calling connect when
>> trying to get a tcp connection to talk to the remote portmapper
>> service (called from `nfs_get_tcpclient' which is called from
>> `nfs_gp_get_rpcbclient').
>> 
>> Unfortunately, this means you need to find a local ephemeral port
>> that is not part of *any* existing TCP connection (i.e. you're
>> looking for a unique 2-tuple of (socket_type, local_port), where
>> socket_type is either SOCK_STREAM or SOCK_DGRAM, but in this case
>> specifically SOCK_STREAM).
>> 
>> If you were to just call connect without calling bind first, then
>> you'd only need to find a unique 5-tuple of (socket_type, local_ip,
>> local_port, remote_ip, remote_port).
>> 
>> The end result is that a misbehaving application that creates many
>> connections to some service, using up all ephemeral ports, can cause
>> attempts to mount remote NFS filesystems to fail with EADDRINUSE.
>> 
>> Don't get me wrong, I think we should fix our application (and we
>> are), but I don't see any reason why mount.nfs couldn't just call
>> connect without calling bind first (thereby allowing the bind to
>> happen implicitly), which would allow mount.nfs to continue to work
>> in this situation.
>> 
>> I think an example may help explain what I'm talking about.
>> 
>> Let's take a Linux machine running CentOS 6.5
>> (2.6.32-431.1.2.0.1.el6.x86_64) and restrict the number of available
>> ephemeral ports to just 10:
>> 
>> [cperl@localhost ~]$ cat /proc/sys/net/ipv4/ip_local_port_range
>> 60000   60009
>> 
>> Then create ten TCP connections to a remote service that will just
>> hold them open:
>> 
>> [cperl@localhost ~]$ for i in {0..9}; do socat -u tcp:192.168.1.12:9990 file:/dev/null & done
>> [1] 21578
>> [2] 21579
>> [3] 21580
>> [4] 21581
>> [5] 21582
>> [6] 21583
>> [7] 21584
>> [8] 21585
>> [9] 21586
>> [10] 21587
>> 
>> [cperl@localhost ~]$ netstat -n --tcp | awk '$6 ~ /ESTABLISHED/ && $5 ~ /:999[0-9]$/ {print $1, $4, $5}' | sort | column -t
>> tcp  192.168.1.11:60000  192.168.1.12:9990
>> tcp  192.168.1.11:60001  192.168.1.12:9990
>> tcp  192.168.1.11:60002  192.168.1.12:9990
>> tcp  192.168.1.11:60003  192.168.1.12:9990
>> tcp  192.168.1.11:60004  192.168.1.12:9990
>> tcp  192.168.1.11:60005  192.168.1.12:9990
>> tcp  192.168.1.11:60006  192.168.1.12:9990
>> tcp  192.168.1.11:60007  192.168.1.12:9990
>> tcp  192.168.1.11:60008  192.168.1.12:9990
>> tcp  192.168.1.11:60009  192.168.1.12:9990
>> 
>> And now try to mount an NFS export:
>> 
>> [cperl@localhost ~]$ sudo mount 192.168.1.100:/export/a /tmp/a
>> mount.nfs: Address already in use
>> 
>> As mentioned before, this is because bind is trying to find a unique
>> 2-tuple of (socket_type, local_port), which it cannot do. (Really I
>> believe it's the 3-tuple (socket_type, local_ip, local_port), but
>> calling bind with INADDR_ANY, as `nfs_bind' does, reduces it to the
>> 2-tuple.)
>> 
>> However, just calling connect allows local ephemeral ports to be
>> "reused" (i.e. it looks for the unique 5-tuple of (socket_type,
>> local_ip, local_port, remote_ip, remote_port)).
>> 
>> For example, notice how the local ephemeral ports 60003 and 60004 are
>> "reused" below (because socat is just calling connect, not bind,
>> although we can make socat call bind with an option if we want and see
>> it fail like mount.nfs did above):
>> 
>> [cperl@localhost ~]$ socat -u tcp:192.168.1.12:9991 file:/dev/null &
>> [11] 22433
>> [cperl@localhost ~]$ socat -u tcp:192.168.1.13:9990 file:/dev/null &
>> [12] 22499
>> [cperl@localhost ~]$ netstat -n --tcp | awk '$6 ~ /ESTABLISHED/ && $5 ~ /:999[0-9]$/ {print $1, $4, $5}' | sort | column -t
>> tcp  192.168.1.11:60000  192.168.1.12:9990
>> tcp  192.168.1.11:60001  192.168.1.12:9990
>> tcp  192.168.1.11:60002  192.168.1.12:9990
>> tcp  192.168.1.11:60003  192.168.1.12:9990
>> tcp  192.168.1.11:60003  192.168.1.12:9991
>> tcp  192.168.1.11:60004  192.168.1.12:9990
>> tcp  192.168.1.11:60004  192.168.1.13:9990
>> tcp  192.168.1.11:60005  192.168.1.12:9990
>> tcp  192.168.1.11:60006  192.168.1.12:9990
>> tcp  192.168.1.11:60007  192.168.1.12:9990
>> tcp  192.168.1.11:60008  192.168.1.12:9990
>> tcp  192.168.1.11:60009  192.168.1.12:9990
>> 
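The difference between the two lookups can be sketched as a toy model. This is only an illustration, assuming the simplified rules described above (bind needs a port that no existing connection uses, connect needs only a unique full tuple). It is not the kernel's port selection algorithm, and the function names are made up for this example. Port 111 on the NFS server stands in for the portmapper connection.

```python
# Toy model of local port selection. The rules are the simplified ones from
# the discussion above, not the kernel's actual algorithm.
EPHEMERAL_PORTS = range(60000, 60010)  # matches the ip_local_port_range example
LOCAL_IP = "192.168.1.11"


def pick_port_for_bind(connections):
    """bind() to INADDR_ANY with port 0: the port must be unused by every connection."""
    used = {local_port for (_lip, local_port, _rip, _rport) in connections}
    for port in EPHEMERAL_PORTS:
        if port not in used:
            return port
    return None  # in the real world the caller would see EADDRINUSE


def pick_port_for_connect(connections, remote_ip, remote_port):
    """connect() without bind(): only the full address tuple must be unique."""
    for port in EPHEMERAL_PORTS:
        if (LOCAL_IP, port, remote_ip, remote_port) not in connections:
            return port
    return None


# Ten established connections to 192.168.1.12:9990, one per ephemeral port.
connections = {(LOCAL_IP, p, "192.168.1.12", 9990) for p in EPHEMERAL_PORTS}

bind_port = pick_port_for_bind(connections)
connect_port = pick_port_for_connect(connections, "192.168.1.100", 111)
same_peer_port = pick_port_for_connect(connections, "192.168.1.12", 9990)
print(bind_port, connect_port, same_peer_port)
```

With all ten ports held by connections to 192.168.1.12:9990, the model finds no port for bind, still finds one for a connect to a different peer, and finds none for an eleventh connection to the same peer. The model always returns the lowest candidate; the real kernel does not necessarily do so.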
>> Is there any reason we couldn't modify `nfs_get_tcpclient' to not bind
>> in the case where it's not using a reserved port?
>> 
>> For some color, this is particularly annoying for me because I have
>> extensive automount maps and this failure leads to attempts to access
>> a given automounted path returning ENOENT.  Furthermore, automount
>> caches this failure and continues to return ENOENT for the duration of
>> whatever its negative cache timeout is.
>> 
>> For UDP, I don't think "bind before connect" matters as much.  I
>> believe the difference is just in the error you'll get from either
>> bind or connect (if all ephemeral ports are used).  If you attempt to
>> bind when all local ports are in use you seem to get EADDRINUSE,
>> whereas when you connect when all local ports are in use you get
>> EAGAIN.

There is only one place where mount.nfs uses connected UDP, which
is nfs_ca_sockname(). But UDP connected sockets are less of a
hazard because they lack a 120 second TIME_WAIT after they are
closed.
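
As a rough illustration of the connected-UDP pattern, here is a minimal Python sketch. It assumes the socket is connected only so that getsockname() can report which local address the kernel chose; it is not the nfs-utils code, and the helper names and the loopback peer are made up for the example.

```python
import socket


def local_address_toward(remote_ip, remote_port=2049):
    """Return the (ip, port) the kernel picks for a UDP socket connected to the peer."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        # connect() on a datagram socket sends no packet. It records the peer
        # and makes the kernel choose a local address and ephemeral port.
        sock.connect((remote_ip, remote_port))
        return sock.getsockname()


def rebind_after_close(local_ip, local_port):
    """Bind a fresh UDP socket to a port that was just released."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        # UDP has no TIME_WAIT state, so the port is free as soon as the
        # previous socket is closed (barring another process grabbing it).
        sock.bind((local_ip, local_port))
    return True


ip, port = local_address_toward("127.0.0.1")
reusable = rebind_after_close(ip, port)
print(ip, port, reusable)
```

The second helper shows the point about TIME_WAIT: the port used by the closed UDP socket can be bound again right away.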

>> It could be I'm missing something totally obvious for why this is.  If
>> so, please let me know!

The reason is I didn’t realize you could call connect(2) without
calling bind(2) first on STREAM sockets.
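
For reference, the two call sequences look roughly like this in Python. This is a sketch against a throwaway loopback listener, not a patch for nfs-utils; `connect_tcp` and its `bind_first` parameter are hypothetical names used only here.

```python
import socket


def connect_tcp(remote, bind_first):
    """Open a TCP connection, optionally calling bind() before connect()."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        if bind_first:
            # Explicit bind to INADDR_ANY, port 0: the kernel must find a
            # local port before it knows where the socket will connect.
            sock.bind(("", 0))
        # Without a prior bind(), connect() assigns the local address and
        # port implicitly.
        sock.connect(remote)
    except OSError:
        sock.close()
        raise
    return sock


# Throwaway listener on loopback so the example needs no external service.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(5)
remote = listener.getsockname()

with connect_tcp(remote, bind_first=True) as bound, \
        connect_tcp(remote, bind_first=False) as unbound:
    bound_local = bound.getsockname()
    unbound_local = unbound.getsockname()
    bound_peer = bound.getpeername()
    unbound_peer = unbound.getpeername()
listener.close()

print(bound_local, unbound_local)
```

Both variants end up with a connected socket and a kernel-chosen local port. The only difference is when the port is chosen, which is what matters once the ephemeral range is nearly exhausted.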

> (cc'ing Chuck since he wrote a lot of that code)
> 
> I'm not sure either. If there was a reason for that, it's likely lost
> to antiquity. In some cases, we really are expected to use reserved
> ports and I think you do have to bind() in order to get one. In the
> non-reserved case though it's likely we could skip binding altogether.
> 
> What would probably be best is to roll up a patch that changes it, and
> propose it on the list.

I’d like to see a prototype, too.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com


