On Sep 3, 2014, at 7:00 AM, Jeff Layton <jeff.layton@xxxxxxxxxxxxxxx> wrote:

> On Tue, 2 Sep 2014 12:51:06 -0400
> Chris Perl <chris.perl@xxxxxxxxx> wrote:
>
>> I've noticed that mount.nfs calls bind (in `nfs_bind' in
>> support/nfs/rpc_socket.c) before ultimately calling connect when
>> trying to get a TCP connection to talk to the remote portmapper
>> service (called from `nfs_get_tcpclient' which is called from
>> `nfs_gp_get_rpcbclient').
>>
>> Unfortunately, this means you need to find a local ephemeral port such
>> that said ephemeral port is not a part of *any* existing TCP
>> connection (i.e. you're looking for a unique 2 tuple of (socket_type,
>> local_port) where socket_type is either SOCK_STREAM or SOCK_DGRAM, but
>> in this case specifically SOCK_STREAM).
>>
>> If you were to just call connect without calling bind first, then
>> you'd need to find a unique 5 tuple of (socket_type, local_ip,
>> local_port, remote_ip, remote_port).
>>
>> The end result is that a misbehaving application that creates many
>> connections to some service, using all ephemeral ports, can cause
>> attempts to mount remote NFS filesystems to fail with EADDRINUSE.
>>
>> Don't get me wrong, I think we should fix our application (and we
>> are), but I don't see any reason why mount.nfs couldn't just call
>> connect without calling bind first (thereby allowing the bind to
>> happen implicitly), which would let mount.nfs continue to work in
>> this situation.
>>
>> I think an example may help explain what I'm talking about.
>>
>> Let's take a Linux machine running CentOS 6.5
>> (2.6.32-431.1.2.0.1.el6.x86_64) and restrict the number of available
>> ephemeral ports to just 10:
>>
>> [cperl@localhost ~]$ cat /proc/sys/net/ipv4/ip_local_port_range
>> 60000 60009
>>
>> Then create ten TCP connections to a remote service which will just
>> hold them open:
>>
>> [cperl@localhost ~]$ for i in {0..9}; do socat -u tcp:192.168.1.12:9990 file:/dev/null & done
>> [1] 21578
>> [2] 21579
>> [3] 21580
>> [4] 21581
>> [5] 21582
>> [6] 21583
>> [7] 21584
>> [8] 21585
>> [9] 21586
>> [10] 21587
>>
>> [cperl@localhost ~]$ netstat -n --tcp | awk '$6 ~ /ESTABLISHED/ && $5 ~/:999[0-9]$/ {print $1, $4, $5}' | sort | column -t
>> tcp  192.168.1.11:60000  192.168.1.12:9990
>> tcp  192.168.1.11:60001  192.168.1.12:9990
>> tcp  192.168.1.11:60002  192.168.1.12:9990
>> tcp  192.168.1.11:60003  192.168.1.12:9990
>> tcp  192.168.1.11:60004  192.168.1.12:9990
>> tcp  192.168.1.11:60005  192.168.1.12:9990
>> tcp  192.168.1.11:60006  192.168.1.12:9990
>> tcp  192.168.1.11:60007  192.168.1.12:9990
>> tcp  192.168.1.11:60008  192.168.1.12:9990
>> tcp  192.168.1.11:60009  192.168.1.12:9990
>>
>> And now try to mount an NFS export:
>>
>> [cperl@localhost ~]$ sudo mount 192.168.1.100:/export/a /tmp/a
>> mount.nfs: Address already in use
>>
>> As mentioned before, this is because bind is trying to find a unique 2
>> tuple of (socket_type, local_port) (really I believe it's the 3 tuple
>> (socket_type, local_ip, local_port), but calling bind with INADDR_ANY
>> as `nfs_bind' does reduces it to the 2 tuple), which it cannot do.
>>
>> However, just calling connect allows local ephemeral ports to be
>> "reused" (i.e. it looks for the unique 5 tuple of (socket_type,
>> local_ip, local_port, remote_ip, remote_port)).
>>
>> For example, notice how the local ephemeral ports 60003 and 60004 are
>> "reused" below (because socat is just calling connect, not bind,
>> although we can make socat call bind with an option if we want and see
>> it fail like mount.nfs did above):
>>
>> [cperl@localhost ~]$ socat -u tcp:192.168.1.12:9991 file:/dev/null &
>> [11] 22433
>> [cperl@localhost ~]$ socat -u tcp:192.168.1.13:9990 file:/dev/null &
>> [12] 22499
>> [cperl@localhost ~]$ netstat -n --tcp | awk '$6 ~ /ESTABLISHED/ && $5 ~/:999[0-9]$/ {print $1, $4, $5}' | sort | column -t
>> tcp  192.168.0.11:60000  192.168.1.12:9990
>> tcp  192.168.0.11:60001  192.168.1.12:9990
>> tcp  192.168.0.11:60002  192.168.1.12:9990
>> tcp  192.168.0.11:60003  192.168.1.12:9990
>> tcp  192.168.0.11:60003  192.168.1.12:9991
>> tcp  192.168.0.11:60004  192.168.1.12:9990
>> tcp  192.168.0.11:60004  192.168.1.13:9990
>> tcp  192.168.0.11:60005  192.168.1.12:9990
>> tcp  192.168.0.11:60006  192.168.1.12:9990
>> tcp  192.168.0.11:60007  192.168.1.12:9990
>> tcp  192.168.0.11:60008  192.168.1.12:9990
>> tcp  192.168.0.11:60009  192.168.1.12:9990
>>
>> Is there any reason we couldn't modify `nfs_get_tcpclient' to not bind
>> in the case where it's not using a reserved port?
>>
>> For some color, this is particularly annoying for me because I have
>> extensive automount maps and this failure leads to attempts to access
>> a given automounted path returning ENOENT. Furthermore, automount
>> caches this failure and continues to return ENOENT for the duration of
>> whatever its negative cache timeout is.
>>
>> For UDP, I don't think "bind before connect" matters as much. I
>> believe the difference is just in the error you'll get from either
>> bind or connect (if all ephemeral ports are used). If you attempt to
>> bind when all local ports are in use you seem to get EADDRINUSE,
>> whereas when you connect when all local ports are in use you get
>> EAGAIN.

There is only one place where mount.nfs uses connected UDP, which is
nfs_ca_sockname().
But UDP connected sockets are less of a hazard because they lack a 120 second
TIME_WAIT after they are closed.

>> It could be I'm missing something totally obvious for why this is. If
>> so, please let me know!

The reason is I didn’t realize you could call connect(2) without calling
bind(2) first on STREAM sockets.

> (cc'ing Chuck since he wrote a lot of that code)
>
> I'm not sure either. If there was a reason for that, it's likely lost
> to antiquity. In some cases, we really are expected to use reserved
> ports and I think you do have to bind() in order to get one. In the
> non-reserved case though it's likely we could skip binding altogether.
>
> What would probably be best is to roll up a patch that changes it, and
> propose it on the list.

I’d like to see a prototype, too.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
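
The bind-before-connect difference discussed in this thread can be poked at
without nfs-utils. What follows is only a minimal Python sketch, not the
mount.nfs code and not a proposed patch. It assumes Linux-style socket
semantics and uses a throwaway loopback listener in place of the remote
service. The names connect_only, bind_then_connect and demo are hypothetical
and exist only for this illustration.

```python
import errno
import socket


def connect_only(remote):
    """Connect without bind(); the kernel picks the local port in connect()."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.connect(remote)
    return s


def bind_then_connect(remote, local_port=0):
    """Bind to INADDR_ANY first (port 0 means any free port), then connect."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind(("0.0.0.0", local_port))
        s.connect(remote)
    except OSError:
        s.close()
        raise
    return s


def demo():
    # Throwaway loopback listener standing in for the remote service
    # (an assumption of this sketch, not part of the original report).
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(8)
    remote = listener.getsockname()

    first = connect_only(remote)
    used_port = first.getsockname()[1]

    # An explicit bind() to a local port that already belongs to a
    # connection is refused before any remote endpoint is considered.
    try:
        bind_then_connect(remote, used_port).close()
        bind_error = None
    except OSError as exc:
        bind_error = exc.errno

    # connect() without bind() still succeeds; the kernel chooses a port
    # that keeps the full connection tuple unique.
    second = connect_only(remote)
    second_port = second.getsockname()[1]

    for s in (first, second, listener):
        s.close()
    return used_port, bind_error, second_port


if __name__ == "__main__":
    used, err, other = demo()
    print("first connection used local port", used)
    print("bind() to that port failed with", errno.errorcode.get(err, err))
    print("second connect-only socket used local port", other)
```

The sketch does not exhaust the ephemeral range the way the socat example
does. It only shows that an explicit bind() is checked against the local port
alone, while a bare connect() leaves the port choice to the kernel.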