Hello,

This is a follow-up to my last email. The fact that AutoFS now probes
the proximity of single servers exposed an interesting problem.

One way to reproduce it is to set up a map with 10 or more direct
mounts. These volumes must all be hosted on the same NFS server. So
let's say we have /autofs-race/dir{1,2,3,4...} and each volume has a
file named 'file'. If you trigger the mount of the 10 volumes
simultaneously, you'll notice that some of them fail to mount:

    n43:~ # for i in $(seq 1 10); do stat /autofs-race/dir$i/file > /dev/null & done
    (.. shell prints 10 pids that are running in background ..)
    n43:~ # stat: cannot stat ‘/autofs-race/dir4/file’: No such file or directory
    stat: cannot stat ‘/autofs-race/dir10/file’: No such file or directory

    [1]   Done                    stat /autofs-race/dir$i/file > /dev/null
    [2]   Done                    stat /autofs-race/dir$i/file > /dev/null
    [3]   Done                    stat /autofs-race/dir$i/file > /dev/null
    [4]   Exit 1                  stat /autofs-race/dir$i/file > /dev/null
    [5]   Done                    stat /autofs-race/dir$i/file > /dev/null
    [6]   Done                    stat /autofs-race/dir$i/file > /dev/null
    [7]   Done                    stat /autofs-race/dir$i/file > /dev/null
    [8]   Done                    stat /autofs-race/dir$i/file > /dev/null
    [9]-  Done                    stat /autofs-race/dir$i/file > /dev/null
    [10]+ Exit 1                  stat /autofs-race/dir$i/file > /dev/null

Here it failed to mount /autofs-race/dir4 and /autofs-race/dir10.

I've investigated this and discovered that:

* The problem happens because prune_host_list() removes the only host
  from the host list. get_nfs_info() succeeds for some protocols but
  eventually receives an ETIMEDOUT and reports that the host doesn't
  support any protocol, hence get_vers_and_cost() fails.
* It only happens when the RPC clients are created by clnt_dg_create()
  or clnt_vc_create() (the default when libtirpc is used). If I keep
  building with libtirpc and change only these calls to
  clntudp_bufcreate() and clnttcp_create() respectively, the problem
  doesn't happen.
* The transport protocol in the conn_info structure seems to change
  underneath the caller. This is very strange, and I may be doing
  something wrong here, but debug code such as the snippet below, added
  to get_nfs_info(), demonstrates it:

    +   x = rpc_info->proto->p_proto;
        if (rpc_info->proto->p_proto == IPPROTO_UDP)
            status = rpc_udp_getclient(rpc_info, NFS_PROGRAM, NFS3_VERSION);
        else
            status = rpc_tcp_getclient(rpc_info, NFS_PROGRAM, NFS3_VERSION);
    +   if (x != rpc_info->proto->p_proto)
    +       logmsg("%lu p_proto changed (%d -> %d)", pthread_self(),
    +              x, rpc_info->proto->p_proto);

* Triggering the mounts simultaneously is required to reproduce the
  problem, which makes me wonder whether something here (libtirpc, for
  example) is not really thread safe, or whether the RPC clients must
  be destroyed after every use to avoid such issues.
* Another theory I haven't been able to test yet is that, due to some
  build/link issue, some RPC functions from glibc are still being used
  even when libtirpc is available. Is that possible? Could mixing the
  two cause the problem?

I planned to investigate further so I could provide a better report and
perhaps a fix, but as I haven't made much progress in the last couple
of days, I'm reporting it now.

Thanks,
Leonardo
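
For reference, the reproduction loop above can be wrapped in a small
helper that reports exactly which lookups failed, without the
job-control noise. This is only a sketch: the function name race_check
is made up for illustration, and it assumes the dirN/file layout
described above. It tells you which paths failed to stat, nothing more.

```shell
#!/bin/sh
# race_check BASE COUNT  (hypothetical helper, not part of autofs)
# Stats BASE/dir1/file .. BASE/dirCOUNT/file in parallel and prints the
# directories whose lookup failed. Returns non-zero if any lookup failed.
race_check() {
    base=$1
    count=$2
    pids=""
    i=1
    # Start all lookups in the background so the mounts are triggered
    # at (roughly) the same time.
    while [ "$i" -le "$count" ]; do
        stat "$base/dir$i/file" > /dev/null 2>&1 &
        pids="$pids $!:$i"
        i=$((i + 1))
    done
    failed=0
    # Collect the exit status of each background stat individually.
    for entry in $pids; do
        pid=${entry%%:*}
        idx=${entry##*:}
        if ! wait "$pid"; then
            echo "FAILED $base/dir$idx"
            failed=$((failed + 1))
        fi
    done
    echo "$failed of $count lookups failed"
    [ "$failed" -eq 0 ]
}
```

With the map described above it would be invoked as
"race_check /autofs-race 10".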