Re: [Libtirpc-devel] [PATCH] Do not hold clnt_fd_lock mutex during connect

Ian Kent <ikent@xxxxxxxxxx> · Thu, 19 May 2016 11:43:55 +0800

On Wed, 2016-05-18 at 14:54 -0300, Paulo Andrade wrote:
>   A user reports that their application connects to multiple servers
> through an RPC interface using libtirpc. When one of the servers misbehaves
> (goes down ungracefully or has a delay of a few seconds in the traffic
> flow), traffic from the client to the other servers is also affected by
> the failing server, i.e. traffic decreases or drops to 0 on all the
> servers.
> 
>   On further investigation of libtirpc's behavior at the time of the
> issue, it was observed that all of the application threads interacting
> with libtirpc were blocked on a single lock inside the library. This was
> a race condition which had resulted in a deadlock, and hence the
> dip/stoppage of traffic.
> 
>   As an experiment, the user removed libtirpc from the application build
> and used the standard glibc library for RPC communication. In that case,
> everything worked perfectly, even while server nodes were misbehaving.

I recommend simplifying this.

It should be a concise description of what is wrong and how this patch resolves
it.

The description of the investigation will probably make the history harder to
read when searching for changes later, so less is more, I think.

> 
> Signed-off-by: Paulo Andrade <pcpa@xxxxxxx>
> ---
>  src/clnt_vc.c | 8 ++------
>  1 file changed, 2 insertions(+), 6 deletions(-)
> 
> diff --git a/src/clnt_vc.c b/src/clnt_vc.c
> index a72f9f7..2396f34 100644
> --- a/src/clnt_vc.c
> +++ b/src/clnt_vc.c
> @@ -229,27 +229,23 @@ clnt_vc_create(fd, raddr, prog, vers, sendsz, recvsz)
>  	} else
>  		assert(vc_cv != (cond_t *) NULL);
>  
> -	/*
> -	 * XXX - fvdl connecting while holding a mutex?
> -	 */
> +	mutex_unlock(&clnt_fd_lock);
> +
>  	slen = sizeof ss;
>  	if (getpeername(fd, (struct sockaddr *)&ss, &slen) < 0) {
>  		if (errno != ENOTCONN) {
>  			rpc_createerr.cf_stat = RPC_SYSTEMERROR;
>  			rpc_createerr.cf_error.re_errno = errno;
> -			mutex_unlock(&clnt_fd_lock);
>  			thr_sigsetmask(SIG_SETMASK, &(mask), NULL);
>  			goto err;
>  		}
>  		if (connect(fd, (struct sockaddr *)raddr->buf, raddr->len) < 0){
>  			rpc_createerr.cf_stat = RPC_SYSTEMERROR;
>  			rpc_createerr.cf_error.re_errno = errno;
> -			mutex_unlock(&clnt_fd_lock);
>  			thr_sigsetmask(SIG_SETMASK, &(mask), NULL);
>  			goto err;
>  		}
>  	}
> -	mutex_unlock(&clnt_fd_lock);
>  	if (!__rpc_fd2sockinfo(fd, &si))
>  		goto err;
>  	thr_sigsetmask(SIG_SETMASK, &(mask), NULL);

We will need to review the code in the other clnt_*_create() functions for this
to be a complete resolution for the problem.
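
For anyone checking those other functions, the pattern in question can be
sketched roughly as below. This is only an illustration, not libtirpc code:
fd_lock, fd_registered, probe_slow_op(), create_holding_lock() and
create_dropping_lock() are all hypothetical names, and the slow operation is a
stand-in for connect() that merely probes whether the shared lock is free.

```c
#include <assert.h>
#include <pthread.h>

#define MAX_FDS 64

/* Hypothetical stand-ins for clnt_fd_lock and the per-fd state it guards. */
pthread_mutex_t fd_lock = PTHREAD_MUTEX_INITIALIZER;
int fd_registered[MAX_FDS];

/* Set by probe_slow_op(): 1 if fd_lock was free during the slow call. */
int lock_was_free;

/* Stand-in for connect(): probes the lock instead of touching the network. */
int probe_slow_op(int fd)
{
	(void)fd;
	if (pthread_mutex_trylock(&fd_lock) == 0) {
		lock_was_free = 1;
		pthread_mutex_unlock(&fd_lock);
	} else {
		lock_was_free = 0;
	}
	return 0;
}

/* Old pattern: the shared lock is held across the slow call, so every
 * other thread that needs fd_lock waits for this one peer. */
int create_holding_lock(int fd, int (*slow_op)(int))
{
	int rc;

	assert(fd >= 0 && fd < MAX_FDS);
	pthread_mutex_lock(&fd_lock);
	fd_registered[fd] = 1;
	rc = slow_op(fd);
	pthread_mutex_unlock(&fd_lock);
	return rc;
}

/* Patched pattern: update the shared state under the lock, drop the
 * lock, and only then make the potentially slow call. */
int create_dropping_lock(int fd, int (*slow_op)(int))
{
	assert(fd >= 0 && fd < MAX_FDS);
	pthread_mutex_lock(&fd_lock);
	fd_registered[fd] = 1;
	pthread_mutex_unlock(&fd_lock);
	return slow_op(fd);
}
```

Passing probe_slow_op to each variant shows the difference: with
create_holding_lock() the probe finds the lock busy, with
create_dropping_lock() it finds the lock free. Whether anything touched after
the unlock still needs protection is the part that has to be checked per
function.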

Ian
