ibtirpc connect deadlock (part 1)

Attila Kovacs <attila.kovacs@xxxxxxxxxxxxxxx> · Wed, 20 Jul 2022 14:15:08 -0400

Hi,

We are using the tirpc library in an MT environment. We are experiencing 
occasional deadlocks when calling clnt_create_timed() concurrently from 
multiple threads. When this happens, all connect threads hang, with each 
thread taking up 100% CPU.

I was peeking at the code (release 1.3.2), and some potential problems I 
see is how waiting / signaling is implemented on cu->cu_fd_lock.cv in 
clnt_dg.c and clnt_vc.c.

1. In cnlt_dg_freeres() and clnt_vc_freeres(), cond_signal() is called 
after unlocking the mutex (clnt_fd_lock). The manual of 
pthread_cond_signal() allows that, but mentions that for consistent 
scheduling, cond_signal() should be called with the waiting mutex 
locked.  (i.e. lock -> signal -> unlock, rather than lock -> unlock -> 
signal).

One of the dangers of signaling with the unlocked mutex, is that in MT, 
another thread can lock the mutex before the signal call is made, and 
potentially destroy the condition variable (e.g. an asynchronous call to 
clnt_*_destroy()). Thus, by the time the signal() is called in 
clnt*_freeres(), both the condition may be invalid.

The proper sequence here should be:

	[...]
	mutex_lock(&clnt_fd_lock);
	while (ct->ct_fd_lock->active)
		cond_wait(&ct->ct_fd_lock->cv, &clnt_fd_lock);
	xdrs->x_op = XDR_FREE;
	dummy = (*xdr_res)(xdrs, res_ptr);
	thr_sigsetmask(SIG_SETMASK, &(mask), NULL);
	cond_signal(&ct->ct_fd_lock->cv);
	mutex_unlock(&clnt_fd_lock);

2. Similar issue in the macro release_fd_lock() in both the vc and dg 
sources. Here again the sequence ought to be:

#define release_fd_lock(fd_lock, mask) {	\
	mutex_lock(&clnt_fd_lock);	\
	fd_lock->active = FALSE;	\
	thr_sigsetmask(SIG_SETMASK, &(mask), (sigset_t *) NULL); \
	cond_signal(&fd_lock->cv);	\
	mutex_unlock(&clnt_fd_lock);	\
}

3. The use of cond_wait() in clnt_dg.c and clnt_vc.c is also unprotected 
against the asynchronous destruction of the condition variable, since 
cond_wait() releases the mutex as it enters the waiting state. So, when 
it is called again in the while() loop, the condtion might no longer 
exist. Thus, before calling cond_wait(), one should check that the 
client is valid (not destroyed) inside the wait loop.

I'm not sure if these particular issues are the cause of the deadlocks 
we see, but I think they are problematic enough on their own, and 
perhaps should be fixed in the next release.

Thanks,

-- Attila