On Thu, Nov 24, 2022 at 12:11:22AM +0000, Rishabh Bhatnagar wrote:
> From: Roman Penyaev <rpenyaev@xxxxxxx>
> 
> Commit 65759097d804d2a9ad2b687db436319704ba7019 upstream.
> 
> There is a possible race when ep_scan_ready_list() leaves ->rdllist and
> ->ovflist empty for a short period of time although some events are
> pending. It is quite likely that ep_events_available() observes empty
> lists and goes to sleep.
> 
> Since commit 339ddb53d373 ("fs/epoll: remove unnecessary wakeups of
> nested epoll") we are conservative in wakeups (there is only one place
> for wakeup and this is ep_poll_callback()), thus ep_events_available()
> must always observe the correct state of the two lists.
> 
> The easiest and correct way is to do the final check under the lock.
> This does not impact the performance, since the lock is taken anyway
> for adding a wait entry to the wait queue.
> 
> The discussion of the problem can be found here:
> 
>   https://lore.kernel.org/linux-fsdevel/a2f22c3c-c25a-4bda-8339-a7bdaf17849e@xxxxxxxxxx/
> 
> In this patch the barrierless __set_current_state() is used. This is
> safe since waitqueue_active() is called under the same lock on the
> wakeup side.
> 
> The short-circuit for fatal signals (i.e. the fatal_signal_pending()
> check) is moved to the line just before the actual event-harvesting
> routine. This is fully compliant with what is said in the comment of
> the patch where the actual fatal_signal_pending() check was added:
> c257a340ede0 ("fs, epoll: short circuit fetching events if thread has
> been killed").
> 
> Fixes: 339ddb53d373 ("fs/epoll: remove unnecessary wakeups of nested epoll")
> Reported-by: Jason Baron <jbaron@xxxxxxxxxx>
> Reported-by: Randy Dunlap <rdunlap@xxxxxxxxxxxxx>
> Signed-off-by: Roman Penyaev <rpenyaev@xxxxxxx>
> Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> Reviewed-by: Jason Baron <jbaron@xxxxxxxxxx>
> Cc: Khazhismel Kumykov <khazhy@xxxxxxxxxx>
> Cc: Alexander Viro <viro@xxxxxxxxxxxxxxxxxx>
> Cc: <stable@xxxxxxxxxxxxxxx>
> Link: http://lkml.kernel.org/r/20200505145609.1865152-1-rpenyaev@xxxxxxx
> Signed-off-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
> Signed-off-by: Rishabh Bhatnagar <risbhat@xxxxxxxxxx>

Acked-by: Thadeu Lima de Souza Cascardo <cascardo@xxxxxxxxxxxxx>

I ended up picking these two fixes into our kernels as well, even though
we could not pinpoint the process kernel stack trace, as you did, to
determine that the failure had happened. We are still testing that this
is really fixed by these two commits.

On the other hand, the epoll61 test in
tools/testing/selftests/filesystems/epoll/epoll_wakeup_test.c starts
passing once these two commits are applied.

Cascardo.

> ---
>  fs/eventpoll.c | 47 +++++++++++++++++++++++++++--------------------
>  1 file changed, 27 insertions(+), 20 deletions(-)
> 
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index 7e11135bc915..e5496483a882 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -1905,33 +1905,31 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
>  	init_wait(&wait);
>  	wait.func = ep_autoremove_wake_function;
>  	write_lock_irq(&ep->lock);
> -	__add_wait_queue_exclusive(&ep->wq, &wait);
> -	write_unlock_irq(&ep->lock);
> -
>  	/*
> -	 * We don't want to sleep if the ep_poll_callback() sends us
> -	 * a wakeup in between. That's why we set the task state
> -	 * to TASK_INTERRUPTIBLE before doing the checks.
> +	 * Barrierless variant, waitqueue_active() is called under
> +	 * the same lock on wakeup ep_poll_callback() side, so it
> +	 * is safe to avoid an explicit barrier.
>  	 */
> -	set_current_state(TASK_INTERRUPTIBLE);
> +	__set_current_state(TASK_INTERRUPTIBLE);
> +
>  	/*
> -	 * Always short-circuit for fatal signals to allow
> -	 * threads to make a timely exit without the chance of
> -	 * finding more events available and fetching
> -	 * repeatedly.
> +	 * Do the final check under the lock. ep_scan_ready_list()
> +	 * plays with two lists (->rdllist and ->ovflist) and there
> +	 * is always a race when both lists are empty for short
> +	 * period of time although events are pending, so lock is
> +	 * important.
>  	 */
> -	if (fatal_signal_pending(current)) {
> -		res = -EINTR;
> -		break;
> +	eavail = ep_events_available(ep);
> +	if (!eavail) {
> +		if (signal_pending(current))
> +			res = -EINTR;
> +		else
> +			__add_wait_queue_exclusive(&ep->wq, &wait);
>  	}
> +	write_unlock_irq(&ep->lock);
> 
> -	eavail = ep_events_available(ep);
> -	if (eavail)
> -		break;
> -	if (signal_pending(current)) {
> -		res = -EINTR;
> +	if (eavail || res)
>  		break;
> -	}
> 
>  	if (!schedule_hrtimeout_range(to, slack, HRTIMER_MODE_ABS)) {
>  		timed_out = 1;
> @@ -1952,6 +1950,15 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
>  	}
> 
>  send_events:
> +	if (fatal_signal_pending(current)) {
> +		/*
> +		 * Always short-circuit for fatal signals to allow
> +		 * threads to make a timely exit without the chance of
> +		 * finding more events available and fetching
> +		 * repeatedly.
> +		 */
> +		res = -EINTR;
> +	}
>  	/*
>  	 * Try to transfer events to user space. In case we get 0 events and
>  	 * there's still timeout left over, we go trying again in search of
> -- 
> 2.37.1
> 
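As an aside for readers following the thread: the core idea the commit
message describes, doing the final "are events available?" check under
the same lock the waker takes so a wakeup cannot slip in between the
check and the sleep, can be illustrated with a small userspace analogy.
This is only a sketch, not the kernel code, and every name in it is made
up for the illustration; in ep_poll() the corresponding pieces are
ep->lock, ep_events_available(), the exclusive wait queue entry, and the
wakeup issued from ep_poll_callback().

/*
 * Userspace analogy of the "final check under the lock" pattern.
 * Build with: cc -pthread check_under_lock.c
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t wq = PTHREAD_COND_INITIALIZER;
static bool events_available;	/* stand-in for ep_events_available() */

/*
 * Waker side: marks events ready and wakes a waiter, all under the
 * lock, which is roughly what ep_poll_callback() does under ep->lock.
 */
static void *waker(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&lock);
	events_available = true;
	pthread_cond_signal(&wq);
	pthread_mutex_unlock(&lock);
	return NULL;
}

/*
 * Waiter side: the readiness check and the decision to sleep happen
 * under the same lock, so the wakeup above cannot be lost in between.
 */
static void waiter(void)
{
	pthread_mutex_lock(&lock);
	while (!events_available)	/* the "final check under the lock" */
		pthread_cond_wait(&wq, &lock);
	pthread_mutex_unlock(&lock);
	printf("events harvested\n");
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, waker, NULL);
	waiter();
	pthread_join(t, NULL);
	return 0;
}

The barrier remark in the commit message maps onto the same picture:
because the waker's waitqueue_active() check runs under the same
ep->lock, the waiter can use the barrierless __set_current_state()
instead of set_current_state().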