Re: [PATCH V3 2/2] cgroup/rstat: Avoid thundering herd problem by kswapd across NUMA nodes

Jesper Dangaard Brouer <hawk@xxxxxxxxxx> · Thu, 27 Jun 2024 21:18:35 +0200

On 27/06/2024 20.45, Shakeel Butt wrote:
On Thu, Jun 27, 2024 at 04:32:03AM GMT, Yosry Ahmed wrote:
On Thu, Jun 27, 2024 at 3:33 AM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
[...]

The reason why I suggested that the completion live in struct cgroup
is because there is a chance here that the flush completes and another
irrelevant flush starts between reading cgrp_rstat_ongoing_flusher and
calling wait_for_completion_interruptible_timeout().

I didn't add this per cgroup because I fear the race of adding a 
wait_for_completion on a cgroup that gets stuck there, but looking at 
the code the completion API should be able to avoid this.

Yes this can happen if flusher for irrelevant cgroup calls
reinit_completion() while the initial flusher was just about to call
wait_for_completion_interruptible_timeout().

Restoring two main functions to assist reviewer seeing the race:

On 26/06/2024 23.18, Jesper Dangaard Brouer wrote:
> +#define MAX_WAIT	msecs_to_jiffies(100)
> +/* Trylock helper that also checks for on ongoing flusher */
> +static bool cgroup_rstat_trylock_flusher(struct cgroup *cgrp)
> +{
> +retry:
> +	bool locked = __cgroup_rstat_trylock(cgrp, -1);
> +	if (!locked) {
> +		struct cgroup *cgrp_ongoing;
> +
> +		/* Lock is contended, lets check if ongoing flusher is
> +		 * taking care of this, if we are a descendant.
> +		 */
> +		cgrp_ongoing = READ_ONCE(cgrp_rstat_ongoing_flusher);
> +		if (!cgrp_ongoing)
> +			goto retry;
> +

Long wait/race here, can cause us to see an out-dated cgrp_ongoing.
And then another CPU manage to reach reinit_completion() below, before
execution continues here.

> +		if (cgroup_is_descendant(cgrp, cgrp_ongoing)) {
> +			wait_for_completion_interruptible_timeout(
> +				&cgrp_rstat_flusher_done, MAX_WAIT);
> +
> +			return false;
> +		}
> +		__cgroup_rstat_lock(cgrp, -1, false);
> +	}
> +	/* Obtained lock, record this cgrp as the ongoing flusher */
> +	reinit_completion(&cgrp_rstat_flusher_done);
> +	WRITE_ONCE(cgrp_rstat_ongoing_flusher, cgrp);
> +
> +	return true; /* locked */
> +}
> +
> +static void cgroup_rstat_unlock_flusher(struct cgroup *cgrp)
> +{
> +	WRITE_ONCE(cgrp_rstat_ongoing_flusher, NULL);
> +	complete_all(&cgrp_rstat_flusher_done);
> +	__cgroup_rstat_unlock(cgrp, -1);
> +}

This will cause the caller to wait for an irrelevant flush, which may
be fine because today the caller would wait for the lock anyway. Just
mentioning this in case you think this may happen enough to be a
problem.

Yes, it would wait for an irrelevant flush.

Actually, I think this can happen beyond the window I described above.
I think it's possible that a thread waits for the flush, then gets
woken up when complete_all() is called, but another flusher calls
reinit_completion() immediately. The woken up thread will observe
completion->done == 0 and go to sleep again.

I don't think it will go to sleep again as there is no retry.

I think most of these cases can be avoided if we make the completion
per cgroup. It is still possible to wait for more flushes than
necessary, but only if they are for the same cgroup.

Yeah, per-cgroup completion would avoid the problem of waiting for
irrelevant flush.

Great, I will code up a version with per-cgroup completion.

--Jesper