Re: [PATCH 8/8] shmem,percpu_counter: add _limited_add(fbc, limit, amount)

Mateusz Guzik <mjguzik@xxxxxxxxx> · Sat, 25 May 2024 08:00:15 +0200

On Fri, Sep 29, 2023 at 08:42:45PM -0700, Hugh Dickins wrote:
> Percpu counter's compare and add are separate functions: without locking
> around them (which would defeat their purpose), it has been possible to
> overflow the intended limit.  Imagine all the other CPUs fallocating
> tmpfs huge pages to the limit, in between this CPU's compare and its add.
> 
> I have not seen reports of that happening; but tmpfs's recent addition
> of dquot_alloc_block_nodirty() in between the compare and the add makes
> it even more likely, and I'd be uncomfortable to leave it unfixed.
> 
> Introduce percpu_counter_limited_add(fbc, limit, amount) to prevent it.
> 

I think the posted (and by now landed) code is racy.

I had seen there was a follow up patch which further augmented the
routine, but it did not alter the issue below so I'm responding to this
thread.

> +/*
> + * Compare counter, and add amount if the total is within limit.
> + * Return true if amount was added, false if it would exceed limit.
> + */
> +bool __percpu_counter_limited_add(struct percpu_counter *fbc,
> +				  s64 limit, s64 amount, s32 batch)
> +{
> +	s64 count;
> +	s64 unknown;
> +	unsigned long flags;
> +	bool good;
> +
> +	if (amount > limit)
> +		return false;
> +
> +	local_irq_save(flags);
> +	unknown = batch * num_online_cpus();
> +	count = __this_cpu_read(*fbc->counters);
> +
> +	/* Skip taking the lock when safe */
> +	if (abs(count + amount) <= batch &&
> +	    fbc->count + unknown <= limit) {
> +		this_cpu_add(*fbc->counters, amount);
> +		local_irq_restore(flags);
> +		return true;
> +	}
> +

Note the value of fbc->count is *not* stabilized.

> +	raw_spin_lock(&fbc->lock);
> +	count = fbc->count + amount;
> +
> +	/* Skip percpu_counter_sum() when safe */
> +	if (count + unknown > limit) {
> +		s32 *pcount;
> +		int cpu;
> +
> +		for_each_cpu_or(cpu, cpu_online_mask, cpu_dying_mask) {
> +			pcount = per_cpu_ptr(fbc->counters, cpu);
> +			count += *pcount;
> +		}
> +	}
> +
> +	good = count <= limit;
> +	if (good) {
> +		count = __this_cpu_read(*fbc->counters);
> +		fbc->count += count + amount;
> +		__this_cpu_sub(*fbc->counters, count);
> +	}
> +
> +	raw_spin_unlock(&fbc->lock);
> +	local_irq_restore(flags);
> +	return good;
> +}
> +

Consider 2 cpus executing the func at the same time, where one is
updating fbc->counter in the slow path while the other is testing it in
the fast path.

cpu0						cpu1
						if (abs(count + amount) <= batch &&				
						    fbc->count + unknown <= limit)
/* gets past the per-cpu traversal */
/* at this point cpu0 decided to bump fbc->count, while cpu1 decided to
 * bump the local counter */
							this_cpu_add(*fbc->counters, amount);
fbc->count += count + amount;

Suppose fbc->count update puts it close enough to the limit that an
addition from cpu1 would put the entire thing over said limit.

Since the 2 operations are not synchronized cpu1 can observe fbc->count
prior to the bump and update it's local counter, resulting in
aforementioned overflow.

Am I misreading something here? Is this prevented?

To my grep the only user is quotas in shmem and I wonder if that use is
even justified. I am aware of performance realities of atomics. However,
it very well may be that whatever codepaths which exercise the counter
are suffering multicore issues elsewhere, making an atomic (in a
dedicated cacheline) a perfectly sane choice for the foreseeable future.
Should this be true there would be 0 rush working out a fixed variant of
the routine.