Re: Possible slab-allocator issue

Vlastimil Babka <vbabka@xxxxxxx> · Tue, 23 Jul 2024 11:05:16 +0200

On 7/22/24 11:38 PM, Paul E. McKenney wrote:
> Hello!

Hi!

> I have been chasing a hang [1] in rcuscale, which is a kernel module
> that does performance testing for RCU updaters.  I have a good long-term
> workaround for this problem, namely an extremely stupid custom allocator,
> so I don't need a fix, but I am nevertheless posting this just in case
> this affects others.  Or in case I am doing something stupid that I
> should avoid doing in the future.
> 
> I first encountered this on RCU's "dev" branch, but the problem also
> exists in current mainline [2].  One high-probability reproducer for
> this hang, tested on both AMD and Intel x86, is as follows:
> 
> tools/testing/selftests/rcutorture/bin/kvm.sh --torture rcuscale --allcpus --duration 5 --configs TREE --bootargs "rcuscale.scale_type=tasks rcuscale.minruntime=5 rcuscale.nreaders=0 rcuscale.nwriters rcuscale.gp_async=1 rcuscale.gp_async_max=100000" --trust-make
> 
> This script builds a kernel, requires that virtualization be enabled,
> and uses the qemu-system-x86_64 command.  I have not run this on a system
> with fewer than 16 CPUs.
> 
> When it fails, this takes a bit less than ten minutes to fail, which
> results in a non-zero exit code from kvm.sh.  Rare successes normally
> consume kernel-build time plus less than a minute of run time.
> 
> The failure mode is CPU 0 (and always CPU 0) failing to get out of this
> (admittedly ugly, since reworked) "retry" loop in the rcu_scale_writer()
> function in kernel/rcu/rcuscale.c:
> 
>  1 retry:
>  2   if (!rhp)
>  3     rhp = kmalloc(sizeof(*rhp), GFP_KERNEL);
>  4   if (rhp && atomic_read(this_cpu_ptr(&n_async_inflight)) < gp_async_max) {
>  5     atomic_inc(this_cpu_ptr(&n_async_inflight));
>  6     cur_ops->async(rhp, rcu_scale_async_cb);
>  7     rhp = NULL;
>  8   } else if (!kthread_should_stop()) {
>  9     cur_ops->gp_barrier();
> 10     goto retry;
> 11   } else {
> 12     kfree(rhp); /* Because we are stopping. */

Noted no "rhp = NULL;" but we probably really shouldn't reach this again due
to torture_must_stop() including kthread_should_stop() so it should be fine.

> 13   }
> 
> The loop begins with the "retry" label on line 1, finds "rhp"
> equal to NULL on line 2, gets a NULL from kmalloc(..., GFP_KERNEL)
> on line 3, proceeds to the "else if" clause on lines 8-11,

So it's been verified that rhp is indeed null after kmalloc() and it's not
going to the "else if" because of the rest of the condition?

I wonder how this loop works, it increases a percpu counter
n_async_inflight, and rcu_scale_async_cb() decreases it, but can't the
callback be invoked on a different cpu, making these counters imbalanced?

> invokes rcu_barrier_tasks() on line 9 (because that is what the
> rcuscale.scale_type=tasks kernel boot parameter tells it to do), then
> line 10 jumps back to the retry label.  This repeats thousands of times,
> consuming some minutes.  There is no drama in the console log, as in no
> mention of OOM killer, allocation failures, or much of anything other
> than Tasks RCU switching to per-CPU callback queueing due to the heavy
> load and the expected rcuscale output.

A GFP_KERNEL failure should definitely produce OOM or allocation failure
warning (and it's "too small to fail" to start with :)) Silently returning
NULL is, uh, inconcievable. That would really require a bug in the
implementation, or perhaps some slab structures corruption due to double
free or use after free (although I've seen those always manifest in a
different, much more noisy way).

> Note that all 16 CPUs are looping allocating and invoking either
> call_rcu_tasks() when allocation succeeds or rcu_barrier_tasks() when
> allocation fails.  So CPU 0 could simply be quite unlucky or have other
> work so that when the rcu_barrier_tasks() waits for memory to be released,
> the other CPUs allocate it all before CPU 0's rcu_scale_writer() task
> is awakened.  But again, there is no sign of memory-system drama on
> the console log.
> 
> I am good, as I implemented a stupid per-CPU-array-based allocator,
> which avoids this issue, admittedly quite crudely.

That shouldn't help if it was a problem with n_async_inflight imbalance as I
suggested above but who knows, maybe it's being busy in kmalloc() that
causes some rcu callbacks to be executed elsewhere than on the CPU that
submitted them, and using your allocator avoids that?

> Johannes suggested turning on tracing for the kmem:kmalloc and
> kmem:mm_page_alloc trace events, which I did using this command:
> 
> tools/testing/selftests/rcutorture/bin/kvm.sh --torture rcuscale --allcpus --duration 5 --configs TREE --kconfig "CONFIG_DEBUG_KERNEL=y CONFIG_RCU_TRACE=y CONFIG_FTRACE=y" --bootargs "rcuscale.scale_type=tasks rcuscale.minruntime=5 rcuscale.nreaders=0 rcuscale.nwriters rcuscale.gp_async=1 rcuscale.gp_async_max=100000 trace_event=kmem:kmalloc,kmem:mm_page_alloc trace_buf_size=3k" --trust-make
> 
> I collected the .config file [3] and console output [4].  The console
> output features more than 1500 kmalloc trace events and not quite 10
> mm_page_alloc trace events.

Hm but tracing doesn't output to console so I don't see any trace events there.

> Please let me know if more information would be helpful.
> 
> 							Thanx, Paul
> 
> [1] https://docs.google.com/document/d/1M_SJ_hVUe-jtkH6BjHzzRFOnAiu1sqThlFnx8Jfp-fg/edit?usp=sharing
> [2] bbb3556c014d ("Merge tag 'keys-next-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd"), for example.
> [3] https://drive.google.com/file/d/1lZmIvatLyVYzdRK_rv2xNNWDBiGwyQvk/view?usp=sharing
> [4] https://drive.google.com/file/d/1jh4hJG877j0Rp65153wtmzKm6jErXZ9n/view?usp=sharing