Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> writes:

> On Wed, Apr 14, 2021 at 2:25 PM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
>>
>> On Wed, Apr 14, 2021 at 09:18:09PM +0200, Toke Høiland-Jørgensen wrote:
>> > "Paul E. McKenney" <paulmck@xxxxxxxxxx> writes:
>> >
>> > > On Wed, Apr 14, 2021 at 08:39:04PM +0200, Toke Høiland-Jørgensen wrote:
>> > >> "Paul E. McKenney" <paulmck@xxxxxxxxxx> writes:
>> > >>
>> > >> > On Wed, Apr 14, 2021 at 10:59:23AM -0700, Alexei Starovoitov wrote:
>> > >> >> On Wed, Apr 14, 2021 at 10:52 AM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
>> > >> >> >
>> > >> >> > > > > > if (num_online_cpus() > 1)
>> > >> >> > > > > > 	synchronize_rcu();
>> > >> >> >
>> > >> >> > In CONFIG_PREEMPT_NONE=y and CONFIG_PREEMPT_VOLUNTARY=y kernels, this
>> > >> >> > synchronize_rcu() will be a no-op anyway due to there only being the
>> > >> >> > one CPU. Or are these failures all happening in CONFIG_PREEMPT=y kernels,
>> > >> >> > and in tests where preemption could result in the observed failures?
>> > >> >> >
>> > >> >> > Could you please send your .config file, or at least the relevant portions
>> > >> >> > of it?
>> > >> >>
>> > >> >> That's my understanding as well. I assumed Toke has preempt=y.
>> > >> >> Otherwise the whole thing needs to be root caused properly.
>> > >> >
>> > >> > Given that there is only a single CPU, I am still confused about what
>> > >> > the tests are expecting the membarrier() system call to do for them.
>> > >>
>> > >> It's basically a proxy for waiting until the objects are freed on the
>> > >> kernel side, as far as I understand...
>> > >
>> > > There are in-kernel objects that are freed via call_rcu(), and the idea
>> > > is to wait until these objects really are freed? Or am I still missing
>> > > out on what is going on?
>> >
>> > Something like that? Although I'm not actually sure these are using
>> > call_rcu()? One of them needs __put_task_struct() to run, and the other
>> > waits for map freeing, with this comment:
>> >
>> >
>> > 	/* we need to either wait for or force synchronize_rcu(), before
>> > 	 * checking for "still exists" condition, otherwise map could still be
>> > 	 * resolvable by ID, causing false positives.
>> > 	 *
>> > 	 * Older kernels (5.8 and earlier) freed map only after two
>> > 	 * synchronize_rcu()s, so trigger two, to be entirely sure.
>> > 	 */
>> > 	CHECK(kern_sync_rcu(), "sync_rcu", "failed\n");
>> > 	CHECK(kern_sync_rcu(), "sync_rcu", "failed\n");
>>
>> OK, so the issue is that the membarrier() system call is designed to force
>> ordering only within a user process, and you need it in the kernel.
>>
>> Give or take my being puzzled as to why the membarrier() system call
>> doesn't do it for you on a CONFIG_PREEMPT_NONE=y system, this brings
>> us back to the question Alexei asked me in the first place, what is the
>> best way to invoke an in-kernel synchronize_rcu() from userspace?
>>
>> You guys gave some reasonable examples. Here are a few others:
>>
>> o	Bring a CPU online, then force it offline, or vice versa.
>> 	But in this case, sys_membarrier() would do what you need
>> 	given more than one CPU.
>>
>> o	Use the membarrier() system call, but require that the tests
>> 	run on systems with at least two CPUs.
>>
>> o	Create a kernel module whose init function does a
>> 	synchronize_rcu() and then returns failure. This will
>> 	avoid the overhead of removing that kernel module.
>>
>> o	Create a sysfs or debugfs interface that does a
>> 	synchronize_rcu().
>>
>> But I am still concerned that you are needing more than synchronize_rcu()
>> can do. Otherwise, the membarrier() system call would work just fine
>> on a single CPU on your CONFIG_PREEMPT_VOLUNTARY=y kernel.
>
> Selftests know internals of kernel implementation and wait for some
> objects to be freed with call_rcu(). So I think at this point the best
> way is just to go back to map-in-map or socket local storage.
> Map-in-map will probably work on older kernels, so I'd stick with that
> (plus all the code is there in the referenced commit). The performance
> and number of syscalls performed doesn't matter, really.

Just tried that (with the patch below, pulled from the commit you
referred to), and that doesn't help. Still get this with a single CPU:

  test_lookup_update:FAIL:map1_leak inner_map1 leaked!
  #15/1 lookup_update:FAIL
  #15 btf_map_in_map:FAIL

It's fine with 2 CPUs. And the other failures (in the task_local_storage
test) seem to have gone away entirely after I just pulled the newest
bpf-next...

-Toke


diff --git a/tools/testing/selftests/bpf/test_progs.c b/tools/testing/selftests/bpf/test_progs.c
index 6396932b97e2..4c26d84a64dc 100644
--- a/tools/testing/selftests/bpf/test_progs.c
+++ b/tools/testing/selftests/bpf/test_progs.c
@@ -376,7 +376,27 @@ static int delete_module(const char *name, int flags)
  */
 int kern_sync_rcu(void)
 {
-	return syscall(__NR_membarrier, MEMBARRIER_CMD_SHARED, 0, 0);
+	int inner_map_fd, outer_map_fd, err, zero = 0;
+
+	inner_map_fd = bpf_create_map(BPF_MAP_TYPE_ARRAY, 4, 4, 1, 0);
+	if (!ASSERT_LT(0, inner_map_fd, "inner_map_create"))
+		return -1;
+
+	outer_map_fd = bpf_create_map_in_map(BPF_MAP_TYPE_ARRAY_OF_MAPS, NULL,
+					     sizeof(int), inner_map_fd, 1, 0);
+	if (!ASSERT_LT(0, outer_map_fd, "outer_map_create")) {
+		close(inner_map_fd);
+		return -1;
+	}
+
+	err = bpf_map_update_elem(outer_map_fd, &zero, &inner_map_fd, 0);
+	if (err)
+		err = -errno;
+	ASSERT_OK(err, "outer_map_update");
+	close(inner_map_fd);
+	close(outer_map_fd);
+
+	return err;
 }
 
 static void unload_bpf_testmod(void)
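For reference, the "kernel module whose init function does a
synchronize_rcu() and then returns failure" option mentioned above could
be sketched roughly as follows. This is purely illustrative and not code
from any tree; the file name, function name and the -EAGAIN error value
are made up:

/* sync_rcu_mod.c - illustrative only: force an in-kernel RCU grace
 * period from userspace by loading a module that deliberately fails
 * its init after calling synchronize_rcu().
 */
#include <linux/errno.h>
#include <linux/init.h>
#include <linux/module.h>
#include <linux/rcupdate.h>

static int __init sync_rcu_mod_init(void)
{
	/* Wait for a full in-kernel RCU grace period. */
	synchronize_rcu();

	/* Fail the load on purpose so no rmmod is needed afterwards. */
	return -EAGAIN;
}

module_init(sync_rcu_mod_init);
MODULE_LICENSE("GPL");

Loading such a module (e.g. with insmod) would fail by design, but by the
time the failing module-load syscall returns to userspace a grace period
has presumably elapsed, and the module never has to be removed.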