On Mon, Aug 12, 2024 at 10:15 AM Kui-Feng Lee <sinquersw@xxxxxxxxx> wrote:
>
>
>
> On 8/12/24 09:58, Alexei Starovoitov wrote:
> > On Wed, Aug 7, 2024 at 4:58 PM Kui-Feng Lee <thinker.li@xxxxxxxxx> wrote:
> >> +
> >> +	user_data_mmap = mmap(NULL, sizeof(*user_data_mmap), PROT_READ | PROT_WRITE,
> >> +			      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> >> +	if (!ASSERT_NEQ(user_data_mmap, MAP_FAILED, "mmap"))
> >> +		return;
> >> +
> >> +	memcpy(user_data_mmap, &user_data_mmap_v, sizeof(*user_data_mmap));
> >> +	value.udata_mmap = user_data_mmap;
> >> +	value.udata = &user_data;
> >
> > There shouldn't be a need to do mmap(). It's too much memory overhead.
> > The user should be able to write:
> > static __thread struct user_data udata;
> > value.udata = &udata;
> > bpf_map_update_elem(map_fd, my_task_fd, &value)
> > and do it once.
> > Later multi thread user code will just access "udata".
> > No map lookups.
>
> mmap() is not necessary here. There are two pointers here.
> udata_mmap one is used to test the case crossing page boundary although
> in the current RFC it fails to do it. It will be fixed later.
> udata one works just like what you have described, except user_data is a
> local variable.

Hmm. I guess I misread the code.
But then:

+	struct user_data user_data = ...;
+	value.udata = &user_data;

how is that supposed to work when the address points to the stack?
I guess the kernel can still pin that page, but it will be junk
as soon as the function returns.

> >
> > If sizeof(udata) is small enough the kernel will pin either
> > one or two pages (if udata crosses a page boundary).
> >
> > So no extra memory consumption by the user process while the kernel
> > pins a page or two.
> > In a good case it's one page and no extra vmap.
> > I wonder whether we should enforce that one page case.
> > It's not hard for users to write:
> > static __thread struct user_data udata __attribute__((aligned(sizeof(udata))));
>
> With one page restriction, the implementation would be much simpler. If
> you think it is a reasonable restriction, I would enforce this rule.

I'm worried about vmap().
Doing it for every map element (i.e., for every task)
might add substantial kernel side overhead.

On my devserver:

sudo cat /proc/vmallocinfo | grep vmap | wc -l
105

sudo cat /proc/vmallocinfo | wc -l
17608

I believe that the mechanism scales to millions,
but adding one more vmap per element feels like a footgun.
To avoid that the user would need to make sure their user_data
doesn't cross a page boundary, so imo we can make this mandatory.
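
To make the last point concrete, here is a rough, untested sketch of the
userspace side described above. The struct layout, the value layout with
a single uptr field, and the pidfd-as-key detail are illustrative
assumptions, not taken from the patch:

/* Rough sketch only: struct layout, value layout and the
 * pidfd-as-key detail are assumptions for illustration.
 */
#include <errno.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <bpf/bpf.h>

struct user_data {
	long a;
	long b;
};

/* Aligning the per-thread area to its own size guarantees it cannot
 * cross a page boundary, as long as sizeof(struct user_data) is a
 * power of two and <= PAGE_SIZE.
 */
static __thread struct user_data udata
	__attribute__((aligned(sizeof(struct user_data))));

struct storage_value {
	struct user_data *udata;	/* pinned by the kernel on update */
};

static int register_udata(int map_fd)
{
	struct storage_value value = { .udata = &udata };
	int task_fd, err;

	/* One update per thread; afterwards the thread just reads and
	 * writes "udata" directly, no map lookups. pidfd_open() here
	 * covers the simple single-threaded case.
	 */
	task_fd = syscall(SYS_pidfd_open, getpid(), 0);
	if (task_fd < 0)
		return -errno;

	err = bpf_map_update_elem(map_fd, &task_fd, &value, BPF_ANY);
	close(task_fd);
	return err;
}

With the one-page rule enforced at update time, the kernel can reject
any &udata whose page offset plus size crosses a page, and never needs
vmap() for these elements.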