Re: [RFC bpf-next 5/5] selftests/bpf: test __kptr_user on the value of a task storage map.

Kui-Feng Lee <sinquersw@xxxxxxxxx> · Mon, 12 Aug 2024 11:10:16 -0700

On 8/12/24 10:31, Alexei Starovoitov wrote:
On Mon, Aug 12, 2024 at 10:15 AM Kui-Feng Lee <sinquersw@xxxxxxxxx> wrote:

On 8/12/24 09:58, Alexei Starovoitov wrote:
On Wed, Aug 7, 2024 at 4:58 PM Kui-Feng Lee <thinker.li@xxxxxxxxx> wrote:
+
+       user_data_mmap = mmap(NULL, sizeof(*user_data_mmap), PROT_READ | PROT_WRITE,
+                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+       if (!ASSERT_NEQ(user_data_mmap, MAP_FAILED, "mmap"))
+               return;
+
+       memcpy(user_data_mmap, &user_data_mmap_v, sizeof(*user_data_mmap));
+       value.udata_mmap = user_data_mmap;
+       value.udata = &user_data;

There shouldn't be a need to do mmap(). It's too much memory overhead.
The user should be able to write:
static __thread struct user_data udata;
value.udata = &udata;
bpf_map_update_elem(map_fd, my_task_fd, &value)
and do it once.
Later multi thread user code will just access "udata".
No map lookups.

mmap() is not necessary here. There are two pointers here.
udata_mmap one is used to test the case crossing page boundary although
in the current RFC it fails to do it. It will be fixed later.
udata one works just like what you have described, except user_data is a
local variable.

Hmm. I guess I misread the code.
But then:
+       struct user_data user_data user_data = ...;
+       value.udata = &user_data;

how is that supposed to work when the address points to the stack?
I guess the kernel can still pin that page, but it will be junk
as soon as the function returns.

You are right! It works only for this test case since the map will be
destroyed before leaving this function. I will move it to a static variable.

If sizeof(udata) is small enough the kernel will pin either
one or two pages (if udata crosses page boundary).

So no extra memory consumption by the user process while the kernel
pins a page or two.
In a good case it's one page and no extra vmap.
I wonder whether we should enforce that one page case.
It's not hard for users to write:
static __thread struct user_data udata __attribute__((aligned(sizeof(udata))));

With one page restriction, the implementation would be much simpler. If
you think it is a reasonable restriction, I would enforce this rule.

I'm worried about vmap(). Doing it for every map elemen (same as every
task) might add substantial kernel side overhead.
On my devserver:
sudo cat /proc/vmallocinfo |grep vmap|wc -l
105
sudo cat /proc/vmallocinfo |wc -l
17608

I believe that the mechanism scales to millions, but adding one more
vmap per element feels like a footgun.
To avoid that the user would need to make sure their user_data doesn't
cross the page, so imo we can make this mandatory.

If the memory block that is pointed by a uptr takes only one page,
vmap() is not called. vmap() is called only for the cases that take two
or more pages. Without the one page restriction, there is a chance to
have additional vmaps even for small memory blocks, but not every uptr
having that extra vmap.

Users can accidentally create a new vmap. But, with current
implementation, they can also avoid it by aligning memory properly. The
trade-off is between supporting big chunks of memory and idiot-proof.
However, in my opinion, big chunks are very unlikely for task local storage.

So, I will make this restriction mandatory.