Re: BPF map freezing is unreliable; can we instead just inline constants?

Jann Horn <jannh@xxxxxxxxxx> · Fri, 10 Apr 2020 10:46:45 +0200

On Fri, Apr 10, 2020 at 1:33 AM Andrii Nakryiko
<andrii.nakryiko@xxxxxxxxx> wrote:
> On Wed, Apr 8, 2020 at 12:42 PM Jann Horn <jannh@xxxxxxxxxx> wrote:
> >
> > Hi!
> >
> > I saw that BPF allows root to create frozen maps, for which the
> > verifier then assumes that they contain constant values. However, map
> > freezing is pretty wobbly:
> >
> > 1. The syscalls for updating maps from userspace don't seem to lock
> > the map at all.
>
> True, there is a tiny race between freezing and map updates, but I
> don't think it's possible to solve it without taking locks all around
> the place in map update operations.

Yeah. So I think BPF should do exactly that. Or change the userspace
API so that userspace has to say at map creation time "I'll freeze
this map later", and then you only have to do the locking if that flag
is set.

[...]
> > 3. It is assumed that a memory mapping can't be used to write to a
> > page anymore after the mapping has been removed; but actually,
> > userspace can grab references to pages in a VMA and use those
> > references to write to the VMA's pages after the VMA has already been
> > closed. (crasher attached as bpf-constant-data-uffd.c, compile with
> > "gcc -pthread ...")
>
> Please help me understand how that works (assuming we drop
> VM_MAYWRITE, of course). You mmap() as R/W, then munmap(), then
> freeze(). After munmap() refcount of writable pages should drop to
> zero. And mmap'ed address should be invalid and unmapped. I'm missing
> how after munmap() parallel thread still can write to that memory
> page?

The mmap()/munmap() syscalls place references to the pages the kernel
is using in the page tables of the process. Some other syscalls (such
as process_vm_writev()) can read these page table entries, take their
own references on those backing pages, and then continue to access
those pages even after they've been removed from the task's page
tables by munmap(). This works as long as the page table entries don't
have magic marker bits on them that prohibit this, which you get if
you use something like remap_pfn_range() in a loop instead of
remap_vmalloc_range() - but the memory mappings created by that
syscall are weird, and e.g. some syscalls like read() and write()
might sometimes fail if the buffer argument points into such a memory
region.

[...]
> > Is there a reason why the verifier doesn't replace loads from frozen
> > maps with the values stored in those maps? That seems like it would be
> > not only easier to secure, but additionally more performant.
>
> Verifier doesn't always know exact offset at which program is going to
> read read-only map contents. So something like this works:
>
> const volatile long arr[256];
>
> int x = rand() % 256;
> int y = arr[x];
>
> In this case verifier doesn't really know the value of y, so it can't
> be inlined. Then you can have code in which in one branch register is
> loaded with known value, but in another branch same register gets some
> value at random offset. Constant tracking is code path-sensitive,
> while instructions are shared between different code paths. Unless I'm
> missing what you are proposing :)

Ah, I missed that possibility. But is that actually something that
people do in practice? Or would it be okay for the verifier to just
assume an unknown value in these cases?