Re: BPF map freezing is unreliable; can we instead just inline constants?

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Tue, 14 Apr 2020 12:46:20 -0700

On Tue, Apr 14, 2020 at 9:07 AM Jann Horn <jannh@xxxxxxxxxx> wrote:
>
> On Fri, Apr 10, 2020 at 10:48 PM Andrii Nakryiko
> <andrii.nakryiko@xxxxxxxxx> wrote:
> > On Fri, Apr 10, 2020 at 1:47 AM Jann Horn <jannh@xxxxxxxxxx> wrote:
> > > On Fri, Apr 10, 2020 at 1:33 AM Andrii Nakryiko
> > > <andrii.nakryiko@xxxxxxxxx> wrote:
> > > > On Wed, Apr 8, 2020 at 12:42 PM Jann Horn <jannh@xxxxxxxxxx> wrote:
> > > > >
> > > > > Hi!
> > > > >
> > > > > I saw that BPF allows root to create frozen maps, for which the
> > > > > verifier then assumes that they contain constant values. However, map
> > > > > freezing is pretty wobbly:
> > > > >
> > > > > 1. The syscalls for updating maps from userspace don't seem to lock
> > > > > the map at all.
> > > >
> > > > True, there is a tiny race between freezing and map updates, but I
> > > > don't think it's possible to solve it without taking locks all around
> > > > the place in map update operations.
> > >
> > > Yeah. So I think BPF should do exactly that. Or change the userspace
> > > API so that userspace has to say at map creation time "I'll freeze
> > > this map later", and then you only have to do the locking if that flag
> > > is set.
> >
> > I'd love to be able to create frozen maps from the get go (and specify
> > initial values for the map), but freezing is done the way it's done
> > already, unfortunately :(
> > Regarding locking, maps could be updated from BPF program side as
> > well. I'd be curious to hear what others think about this issue.
>
> Map freezing only has an effect on the verifier if the map was created
> with BPF_F_RDONLY_PROG, which prevents the BPF program from updating
> the map, right?

Yes, good point.

>
> > > > > 3. It is assumed that a memory mapping can't be used to write to a
> > > > > page anymore after the mapping has been removed; but actually,
> > > > > userspace can grab references to pages in a VMA and use those
> > > > > references to write to the VMA's pages after the VMA has already been
> > > > > closed. (crasher attached as bpf-constant-data-uffd.c, compile with
> > > > > "gcc -pthread ...")
> > > >
> > > > Please help me understand how that works (assuming we drop
> > > > VM_MAYWRITE, of course). You mmap() as R/W, then munmap(), then
> > > > freeze(). After munmap() refcount of writable pages should drop to
> > > > zero. And mmap'ed address should be invalid and unmapped. I'm missing
> > > > how after munmap() parallel thread still can write to that memory
> > > > page?
> > >
> > > The mmap()/munmap() syscalls place references to the pages the kernel
> > > is using in the page tables of the process. Some other syscalls (such
> > > as process_vm_writev()) can read these page table entries, take their
> > > own references on those backing pages, and then continue to access
> > > those pages even after they've been removed from the task's page
> > > tables by munmap(). This works as long as the page table entries don't
> > > have magic marker bits on them that prohibit this, which you get if
> > > you use something like remap_pfn_range() in a loop instead of
> > > remap_vmalloc_range() - but the memory mappings created by that
> > > syscall are weird, and e.g. some syscalls like read() and write()
> > > might sometimes fail if the buffer argument points into such a memory
> > > region.
> >
> > So mmap() subsystem won't event know about those extra references and
> > thus we can't really account that in our code, right? That's sad, but
> > hopefully those APIs are root-only?
>
> Nope, those APIs are reachable by normal users. These extra references
> are counted on the page refcount - since they have to be tracked
> somehow - but as far as I know, that refcount can also be elevated
> spuriously, so triggering hard errors based on it is probably not a
> good idea.
>

Just trying to educate myself and you seem to know a lot about this.
If we think about regular file memory-mapping with mmap(). According
to this, it seems like it would be possible to mmap() file as writable
first, do some changes and then munmap. At this point some application
would assume that file can't be modified anymore and can be treated as
read-only, yet, it's possible that some other process/thread can just
go and still modify file contents. Do I understand correctly?

>
> > > > > Is there a reason why the verifier doesn't replace loads from frozen
> > > > > maps with the values stored in those maps? That seems like it would be
> > > > > not only easier to secure, but additionally more performant.
> > > >
> > > > Verifier doesn't always know exact offset at which program is going to
> > > > read read-only map contents. So something like this works:
> > > >
> > > > const volatile long arr[256];
> > > >
> > > > int x = rand() % 256;
> > > > int y = arr[x];
> > > >
> > > > In this case verifier doesn't really know the value of y, so it can't
> > > > be inlined. Then you can have code in which in one branch register is
> > > > loaded with known value, but in another branch same register gets some
> > > > value at random offset. Constant tracking is code path-sensitive,
> > > > while instructions are shared between different code paths. Unless I'm
> > > > missing what you are proposing :)
> > >
> > > Ah, I missed that possibility. But is that actually something that
> > > people do in practice? Or would it be okay for the verifier to just
> > > assume an unknown value in these cases?
> >
> > Verifier will assume unknown value for the branch that has variable
> > offset. It can't do the same for another branch (with constant offset)
> > because it might not yet have encountered branch with variable offset.
> > But either way, you were proposing to rewrite instruction and inline
> > read constant, and I don't think it's possible because of this.
>
> Ah, I see what you mean. That sucks. I guess that means that to fix
> this up properly in such edgecases, we'd have to, for each memory
> read, keep track of all the values that we want to hardcode for it,
> and then generate branches in the unlikely case that the instruction
> was reached on paths that expect different values?

I guess, though that sounds extreme and extremely unlikely. I'd say
the better way would be to implement immutable BPF maps from the time
they are created. E.g., at the time of creating map, you specify extra
flag BPF_F_IMMUTABLE and specify pointer to a blob of memory with
key/value pairs in it.