On Fri, Apr 10, 2020 at 1:47 AM Jann Horn <jannh@xxxxxxxxxx> wrote: > > On Fri, Apr 10, 2020 at 1:33 AM Andrii Nakryiko > <andrii.nakryiko@xxxxxxxxx> wrote: > > On Wed, Apr 8, 2020 at 12:42 PM Jann Horn <jannh@xxxxxxxxxx> wrote: > > > > > > Hi! > > > > > > I saw that BPF allows root to create frozen maps, for which the > > > verifier then assumes that they contain constant values. However, map > > > freezing is pretty wobbly: > > > > > > 1. The syscalls for updating maps from userspace don't seem to lock > > > the map at all. > > > > True, there is a tiny race between freezing and map updates, but I > > don't think it's possible to solve it without taking locks all around > > the place in map update operations. > > Yeah. So I think BPF should do exactly that. Or change the userspace > API so that userspace has to say at map creation time "I'll freeze > this map later", and then you only have to do the locking if that flag > is set. I'd love to be able to create frozen maps from the get go (and specify initial values for the map), but freezing is done the way it's done already, unfortunately :( Regarding locking, maps could be updated from BPF program side as well. I'd be curious to hear what others think about this issue. > > [...] > > > 3. It is assumed that a memory mapping can't be used to write to a > > > page anymore after the mapping has been removed; but actually, > > > userspace can grab references to pages in a VMA and use those > > > references to write to the VMA's pages after the VMA has already been > > > closed. (crasher attached as bpf-constant-data-uffd.c, compile with > > > "gcc -pthread ...") > > > > Please help me understand how that works (assuming we drop > > VM_MAYWRITE, of course). You mmap() as R/W, then munmap(), then > > freeze(). After munmap() refcount of writable pages should drop to > > zero. And mmap'ed address should be invalid and unmapped. I'm missing > > how after munmap() parallel thread still can write to that memory > > page? > > The mmap()/munmap() syscalls place references to the pages the kernel > is using in the page tables of the process. Some other syscalls (such > as process_vm_writev()) can read these page table entries, take their > own references on those backing pages, and then continue to access > those pages even after they've been removed from the task's page > tables by munmap(). This works as long as the page table entries don't > have magic marker bits on them that prohibit this, which you get if > you use something like remap_pfn_range() in a loop instead of > remap_vmalloc_range() - but the memory mappings created by that > syscall are weird, and e.g. some syscalls like read() and write() > might sometimes fail if the buffer argument points into such a memory > region. So mmap() subsystem won't event know about those extra references and thus we can't really account that in our code, right? That's sad, but hopefully those APIs are root-only? > > [...] > > > Is there a reason why the verifier doesn't replace loads from frozen > > > maps with the values stored in those maps? That seems like it would be > > > not only easier to secure, but additionally more performant. > > > > Verifier doesn't always know exact offset at which program is going to > > read read-only map contents. So something like this works: > > > > const volatile long arr[256]; > > > > int x = rand() % 256; > > int y = arr[x]; > > > > In this case verifier doesn't really know the value of y, so it can't > > be inlined. Then you can have code in which in one branch register is > > loaded with known value, but in another branch same register gets some > > value at random offset. Constant tracking is code path-sensitive, > > while instructions are shared between different code paths. Unless I'm > > missing what you are proposing :) > > Ah, I missed that possibility. But is that actually something that > people do in practice? Or would it be okay for the verifier to just > assume an unknown value in these cases? Verifier will assume unknown value for the branch that has variable offset. It can't do the same for another branch (with constant offset) because it might not yet have encountered branch with variable offset. But either way, you were proposing to rewrite instruction and inline read constant, and I don't think it's possible because of this.