Re: [PATCH bpf-next 04/16] bpf: Introduce bpf_arena.

On Thu, Feb 8, 2024 at 1:58 PM Barret Rhoden <brho@xxxxxxxxxx> wrote:
>
> On 2/8/24 01:26, Alexei Starovoitov wrote:
> > Also I believe I addressed all issues with missing mutex and wrap around,
> > and pushed to:
> > https://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf.git/commit/?h=arena&id=e1cb522fee661e7346e8be567eade9cf607eaf11
> > Please take a look.
>
> LGTM, thanks.
>
> minor things:
>
> > +static void arena_vm_close(struct vm_area_struct *vma)
> > +{
> > +     struct vma_list *vml;
> > +
> > +     vml = vma->vm_private_data;
> > +     list_del(&vml->head);
> > +     vma->vm_private_data = NULL;
> > +     kfree(vml);
> > +}
>
> i think this also needs to be protected by the arena mutex.  otherwise two
> VMAs that close at the same time can corrupt the arena vma_list, and so can
> a VMA that closes while you're zapping.

Excellent catch.
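
Something like this should do it (just a sketch; I'm assuming the
arena's mutex is the same one remember_vma() takes and that it's
reachable as arena->lock via vma->vm_file->private_data -- adjust
to whatever the fields are actually called):

static void arena_vm_close(struct vm_area_struct *vma)
{
        struct bpf_map *map = vma->vm_file->private_data;
        struct bpf_arena *arena = container_of(map, struct bpf_arena, map);
        struct vma_list *vml = vma->vm_private_data;

        /* serialize against remember_vma() and the zap path */
        mutex_lock(&arena->lock);
        list_del(&vml->head);
        vma->vm_private_data = NULL;
        mutex_unlock(&arena->lock);
        kfree(vml);
}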

> remember_vma() already has the mutex held, since it's called from mmap.
>
> > +static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt, int node_id)
> > +{
> > +     long page_cnt_max = (arena->user_vm_end - arena->user_vm_start) >> PAGE_SHIFT;
>
> this function and arena_free_pages() are both using user_vm_start/end
> before grabbing the mutex, so they need to grab the mutex very early.
>
> alternatively, you could make it so that the user must set the
> user_vm_start via map_extra, so you don't have to worry about these
> changing after the arena is created.

Looks like I lost the diff hunk where the verifier checks that
the arena has user_vm_start set before loading the prog.
And for some reason I forgot to remove the
if (!arena->user_vm_start) return ..
check in bpf_arena_alloc/free_page().
I'll remove the latter and add the verifier enforcement back.
The intent was to never call arena_alloc/free_pages() when the arena is
not fully formed.
Once that's fixed there will be no race in arena_alloc_pages(),
since user_vm_end/start are fixed before the program is loaded.
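
Roughly speaking the enforcement will look like this (a sketch only;
the helper name and the exact spot in the verifier are placeholders,
not the lost hunk):

/* arena.c: report whether user space already fixed the range
 * (helper name is a placeholder)
 */
u64 bpf_arena_get_user_vm_start(struct bpf_map *map)
{
        struct bpf_arena *arena = container_of(map, struct bpf_arena, map);

        return arena->user_vm_start;
}

/* verifier side, when an arena map is added to the prog's used_maps */
if (map->map_type == BPF_MAP_TYPE_ARENA &&
    !bpf_arena_get_user_vm_start(map)) {
        verbose(env, "arena's user address must be set via map_extra or mmap()\n");
        return -EINVAL;
}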

One more thing.
The vmap_pages_range_wrap32() fix that you saw in that commit is not
enough.
It turns out that [%r12 + src_reg + off] in the JIT asm doesn't
fully conform to "the kernel bounds all accesses into 32-bit".
That "+ off" part is added _after_ src_reg is bounded to 32-bit.
Remember, that was the reason we added guard pages before and after
the kernel's 4Gb vm area.
It's working as intended, but for this wrap32 case we would need to
map one page into the normal kernel vma _and_ into the guard page.
Consider your example:
user_start_va = 0x1,fffff000
user_end_va =   0x2,fffff000

pgoff = 0 corresponds to uaddr 0x1,fffff000.
Its kaddr is kern_vm_start + 0xfffff000,
and kaddr + PAGE_SIZE wraps around to kern_vm_start + 0.

When a bpf prog accesses an arena pointer it can do:
dst_reg = *(u64 *)(src_reg + 0)
and
dst_reg = *(u64 *)(src_reg + 4096)

the first LDX is fine, but the 2nd one will fault
when src_reg is 0xfffff000.
From the user space point of view it's a virtually contiguous address range.
For the bpf prog it's also contiguous when src_reg is 32-bit bounded,
but the "+ 4096" breaks that.
The 2nd load becomes:
kern_vm_start + 0xfffff000 + 4096
and it faults.
Theoretically a solution is to do:
kern_vm_start + (u32)(0xfffff000 + 4096)
in JIT, but that is too expensive.
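
To make the arithmetic concrete, here's a tiny userspace demo of the
two computations (the kern_vm_start value is made up, purely for
illustration):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
        uint64_t kern_vm_start = 0xffffc90000000000ull; /* made-up base */
        uint64_t uaddr = 0xfffff000;  /* lower 32 bits of user_start_va */
        uint64_t off = 4096;

        /* what [%r12 + src_reg + off] computes: src_reg is bounded
         * to 32-bit _before_ off is added
         */
        uint64_t jit_addr = kern_vm_start + (uint32_t)uaddr + off;

        /* where the page would be if the lower 32 bits were allowed to wrap */
        uint64_t wrapped_addr = kern_vm_start + (uint32_t)(uaddr + off);

        printf("jit access:   %#llx (4Gb past kern_vm_start, i.e. guard area)\n",
               (unsigned long long)jit_addr);
        printf("wrapped page: %#llx (kern_vm_start + 0)\n",
               (unsigned long long)wrapped_addr);
        return 0;
}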

Hence I went with the arena-side fix (ignore the lack of error checking):
static int vunmap_guard_pages(u64 kern_vm_start, u64 start, u64 end)
{
        end = (u32)end;
        if (start < S16_MAX) {
                u64 end1 = min(end, S16_MAX + 1);

                vunmap_range(kern_vm_start + (1ull << 32) + start,
                             kern_vm_start + (1ull << 32) + end1);
        }

        if (end >= U32_MAX - S16_MAX + 1) {
                u64 start2 = max(start, U32_MAX - S16_MAX + 1);

                vunmap_range(kern_vm_start - (1ull << 32) + start2,
                             kern_vm_start - (1ull << 32) + end);
        }
        return 0;
}

static int vmap_pages_range_wrap32(u64 kern_vm_start, u64 uaddr, u64 page_cnt,
                                   struct page **pages)
{
        u64 start = kern_vm_start + uaddr;
        u64 end = start + page_cnt * PAGE_SIZE;
        u64 part1_page_cnt, start2, end2;
        int ret;

        if (page_cnt == 1 || !((uaddr + page_cnt * PAGE_SIZE) >> 32)) {
                /* uaddr doesn't overflow in 32-bit */
                ret = vmap_pages_range(start, end, PAGE_KERNEL, pages, PAGE_SHIFT);
                if (ret)
                        return ret;
                vmap_guard_pages(kern_vm_start, uaddr,
                                 uaddr + page_cnt * PAGE_SIZE, pages);
                return 0;
        }

        part1_page_cnt = ((1ull << 32) - (u32)uaddr) >> PAGE_SHIFT;
        end = start + part1_page_cnt * PAGE_SIZE;
        ret = vmap_pages_range(start, end, PAGE_KERNEL, pages, PAGE_SHIFT);
        if (ret)
                return ret;

        vmap_guard_pages(kern_vm_start, uaddr,
                         uaddr + part1_page_cnt * PAGE_SIZE, pages);

        start2 = kern_vm_start;
        end2 = start2 + (page_cnt - part1_page_cnt) * PAGE_SIZE;
        ret = vmap_pages_range(start2, end2, PAGE_KERNEL,
                               &pages[part1_page_cnt], PAGE_SHIFT);
        if (ret) {
                vunmap_range(start, end);
                return ret;
        }

        vmap_guard_pages(kern_vm_start, 0,
                         (page_cnt - part1_page_cnt) * PAGE_SIZE,
                         pages + part1_page_cnt);
        return 0;
}

It's working, but it's too complicated.
Instead of a single vmap_pages_range()
we might need up to 4 calls and have to map certain pages into
two places to make both 64-bit virtual addresses:
kern_vm_start + 0xfffff000 + 4096
and
kern_vm_start + (u32)(0xfffff000 + 4096)
point to the same page.

I'm inclined to tackle the wrap32 issue differently and simply
disallow [user_vm_start, user_vm_end] combinations
where the lower 32 bits can wrap.

In other words it would mean that an mmap() of len=4Gb will be
aligned to 4Gb,
while an mmap() of len=1M will be offset in such a way
that both addr and addr+1M have the same upper 32 bits.
(That's not the same as 1M aligned.)
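
A rough sketch of how the mmap side could pick such an address (the
arena-specific get_unmapped_area hook and the over-allocate-then-realign
trick below are just one way to do it, not the final code):

static unsigned long arena_get_unmapped_area(struct file *filp, unsigned long addr,
                                             unsigned long len, unsigned long pgoff,
                                             unsigned long flags)
{
        unsigned long ret;

        /* ask for twice the length so there is always room to shift
         * the range to where addr and addr + len - 1 share the upper 32 bits
         */
        ret = current->mm->get_unmapped_area(filp, addr, len * 2, pgoff, flags);
        if (IS_ERR_VALUE(ret))
                return ret;
        if ((ret >> 32) == ((ret + len - 1) >> 32))
                return ret; /* doesn't cross a 4Gb boundary, use as is */
        /* the range crosses a 4Gb boundary; that boundary is within len
         * of ret, so rounding up still fits inside the len * 2 area
         */
        return round_up(ret, 1ull << 32);
}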

With that I will remove vmap_pages_range_wrap32() and
do a single normal vmap_pages_range() without extra tricks.

wdyt?




