On Fri, Apr 8, 2022 at 3:04 PM Joanne Koong <joannelkoong@xxxxxxxxx> wrote: > > On Wed, Apr 6, 2022 at 3:23 PM Andrii Nakryiko > <andrii.nakryiko@xxxxxxxxx> wrote: > > > > On Fri, Apr 1, 2022 at 7:00 PM Joanne Koong <joannekoong@xxxxxx> wrote: > > > > > > From: Joanne Koong <joannelkoong@xxxxxxxxx> > > > > > > This patch adds 3 new APIs and the bulk of the verifier work for > > > supporting dynamic pointers in bpf. > > > > > > There are different types of dynptrs. This patch starts with the most > > > basic ones, ones that reference a program's local memory > > > (eg a stack variable) and ones that reference memory that is dynamically > > > allocated on behalf of the program. If the memory is dynamically > > > allocated by the program, the program *must* free it before the program > > > exits. This is enforced by the verifier. > > > > > > The added APIs are: > > > > > > long bpf_dynptr_from_mem(void *data, u32 size, struct bpf_dynptr *ptr); > > > long bpf_malloc(u32 size, struct bpf_dynptr *ptr); > > > void bpf_free(struct bpf_dynptr *ptr); > > > > > > This patch sets up the verifier to support dynptrs. Dynptrs will always > > > reside on the program's stack frame. As such, their state is tracked > > > in their corresponding stack slot, which includes the type of dynptr > > > (DYNPTR_LOCAL vs. DYNPTR_MALLOC). > > > > > > When the program passes in an uninitialized dynptr (ARG_PTR_TO_DYNPTR | > > > MEM_UNINIT), the stack slots corresponding to the frame pointer > > > where the dynptr resides at is marked as STACK_DYNPTR. For helper functions > > > that take in iniitalized dynptrs (such as the next patch in this series > > > which supports dynptr reads/writes), the verifier enforces that the > > > dynptr has been initialized by checking that their corresponding stack > > > slots have been marked as STACK_DYNPTR. Dynptr release functions > > > (eg bpf_free) will clear the stack slots. The verifier enforces at program > > > exit that there are no dynptr stack slots that need to be released. > > > > > > There are other constraints that are enforced by the verifier as > > > well, such as that the dynptr cannot be written to directly by the bpf > > > program or by non-dynptr helper functions. The last patch in this series > > > contains tests that trigger different cases that the verifier needs to > > > successfully reject. > > > > > > Signed-off-by: Joanne Koong <joannelkoong@xxxxxxxxx> > > > --- > > > include/linux/bpf.h | 74 ++++++++- > > > include/linux/bpf_verifier.h | 18 +++ > > > include/uapi/linux/bpf.h | 40 +++++ > > > kernel/bpf/helpers.c | 88 +++++++++++ > > > kernel/bpf/verifier.c | 266 ++++++++++++++++++++++++++++++++- > > > scripts/bpf_doc.py | 2 + > > > tools/include/uapi/linux/bpf.h | 40 +++++ > > > 7 files changed, 521 insertions(+), 7 deletions(-) > > > > > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h > > > index cb9f42866cde..e0fcff9f2aee 100644 > > > --- a/include/linux/bpf.h > > > +++ b/include/linux/bpf.h > > > @@ -346,7 +346,13 @@ enum bpf_type_flag { > > > > > > MEM_RELEASE = BIT(6 + BPF_BASE_TYPE_BITS), > > > > > > - __BPF_TYPE_LAST_FLAG = MEM_RELEASE, > > > + /* DYNPTR points to a program's local memory (eg stack variable). */ > > > + DYNPTR_TYPE_LOCAL = BIT(7 + BPF_BASE_TYPE_BITS), > > > + > > > + /* DYNPTR points to dynamically allocated memory. */ > > > + DYNPTR_TYPE_MALLOC = BIT(8 + BPF_BASE_TYPE_BITS), > > > + > > > + __BPF_TYPE_LAST_FLAG = DYNPTR_TYPE_MALLOC, > > > }; > > > > > > /* Max number of base types. */ > > > @@ -390,6 +396,7 @@ enum bpf_arg_type { > > > ARG_PTR_TO_STACK, /* pointer to stack */ > > > ARG_PTR_TO_CONST_STR, /* pointer to a null terminated read-only string */ > > > ARG_PTR_TO_TIMER, /* pointer to bpf_timer */ > > > + ARG_PTR_TO_DYNPTR, /* pointer to bpf_dynptr. See bpf_type_flag for dynptr type */ > > > __BPF_ARG_TYPE_MAX, > > > > > > /* Extended arg_types. */ > > > @@ -2396,4 +2403,69 @@ int bpf_bprintf_prepare(char *fmt, u32 fmt_size, const u64 *raw_args, > > > u32 **bin_buf, u32 num_args); > > > void bpf_bprintf_cleanup(void); > > > > > > +/* the implementation of the opaque uapi struct bpf_dynptr */ > > > +struct bpf_dynptr_kern { > > > + u8 *data; > > > > nit: u8 * is too specific, it's not always "bytes" of data. Let's use `void *`? > Sounds great! My reason for going with u8 * instead of void * is that > void pointer arithmetic in C is invalid - but it seems like this isn't > something we have to worry about here since gcc is the default > compiler for linux and gcc allows it as an extension Right, we do void * arithmetic everywhere. I thought this restriction is C++-specific. But there might be GCC extensions as well. > > > > > + /* The upper 4 bits are reserved. Bit 29 denotes whether the > > > + * dynptr is read-only. Bits 30 - 32 denote the dynptr type. > > > + */ > > > > not essential, but I think using highest bit for read-only and then > > however many next upper bits for dynptr kind is a bit cleaner > > approach. > I'm happy with either - I was thinking if we have the uppermost bits > be dynptr kind, then that makes it easiest to get the dynptr type > (simply size >> DYNPTR_TYPE_SHIFT) whereas if the read-only bit is the > highest bit, then we also need to clear that out. But not a big deal > :) I think you'll want to define DYNPTR_TYPE_MASK anyways and then it's just (size >> DYNPTR_TYPE_SHIFT) & DYNPTR_TYPE_MASK. > > > > also it seems like normally bits are zero-indexed, so, pedantically, > > there is no bit 32, it's bit #31 > > > > > + u32 size; > > > + u32 offset; > > > > Let's document the semantics of offset and size. E.g., if I have > > offset 4 and size 20, does it mean there were 24 bytes, but we ignore > > first 4 and can address next 20, or does it mean that there is 20 > > bytes, we skip first 4 and have 16 addressable. Basically, usable size > > is just size of size - offset? That will change how/whether the size > > is adjusted when offset is moved. > > > > > +} __aligned(8); > > > + > > > +enum bpf_dynptr_type { > > > > it's a good idea to have default zero value to be BPF_DYNPTR_TYPE_INVALID > > > > > + /* Local memory used by the bpf program (eg stack variable) */ > > > + BPF_DYNPTR_TYPE_LOCAL, > > > + /* Memory allocated dynamically by the kernel for the dynptr */ > > > + BPF_DYNPTR_TYPE_MALLOC, > > > +}; > > > + > > > +/* The upper 4 bits of dynptr->size are reserved. Consequently, the > > > + * maximum supported size is 2^28 - 1. > > > + */ > > > +#define DYNPTR_MAX_SIZE ((1UL << 28) - 1) > > > +#define DYNPTR_SIZE_MASK 0xFFFFFFF > > > +#define DYNPTR_TYPE_SHIFT 29 > > > > I'm thinking that maybe we should start with reserving entire upper > > byte in size and offset to be on the safer side? And if 16MB of > > addressable memory blob isn't enough, we can always relaxed it later. > > WDYT? > > > This sounds great! I will make this change for v2 sounds good > > > + > > > +static inline enum bpf_dynptr_type bpf_dynptr_get_type(struct bpf_dynptr_kern *ptr) > > > +{ > > > + return ptr->size >> DYNPTR_TYPE_SHIFT; > > > +} > > > + > > > +static inline void bpf_dynptr_set_type(struct bpf_dynptr_kern *ptr, enum bpf_dynptr_type type) > > > +{ > > > + ptr->size |= type << DYNPTR_TYPE_SHIFT; > > > +} > > > + > > > +static inline u32 bpf_dynptr_get_size(struct bpf_dynptr_kern *ptr) > > > +{ > > > + return ptr->size & DYNPTR_SIZE_MASK; > > > +} > > > + > > > +static inline int bpf_dynptr_check_size(u32 size) > > > +{ > > > + if (size == 0) > > > + return -EINVAL; > > > > What's the downside of allowing size 0? Honest question. I'm wondering > > why prevent having dynptr pointing to an "empty slice"? It might be a > > useful feature in practice. > I don't see the use of dynptrs that point to something of size 0, so I > thought it'd be simplest to just return an -EINVAL if the user tries > to create one. I don't have a particular preference for handling this > though - especially if this will be a useful feature in the future, > then I agree we should just let the user create and use empty slices > if they wish to. taking Go slices as an example, empty slice is a useful thing sometimes, makes some algorithms more uniform. > > > > > + > > > + if (size > DYNPTR_MAX_SIZE) > > > + return -E2BIG; > > > + > > > + return 0; > > > +} > > > + > > > +static inline int bpf_dynptr_check_off_len(struct bpf_dynptr_kern *ptr, u32 offset, u32 len) > > > +{ > > > + u32 capacity = bpf_dynptr_get_size(ptr) - ptr->offset; > > > + > > > + if (len > capacity || offset > capacity - len) > > > + return -EINVAL; > > > + > > > + return 0; > > > +} > > > + > > > +void bpf_dynptr_init(struct bpf_dynptr_kern *ptr, void *data, enum bpf_dynptr_type type, > > > + u32 offset, u32 size); > > > + > > > +void bpf_dynptr_set_null(struct bpf_dynptr_kern *ptr); > > > + > > > #endif /* _LINUX_BPF_H */ > > > diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h > > > index 7a01adc9e13f..bc0f105148f9 100644 > > > --- a/include/linux/bpf_verifier.h > > > +++ b/include/linux/bpf_verifier.h > > > @@ -72,6 +72,18 @@ struct bpf_reg_state { > > > > > > u32 mem_size; /* for PTR_TO_MEM | PTR_TO_MEM_OR_NULL */ > > > > > > + /* for dynptr stack slots */ > > > + struct { > > > + enum bpf_dynptr_type dynptr_type; > > > + /* A dynptr is 16 bytes so it takes up 2 stack slots. > > > + * We need to track which slot is the first slot > > > + * to protect against cases where the user may try to > > > + * pass in an address starting at the second slot of the > > > + * dynptr. > > > + */ > > > + bool dynptr_first_slot; > > > + }; > > > > why not > > > > struct { > > enum bpf_dynptr_type type; > > bool first_lot; > > } dynptr; > > > > ? I think it's cleaner grouping > Agreed! I will make this change for v2 > > > [...] > > > > > + * Description > > > + * Dynamically allocate memory of *size* bytes. > > > + * > > > + * Every call to bpf_malloc must have a corresponding > > > + * bpf_free, regardless of whether the bpf_malloc > > > + * succeeded. > > > + * > > > + * The maximum *size* supported is DYNPTR_MAX_SIZE. > > > + * Return > > > + * 0 on success, -ENOMEM if there is not enough memory for the > > > + * allocation, -EINVAL if the size is 0 or exceeds DYNPTR_MAX_SIZE. > > > + * > > > + * void bpf_free(struct bpf_dynptr *ptr) > > > > thinking about the next patch set that will add storing this malloc > > dynptr into the map, bpf_free() will be a lie, right? As it will only > > decrement a refcnt, not necessarily free it, right? So maybe just > > generic bpf_dynptr_put() or bpf_malloc_put() or something like that is > > a bit more "truthful"? > I like the simplicity of bpf_free(), but I can see how that might be > confusing. What are your thoughts on "bpf_dynptr_free()"? Since when > we get into dynptrs that are stored in maps vs. dynptrs stored > locally, calling bpf_dynptr_free() frees (invalidates) your local > dynptr even if it doesn't free the underlying memory if it still has > valid refcounts on it? To me, "malloc" and "_free" go more intuitively > together as a pair. Sounds good to me (though let's use _dynptr() as a suffix consistently). I also just realized that maybe we should call bpf_malloc() a bpf_malloc_dynptr() instead. I can see how we might want to enable plain bpf_malloc() with statically known size (similar to statically known bpf_ringbuf_reserve()) for temporary local malloc with direct memory access? So bpf_malloc_dynptr() would be a dynptr-enabled counterpart to fixed-sized bpf_malloc()? And then bpf_free() will work with direct pointer returned from bpf_malloc(), while bpf_free_dynptr() will work with dynptr returned from bpf_malloc_dynptr(). > > > > > + * Description > > > + * Free memory allocated by bpf_malloc. > > > + * > > > + * After this operation, *ptr* will be an invalidated dynptr. > > > + * Return > > > + * Void. > > > */ > [...] > > > +const struct bpf_func_proto bpf_dynptr_from_mem_proto = { > > > + .func = bpf_dynptr_from_mem, > > > + .gpl_only = false, > > > + .ret_type = RET_INTEGER, > > > + .arg1_type = ARG_PTR_TO_MEM, > > > > need to think what to do with uninit stack slots. Do we need > > bpf_dnptr_from_uninit_mem() or we just allow ARG_PTR_TO_MEM | > > MEM_UNINIT here? > I think we can just change this to ARG_PTR_TO_MEM | MEM_UNINIT. sgtm > > > > > + .arg2_type = ARG_CONST_SIZE_OR_ZERO, > > > + .arg3_type = ARG_PTR_TO_DYNPTR | DYNPTR_TYPE_LOCAL | MEM_UNINIT, > > > +}; > > > + > > > +BPF_CALL_2(bpf_malloc, u32, size, struct bpf_dynptr_kern *, ptr) > > > +{ > > > + void *data; > > > + int err; > > > + > > > + err = bpf_dynptr_check_size(size); > > > + if (err) { > > > + bpf_dynptr_set_null(ptr); > > > + return err; > > > + } > > > + > > > + data = kmalloc(size, GFP_ATOMIC); > > > > we have this fancy logic now to allow non-atomic allocation inside > > sleepable programs, can we use that here as well? In sleepable mode it > > would be nice to wait for malloc() to grab necessary memory, if > > possible. > Agreed - I'm planning to do this in a later "dynptr optimizations" > patchset (which will also include inlining BPF instructions for some > of the helper functions) great, ok > > > > > + if (!data) { > > > + bpf_dynptr_set_null(ptr); > > > + return -ENOMEM; > > > + } > > > + > > > > so.... kmalloc() doesn't zero initialize the memory. I think it's a > > great property (which we can later modify with flags, if necessary), > > so I'd do zero-initialization by default. we can keep calling it > > bpf_malloc() instead of bpf_zalloc(), of course. > > > [...] > > > +static inline int get_spi(s32 off) > > > +{ > > > + return (-off - 1) / BPF_REG_SIZE; > > > +} > > > + > > > +static bool check_spi_bounds(struct bpf_func_state *state, int spi, u32 nr_slots) > > > > "check_xxx"/"validate_xxx" pattern has ambiguity when it comes to > > interpreting its return value. In some cases it would be 0 for success > > and <0 for error, in this it's true/false where probably true meaning > > all good. It's unfortunate to have to think about this when reading > > code. If you call it something like "is_stack_range_valid" it would be > > much more natural to read and reason about, IMO. > Great point! I'll change this naming for v2 > > > > BTW, what does "spi" stand for? "stack pointer index"? slot_idx? > It's not formally documented anywhere but I assume it's short for > "stack pointer index". > > > > > +{ > > > + int allocated_slots = state->allocated_stack / BPF_REG_SIZE; > > > + > > > + return allocated_slots > spi && nr_slots - 1 <= spi; > > > > ok, this is personal preferences, but it took me considerable time to > > try to understand what's being checked here (this backwards grow of > > slot indices also threw me off). But seems like we have a range of > > slots that are calculated as [spi - nr_slots + 1, spi] and we want to > > check that it's within [0, allocated_stack), so most straightforward > > way would be: > > > > return spi - nr_slots + 1 >= 0 && spi < allocated_slots; > > > > And I'd definitely leave a comment about this whole index grows > > downwards (it's not immediately obvious even if you know that indices > > are derived from negative stack offsets) > Awesome, I will make these edits for v2 > > > [...] > > > + switch (type) { > > > + case DYNPTR_TYPE_LOCAL: > > > + *dynptr_type = BPF_DYNPTR_TYPE_LOCAL; > > > + break; > > > + case DYNPTR_TYPE_MALLOC: > > > + *dynptr_type = BPF_DYNPTR_TYPE_MALLOC; > > > + break; > > > + default: > > > + /* Can't have more than one type set and can't have no > > > + * type set > > > + */ > > > + return -EINVAL; > > > > see above about BPF_DYNPTR_TYPE_INVALID, with that you don't have to > > use out parameter, just return enum bpf_dynptr_type directly with > > BPF_DYNPTR_TYPE_INVALID marking an error > Nice! I love this suggestion - it makes this a lot smoother. > > > > > + } > > > + > > > + return 0; > > > +} > > > + > [...] > > > + > > > +/* Check if the dynptr argument is a proper initialized dynptr */ > > > +static bool check_dynptr_init(struct bpf_verifier_env *env, struct bpf_reg_state *reg, > > > + enum bpf_arg_type arg_type) > > > > is_dynptr_valid()? You are not checking if it's just initialized but > > also that it matches arg_type, right? Also see my rambling about > > check_xxx naming > I will rename this to is_dynptr_init_valid(). is_dynptr_valid() might > be too generic - for example, a valid dynptr should also have its > stack slots marked accordingly, which isn't true here since this > dynptr is uninitialized. I think is_dynptr_init_valid() will be > clearest sgtm > > > > > +{ > > > + struct bpf_func_state *state = func(env, reg); > > > + enum bpf_dynptr_type expected_type; > > > + int spi, err; > > > + > > > + /* Can't pass in a dynptr at a weird offset */ > > > + if (reg->off % BPF_REG_SIZE) > > > + return false; > > > + > > > + spi = get_spi(reg->off); > > > + > > > + if (!check_spi_bounds(state, spi, BPF_DYNPTR_NR_SLOTS)) > > > + return false; > > > + > > > + if (!state->stack[spi].spilled_ptr.dynptr_first_slot) > > > + return false; > > > + > > > + if (state->stack[spi].slot_type[0] != STACK_DYNPTR) > > > + return false; > > > + > > > + /* ARG_PTR_TO_DYNPTR takes any type of dynptr */ > > > + if (arg_type == ARG_PTR_TO_DYNPTR) > > > + return true; > > > + > > > + err = arg_to_dynptr_type(arg_type, &expected_type); > > > + if (unlikely(err)) > > > + return err; > > > + > > > + return state->stack[spi].spilled_ptr.dynptr_type == expected_type; > > > +} > [...] > > > +/* > > > + * Determines whether the id used for reference tracking is held in a stack slot > > > + * or in a register > > > + */ > > > +static bool id_in_stack_slot(enum bpf_arg_type arg_type) > > > > is_ or has_ is a good idea for such bool-returning helpers (similarly > > for stack_access_into_dynptr above), otherwise it reads like a verb > > and command to do something > > > > but looking few lines below, if (arg_type_is_dynptr()) would be > > clearer than extra wrapper function, not sure what's the purpose of > > the helper > My thinking behind this extra wrapper function was that it'd be more > extensible in the future if there are other types that will store > their id in the stack slot. But I think I'm over-optimizing here :) > I'll remove this wrapper function yeah, it's internal implementation, so we can always refactor > > > [...] > > > @@ -5572,6 +5758,40 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg, > > > bool zero_size_allowed = (arg_type == ARG_CONST_SIZE_OR_ZERO); > > > > > > err = check_mem_size_reg(env, reg, regno, zero_size_allowed, meta); > > > + } else if (arg_type_is_dynptr(arg_type)) { > > > + bool initialized = check_dynptr_init(env, reg, arg_type); > > > + > > > + if (type_is_uninit_mem(arg_type)) { > > > + if (initialized) { > > > + verbose(env, "Arg #%d dynptr cannot be an initialized dynptr\n", > > > + arg + 1); > > > + return -EINVAL; > > > + } > > > + meta->raw_mode = true; > > > + err = check_helper_mem_access(env, regno, BPF_DYNPTR_SIZE, false, meta); > > > + /* For now, we do not allow dynptrs to point to existing > > > + * refcounted memory > > > + */ > > > + if (reg_type_may_be_refcounted_or_null(regs[BPF_REG_1].type)) { > > > > hard-coded BPF_REG_1? > > I'm viewing this as a temporary line because one of the patches in a > later dynptr patchset will enable support for local dynptrs to point > to existing refcounted memory. The alternative is to add a new > bpf_type_flag like NO_REFCOUNT and then remove that flag later. What > are your thoughts? my concern and confusion was that it's a hard-coded BPF_REG_1 instead of using arg to derive register itself. Why making unnecessary assumptions about this always being a first argument? > > > > > + verbose(env, "Arg #%d dynptr memory cannot be potentially refcounted\n", > > > + arg + 1); > > > + return -EINVAL; > > > + } > > > + } else { > > > + if (!initialized) { > > > + char *err_extra = ""; > > > > const char * > > > > > + > > > + if (arg_type & DYNPTR_TYPE_LOCAL) > > > + err_extra = "local "; > > > + else if (arg_type & DYNPTR_TYPE_MALLOC) > > > + err_extra = "malloc "; > > > + verbose(env, "Expected an initialized %sdynptr as arg #%d\n", > > > + err_extra, arg + 1); > > > > what if helper accepts two or more different types of dynptr? > Currently, bpf_dynptr_read/write accept any type of dynptr so they > don't set any dynptr type flag, which means this error would just > print "Expected an initialized dynptr as arg...". But you're right > that in the future, there can be some API that accepts only a subset > (eg mallocs and ringbuffers and not local dynptrs); in this case, > maybe the simplest is just to return a generic "Expected an > initialized dynptr as arg...". Do you think this suffices or do you > think it'd be worth the effort to print out the different types of > initialized dynptrs it expects? Yeah, let's keep it simple with generic error, doing multiple string concatenations for this seems like an overkill. > > > > > + return -EINVAL; > > > + } > > > + if (type_is_release_mem(arg_type)) > > > + err = unmark_stack_slots_dynptr(env, reg); > > > + } > > > } else if (arg_type_is_alloc_size(arg_type)) { > > > if (!tnum_is_const(reg->var_off)) { > > > verbose(env, "R%d is not a known constant'\n", > > > > [...] > Thanks for your feedback, Andrii!! sure, np