On 9/9/22 3:25 PM, Alexei Starovoitov wrote:
> On Fri, Sep 09, 2022 at 11:32:40AM -0700, Andrii Nakryiko wrote:
>> On Fri, Sep 9, 2022 at 7:58 AM Alexei Starovoitov
>> <alexei.starovoitov@xxxxxxxxx> wrote:
>>>
>>> On Fri, Sep 9, 2022 at 7:51 AM Kumar Kartikeya Dwivedi <memxor@xxxxxxxxx> wrote:
>>>>
>>>> On Fri, 9 Sept 2022 at 16:24, Alexei Starovoitov
>>>> <alexei.starovoitov@xxxxxxxxx> wrote:
>>>>>
>>>>> On Fri, Sep 9, 2022 at 4:05 AM Kumar Kartikeya Dwivedi <memxor@xxxxxxxxx> wrote:
>>>>>>
>>>>>> On Fri, 9 Sept 2022 at 10:13, Dave Marchevsky <davemarchevsky@xxxxxx> wrote:
>>>>>>>
>>>>>>> On 9/4/22 4:41 PM, Kumar Kartikeya Dwivedi wrote:
>>>>>>>> Global variables reside in maps accessible using direct_value_addr
>>>>>>>> callbacks, so giving each load instruction's rewrite a unique reg->id
>>>>>>>> prevents us from holding locks which are global.
>>>>>>>>
>>>>>>>> This is not great, so refactor active_spin_lock into two separate
>>>>>>>> fields, active_spin_lock_ptr and active_spin_lock_id, which is generic
>>>>>>>> enough to allow it for global variables, map lookups, and local kptr
>>>>>>>> registers at the same time.
>>>>>>>>
>>>>>>>> Held vs. non-held is indicated by active_spin_lock_ptr, which stores
>>>>>>>> the reg->map_ptr or reg->btf pointer of the register used for locking
>>>>>>>> the spin lock. But active_spin_lock_id also needs to be compared to
>>>>>>>> ensure that bpf_spin_unlock is for the same register.
>>>>>>>>
>>>>>>>> Next, pseudo load instructions are not given a unique reg->id, as they
>>>>>>>> are doing a lookup for the same map value (max_entries is never
>>>>>>>> greater than 1).
>>>>>>>>
>>>>>>>
>>>>>>> For libbpf-style "internal maps" - like .bss.private further in this
>>>>>>> series - all the SEC(".bss.private") vars are globbed together into one
>>>>>>> map_value. e.g.
>>>>>>>
>>>>>>> struct bpf_spin_lock lock1 SEC(".bss.private");
>>>>>>> struct bpf_spin_lock lock2 SEC(".bss.private");
>>>>>>> ...
>>>>>>> spin_lock(&lock1);
>>>>>>> ...
>>>>>>> spin_lock(&lock2);
>>>>>>>
>>>>>>> will result in the same map but different offsets for the direct read
>>>>>>> (and different aux->map_off set in resolve_pseudo_ldimm64 for use in
>>>>>>> check_ld_imm). Seems like this patch would assign both the same
>>>>>>> (active_spin_lock_ptr, active_spin_lock_id).
>>>>>>>
>>>>>>
>>>>>> That won't be a problem. Two spin locks in a map value or datasec are
>>>>>> already rejected on BPF_MAP_CREATE, so there is no bug. See the
>>>>>> idx >= info_cnt check in btf_find_struct_field and btf_find_datasec_var.
>>>>>>
>>>>>> I could include the offset as the third part of the tuple. The problem
>>>>>> then is figuring out which lock protects which bpf_list_head. We would
>>>>>> need another __guarded_by annotation and force users to use it to
>>>>>> eliminate the ambiguity. So for now I just noted it in the commit log
>>>>>> and left it for the future.
>>>>>
>>>>> Let's not go that far yet.
>>>>> Extra annotations are just as confusing and non-obvious as
>>>>> putting locks in different sections.
>>>>> Let's keep the one-lock-per-map-value limitation for now.
>>>>> The libbpf side needs to allow many non-mmapable sections though.
>>>>> A single .bss.private name is too limiting.
>>>>
>>>> In that case, Dave, since the libbpf patch is yours, would you be fine
>>>> with reworking it to support multiple private maps? Maybe it can just
>>>> ignore the .XXX part in .bss.private.XXX?
>>>> Also, I think Andrii once mentioned that he wants to eventually merge
>>>> data and bss, so it might be a good idea to call it .data.private from
>>>> the start.
>>>
>>> I'd probably make all non-canonical names non-mmapable.
>>> The compiler already generates special sections.
>>> Thankfully the code doesn't use them yet, but it will sooner or later,
>>> so libbpf has to create hidden maps for them eventually.
>>> They shouldn't be messed with from user space, since that would screw up
>>> compiler-generated code.
>>>
>>> Andrii, what's your take?
>>
>> Ok, a bunch of things to unpack. We also discussed a lot of this with
>> Dave a few weeks ago, but I still have a few questions.
>>
>> First, I'd like to not keep extending ".bss" with custom ".bss.*"
>> sections. This is why we have .data.* and .rodata.* but no .bss.* (a bad,
>> meaningless, historic name).
>>
>> But I'm totally fine dedicating some other prefix to non-mmapable data
>> sections that won't be exposed in the skeleton and, well, won't be
>> mmapable. What to name it depends on what we anticipate putting in them.
>>
>> If it's just for spin locks, then something like SEC(".locks") seems
>> best to me. If it's for more stuff, like global kptrs, rbtrees, and
>> whatnot, then we'd need a more generic name (.private, or whatever; I
>> didn't think much about the best name). We can also allow .locks.* or
>> .private.* (i.e., keep it uniform with the .data and .rodata handling,
>> except for the mmapable aspect).
>>
>> One benefit of having SEC(".locks") just for spin locks is that we can
>> teach libbpf to create a multi-element ARRAY map, where each lock
>> variable is put into a separate element. From the BPF verifier's
>> perspective, there will be a single BTF type describing the spin lock,
>> but multiple "instances" of the lock, one per element. That seems a bit
>> magical, and I think, generally speaking, it's best to start supporting
>> multiple lock declarations within a single map element (and thus keep
>> track of their offsets within map_value); but at least that's an option.
>
> ".locks" won't work. We need lock+rb_root or lock+list_head to be
> in the same section.
> It should be up to the user to name that section with something
> meaningful.
> Ideally something like this should be supported:
> SEC("enqueue") struct bpf_spin_lock enqueue_lock;
> SEC("enqueue") struct bpf_list_head enqueue_head __contains(foo, node);
> SEC("dequeue") struct bpf_spin_lock dequeue_lock;
> SEC("dequeue") struct bpf_list_head dequeue_head __contains(foo, node);

Is the "head and lock must be in the same section / map_value" requirement
desired, or just a consequence of this implementation? I don't see why it's
desirable from the user's perspective. It seems to have the same problem as
rbtree RFCv1's rbtree_map struct creating its own bpf_spin_lock, namely not
providing a way for multiple data structures to share the same lock in a way
that makes sense to the verifier for enforcement.

>> Dave had some concerns about pinning such maps and whatnot, but for
>> starters we decided to not worry about pinning for now. Dave, please
>> bring up the remaining issues, if you don't mind.

@Andrii, aside from the vague pinning concerns from our last discussion about
this, I don't have any specific concerns. A multi-element ".locks" is actually
more appealing to me now, as I think it enables the best of both worlds for
this implementation and my rbtree RFCv2 experiments:

* This series uses (map_ptr, map_value_offset) as lock identity for
  verification purposes and expects the map_ptr for list_head and lock to be
  the same.
  * If my logic in the comment preceding this one is correct, the downside is
    no lock sharing between data structures.
* rbtree RFCv2 uses the lock address as lock identity for verification
  purposes and requires the lock address to be known when verifying the
  program using the lock.
  * Downside: no clear path forward for the general map_in_map case; it can be
    made to work for some specific cases, but it's kludgey.
* If ".locks" exists, supporting multiple lock definitions, we can use
  locks_sec_offset or locks_sec_map_{key,idx} as lock identity for
  verification purposes.
  * As a result, the "head and lock must be in the same section" requirement
    is removed, and there's a path forward for map_in_map inner maps to share
    locks arbitrarily without losing verifiability.
  * But I suspect this requires some special handling of the map backing
    ".locks" on the kernel side.

I have some hacks on top of rbtree RFCv2 that are moving in this ".locks"
direction; happy to fix them up and send something if I didn't miss anything
above. Regardless, @Kumar, I'm happy to iterate on the .bss.private patch
until it's in a shape that satisfies everyone.

> Pinning shouldn't be an issue.
> Only mmap is the problem. User space access is fine since the kernel
> will mask out special fields on read/write.
>
>> So to answer Alexei's specific question: I'm still not in favor of just
>> saying "anything that's not .data or .rodata is a non-mmapable map". I'd
>> rather carve out naming prefixes starting with . (which are reserved for
>> libbpf's own use) for these special-purpose maps. I don't think that
>> limits anyone, right?
>
> Is backward compat a concern?
> Whether to mmap global data is a flag.
> It can be opt-in or opt-out.
> I'm proposing to make all named sections 'do not mmap'.
> If a section needs to be mmaped and appear in the skeleton, the user can
> do SEC("my_section.mmap").
>
> What you're proposing is the other way around:
> SEC("enqueue.nommap")
> SEC("dequeue.nommap")
> in the above example.
> I guess it's fine, but more verbose.
> My gut feeling is that the use case for naming sections will be
> specifically lock+rbtree. Everything else will go into the common global
> .data or .rodata.
> The same thinking applies to compiler-generated special sections with
> constants. They shouldn't be mmaped by default, but we're not going to
> hack llvm to add a ".nommap" suffix to such sections.
> Hence the proposal to avoid mmap by default for all non-standard sections.