Re: [PATCH RFC bpf-next v1 21/32] bpf: Allow locking bpf_spin_lock global variables

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Fri, 9 Sep 2022 13:21:05 -0700

On Fri, Sep 9, 2022 at 12:25 PM Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Fri, Sep 09, 2022 at 11:32:40AM -0700, Andrii Nakryiko wrote:
> > On Fri, Sep 9, 2022 at 7:58 AM Alexei Starovoitov
> > <alexei.starovoitov@xxxxxxxxx> wrote:
> > >
> > > On Fri, Sep 9, 2022 at 7:51 AM Kumar Kartikeya Dwivedi <memxor@xxxxxxxxx> wrote:
> > > >
> > > > On Fri, 9 Sept 2022 at 16:24, Alexei Starovoitov
> > > > <alexei.starovoitov@xxxxxxxxx> wrote:
> > > > >
> > > > > On Fri, Sep 9, 2022 at 4:05 AM Kumar Kartikeya Dwivedi <memxor@xxxxxxxxx> wrote:
> > > > > >
> > > > > > On Fri, 9 Sept 2022 at 10:13, Dave Marchevsky <davemarchevsky@xxxxxx> wrote:
> > > > > > >
> > > > > > > On 9/4/22 4:41 PM, Kumar Kartikeya Dwivedi wrote:
> > > > > > > > Global variables reside in maps accessible using direct_value_addr
> > > > > > > > callbacks, so giving each load instruction's rewrite a unique reg->id
> > > > > > > > disallows us from holding locks which are global.
> > > > > > > >
> > > > > > > > This is not great, so refactor the active_spin_lock into two separate
> > > > > > > > fields, active_spin_lock_ptr and active_spin_lock_id, which is generic
> > > > > > > > enough to allow it for global variables, map lookups, and local kptr
> > > > > > > > registers at the same time.
> > > > > > > >
> > > > > > > > Held vs non-held is indicated by active_spin_lock_ptr, which stores the
> > > > > > > > reg->map_ptr or reg->btf pointer of the register used for locking spin
> > > > > > > > lock. But the active_spin_lock_id also needs to be compared to ensure
> > > > > > > > whether bpf_spin_unlock is for the same register.
> > > > > > > >
> > > > > > > > Next, pseudo load instructions are not given a unique reg->id, as they
> > > > > > > > are doing lookup for the same map value (max_entries is never greater
> > > > > > > > than 1).
> > > > > > > >
> > > > > > >
> > > > > > > For libbpf-style "internal maps" - like .bss.private further in this series -
> > > > > > > all the SEC(".bss.private") vars are globbed together into one map_value. e.g.
> > > > > > >
> > > > > > >   struct bpf_spin_lock lock1 SEC(".bss.private");
> > > > > > >   struct bpf_spin_lock lock2 SEC(".bss.private");
> > > > > > >   ...
> > > > > > >   spin_lock(&lock1);
> > > > > > >   ...
> > > > > > >   spin_lock(&lock2);
> > > > > > >
> > > > > > > will result in same map but different offsets for the direct read (and different
> > > > > > > aux->map_off set in resolve_pseudo_ldimm64 for use in check_ld_imm). Seems like
> > > > > > > this patch would assign both same (active_spin_lock_ptr, active_spin_lock_id).
> > > > > > >
> > > > > >
> > > > > > That won't be a problem. Two spin locks in a map value or datasec are
> > > > > > already rejected on BPF_MAP_CREATE,
> > > > > > so there is no bug. See idx >= info_cnt check in
> > > > > > btf_find_struct_field, btf_find_datasec_var.
> > > > > >
> > > > > > I can include offset as the third part of the tuple. The problem then
> > > > > > is figuring out which lock protects which bpf_list_head. We need
> > > > > > another __guarded_by annotation and force users to use that to
> > > > > > eliminate the ambiguity. So for now I just put it in the commit log
> > > > > > and left it for the future.
> > > > >
> > > > > Let's not go that far yet.
> > > > > Extra annotations are just as confusing and non-obvious as
> > > > > putting locks in different sections.
> > > > > Let's keep one lock per map value limitation for now.
> > > > > libbpf side needs to allow many non-mappable sections though.
> > > > > Single bss.private name is too limiting.
> > > >
> > > > In that case,
> > > > Dave, since the libbpf patch is yours, would you be fine with
> > > > reworking it to support multiple private maps?
> > > > Maybe it can just ignore the .XXX part in .bss.private.XXX?
> > > > Also I think Andrii mentioned once that he wants to eventually merge
> > > > data and bss, so it might be a good idea to call it .data.private from
> > > > the start?
> > >
> > > I'd probably make all non-canonical names to be not-mmapable.
> > > The compiler generates special sections already.
> > > Thankfully the code doesn't use them, but it will sooner or later.
> > > So libbpf has to create hidden maps for them eventually.
> > > They shouldn't be messed up from user space, since it will screw up
> > > compiler generated code.
> > >
> > > Andrii, what's your take?
> >
> > Ok, a bunch of things to unpack. We've also discussed a lot of this
> > with Dave few weeks ago, but I have also few questions.
> >
> > First, I'd like to not keep extending ".bss" with any custom ".bss.*"
> > sections. This is why we have .data.* and .rodata.* and not .bss (bad,
> > meaningless, historic name).
> >
> > But I'm totally fine dedicating some other prefix to non-mmapable data
> > sections that won't be exposed in skeleton and, well, not-mmapable.
> > What to name it depends on what we anticipate putting in them?
> >
> > If it's just for spinlocks, then having something like SEC(".locks")
> > seems best to me. If it's for more stuff, like global kptrs, rbtrees
> > and whatnot, then we'd need a bit more generic name (.private, or
> > whatever, didn't think much on best name). We can also allow .locks.*
> > or .private.* (i.e., keep it uniform with .data and .rodata handling,
> > expect for mmapable aspect).
> >
> > One benefit for having SEC(".locks") just for spin_locks is that we
> > can teach libbpf to create a multi-element ARRAY map, where each lock
> > variable is put into a separate element. From BPF verifier's
> > perspective, there will be a single BTF type describing spin lock, but
> > multiple "instances" of lock, one per each element. That seems a bit
> > magical and I think, generally speaking, it's best to start supporting
> > multiple lock declarations within single map element (and thus keep
> > track of their offset within map_value); but at least that's an
> > option.
>
> ".lock" won't work. We need lock+rb_root or lock+list_head to be
> in the same section.
> It should be up to user to name that section with something meaningful.
> Ideally something like this should be supported:
> SEC("enqueue") struct bpf_spin_lock enqueue_lock;
> SEC("enqueue") struct bpf_list_head enqueue_head __contains(foo, node);
> SEC("dequeue") struct bpf_spin_lock dequeue_lock;
> SEC("dequeue") struct bpf_list_head dequeue_head __contains(foo, node);
>
> > Dave had some concerns about pinning such maps and whatnot, but for
> > starters we decided to not worry about pinning for now. Dave, please
> > bring up remaining issues, if you don't mind.
>
> Pinning shouldn't be an issue.
> Only mmap is the problem. User space access if fine since kernel
> will mask out special fields on read/write.
>
> > So to answer Alexei's specific option. I'm still not in favor of just
> > saying "anything that's not .data or .rodata is non-mmapable map". I'd
> > rather carve out naming prefixes with . (which are  reserved for
> > libbpf's own use) for these special purpose maps. I don't think that
> > limits anyone, right?
>
> Is backward compat a concern?
> Whether to mmap global data is a flag.
> It can be opt-in or opt-out.
> I'm proposing make all named section to be 'do not mmap'.
> If a section needs to be mmaped and appear in skeleton the user can do
> SEC("my_section.mmap")
>
> What you're proposing is to do the other way around:
> SEC("enqueue.nommap")
> SEC("dequeue.nommap")
> in the above example.
> I guess it's fine, but more verbose.

Well, I didn't propose to use suffixes. Currently user can define
SEC(".data.my_custom_use_case"). So I was proposing that we'll just
define a different *prefix*, like SEC(".private.enqueue") and
SEC(".private.dequeue") for your example above, which will be private
to BPF program, not mmap'ed, not exposed in skeleton.

mmap is a bit orthogonal to exposing in skeleton, you can still
imagine data section that will be allowed to be initialized from
user-space before load but never mmaped afterwards. Just to say that
.nommap doesn't necessarily imply that it shouldn't be in a skeleton.

So I still prefer special prefix (.private) and declare that this is
both non-mmapable and not-exposed in skeleton.

As for allowing any section. It just feels unnecessary and long-term
harmful to allow any section name at this point, tbh.

> The gut feeling is that the use case for naming section will be specifically
> for lock+rbtree. Everything else will go into common global .data or .rodata.
> Same thinking about compiler generated special sections with constants.
> They shouldn't be mmaped by default, but we're not going to hack llvm
> to add ".nommap" suffix to such sections.
> Hence the proposal to avoid mmap by default for all non standard sections.