On 9/9/22 3:25 PM, Alexei Starovoitov wrote:
> On Fri, Sep 09, 2022 at 11:32:40AM -0700, Andrii Nakryiko wrote:
>> On Fri, Sep 9, 2022 at 7:58 AM Alexei Starovoitov
>> <alexei.starovoitov@xxxxxxxxx> wrote:
>>>
>>> On Fri, Sep 9, 2022 at 7:51 AM Kumar Kartikeya Dwivedi <memxor@xxxxxxxxx> wrote:
>>>>
>>>> On Fri, 9 Sept 2022 at 16:24, Alexei Starovoitov
>>>> <alexei.starovoitov@xxxxxxxxx> wrote:
>>>>>
>>>>> On Fri, Sep 9, 2022 at 4:05 AM Kumar Kartikeya Dwivedi <memxor@xxxxxxxxx> wrote:
>>>>>>
>>>>>> On Fri, 9 Sept 2022 at 10:13, Dave Marchevsky <davemarchevsky@xxxxxx> wrote:
>>>>>>>
>>>>>>> On 9/4/22 4:41 PM, Kumar Kartikeya Dwivedi wrote:
>>>>>>>> Global variables reside in maps accessible using direct_value_addr
>>>>>>>> callbacks, so giving each load instruction's rewrite a unique reg->id
>>>>>>>> prevents us from holding locks which are global.
>>>>>>>>
>>>>>>>> This is not great, so refactor active_spin_lock into two separate
>>>>>>>> fields, active_spin_lock_ptr and active_spin_lock_id, which is generic
>>>>>>>> enough to allow it for global variables, map lookups, and local kptr
>>>>>>>> registers at the same time.
>>>>>>>>
>>>>>>>> Held vs. non-held is indicated by active_spin_lock_ptr, which stores
>>>>>>>> the reg->map_ptr or reg->btf pointer of the register used for locking
>>>>>>>> the spin lock. But active_spin_lock_id also needs to be compared to
>>>>>>>> ensure that bpf_spin_unlock is for the same register.
>>>>>>>>
>>>>>>>> Next, pseudo load instructions are not given a unique reg->id, as they
>>>>>>>> are doing a lookup for the same map value (max_entries is never
>>>>>>>> greater than 1).
>>>>>>>>
>>>>>>>
>>>>>>> For libbpf-style "internal maps" - like .bss.private further in this
>>>>>>> series - all the SEC(".bss.private") vars are globbed together into one
>>>>>>> map_value. e.g.
>>>>>>>
>>>>>>> struct bpf_spin_lock lock1 SEC(".bss.private");
>>>>>>> struct bpf_spin_lock lock2 SEC(".bss.private");
>>>>>>> ...
>>>>>>> spin_lock(&lock1);
>>>>>>> ...
>>>>>>> spin_lock(&lock2);
>>>>>>>
>>>>>>> will result in the same map but different offsets for the direct read
>>>>>>> (and different aux->map_off set in resolve_pseudo_ldimm64 for use in
>>>>>>> check_ld_imm). Seems like this patch would assign both the same
>>>>>>> (active_spin_lock_ptr, active_spin_lock_id).
>>>>>>>
>>>>>>
>>>>>> That won't be a problem. Two spin locks in a map value or datasec are
>>>>>> already rejected on BPF_MAP_CREATE, so there is no bug. See the
>>>>>> idx >= info_cnt check in btf_find_struct_field and btf_find_datasec_var.
>>>>>>
>>>>>> I could include the offset as the third part of the tuple. The problem
>>>>>> then is figuring out which lock protects which bpf_list_head. We would
>>>>>> need another __guarded_by annotation and force users to use it to
>>>>>> eliminate the ambiguity. So for now I just noted it in the commit log
>>>>>> and left it for the future.
>>>>>
>>>>> Let's not go that far yet.
>>>>> Extra annotations are just as confusing and non-obvious as
>>>>> putting locks in different sections.
>>>>> Let's keep the one-lock-per-map-value limitation for now.
>>>>> The libbpf side needs to allow many non-mmapable sections though.
>>>>> A single .bss.private name is too limiting.
>>>>
>>>> In that case, Dave, since the libbpf patch is yours, would you be fine
>>>> with reworking it to support multiple private maps? Maybe it can just
>>>> ignore the .XXX part in .bss.private.XXX?
>>>> Also, I think Andrii once mentioned that he wants to eventually merge
>>>> data and bss, so it might be a good idea to call it .data.private from
>>>> the start.
>>>
>>> I'd probably make all non-canonical names non-mmapable.
>>> The compiler already generates special sections.
>>> Thankfully the code doesn't use them yet, but it will sooner or later,
>>> so libbpf has to create hidden maps for them eventually.
>>> They shouldn't be messed with from user space, since that would screw up
>>> compiler-generated code.
>>>
>>> Andrii, what's your take?
>>
>> Ok, a bunch of things to unpack. We also discussed a lot of this with
>> Dave a few weeks ago, but I still have a few questions.
>>
>> First, I'd like to not keep extending ".bss" with custom ".bss.*"
>> sections. This is why we have .data.* and .rodata.* but no .bss.* (a bad,
>> meaningless, historic name).
>>
>> But I'm totally fine dedicating some other prefix to non-mmapable data
>> sections that won't be exposed in the skeleton and, well, won't be
>> mmapable. What to name it depends on what we anticipate putting in them.
>>
>> If it's just for spin locks, then something like SEC(".locks") seems
>> best to me. If it's for more stuff, like global kptrs, rbtrees, and
>> whatnot, then we'd need a more generic name (.private, or whatever; I
>> didn't think much about the best name). We can also allow .locks.* or
>> .private.* (i.e., keep it uniform with the .data and .rodata handling,
>> except for the mmapable aspect).
>>
>> One benefit of having SEC(".locks") just for spin locks is that we can
>> teach libbpf to create a multi-element ARRAY map, where each lock
>> variable is put into a separate element. From the BPF verifier's
>> perspective, there will be a single BTF type describing the spin lock,
>> but multiple "instances" of the lock, one per element. That seems a bit
>> magical, and I think, generally speaking, it's best to start supporting
>> multiple lock declarations within a single map element (and thus keep
>> track of their offsets within map_value); but at least that's an option.
>
> ".locks" won't work. We need lock+rb_root or lock+list_head to be
> in the same section.
> It should be up to the user to name that section with something
> meaningful.
> Ideally something like this should be supported:
> SEC("enqueue") struct bpf_spin_lock enqueue_lock;
> SEC("enqueue") struct bpf_list_head enqueue_head __contains(foo, node);
> SEC("dequeue") struct bpf_spin_lock dequeue_lock;
> SEC("dequeue") struct bpf_list_head dequeue_head __contains(foo, node);

Is the "head and lock must be in the same section / map_value" requirement
desired, or just a consequence of this implementation? I don't see why it's
desirable from the user's perspective. It seems to have the same problem as
rbtree RFCv1's rbtree_map struct creating its own bpf_spin_lock, namely not
providing a way for multiple data structures to share the same lock in a way
that makes sense to the verifier for enforcement.

>> Dave had some concerns about pinning such maps and whatnot, but for
>> starters we decided to not worry about pinning for now. Dave, please
>> bring up the remaining issues, if you don't mind.

@Andrii, aside from the vague pinning concerns from our last discussion about
this, I don't have any specific concerns. A multi-element ".locks" is actually
more appealing to me now, as I think it enables the best of both worlds for
this implementation and my rbtree RFCv2 experiments:

* This series uses (map_ptr, map_value_offset) as lock identity for
  verification purposes and expects the map_ptr for list_head and lock to be
  the same.
  * If my logic in the comment preceding this one is correct, the downside is
    no lock sharing between data structures.
* rbtree RFCv2 uses the lock address as lock identity for verification
  purposes and requires the lock address to be known when verifying the
  program using the lock.
  * Downside: no clear path forward for the general map_in_map case; it can be
    made to work for some specific cases, but it's kludgey.
* If ".locks" exists, supporting multiple lock definitions, we can use
  locks_sec_offset or locks_sec_map_{key,idx} as lock identity for
  verification purposes.
  * As a result, the "head and lock must be in the same section" requirement
    is removed, and there's a path forward for map_in_map inner maps to share
    locks arbitrarily without losing verifiability.
  * But I suspect this requires some special handling of the map backing
    ".locks" on the kernel side.

I have some hacks on top of rbtree RFCv2 that are moving in this ".locks"
direction; happy to fix them up and send something if I didn't miss anything
above. Regardless, @Kumar, I'm happy to iterate on the .bss.private patch
until it's in a shape that satisfies everyone.

> Pinning shouldn't be an issue.
> Only mmap is the problem. User space access is fine since the kernel
> will mask out special fields on read/write.
>
>> So to answer Alexei's specific question: I'm still not in favor of just
>> saying "anything that's not .data or .rodata is a non-mmapable map". I'd
>> rather carve out naming prefixes starting with . (which are reserved for
>> libbpf's own use) for these special-purpose maps. I don't think that
>> limits anyone, right?
>
> Is backward compat a concern?
> Whether to mmap global data is a flag.
> It can be opt-in or opt-out.
> I'm proposing to make all named sections 'do not mmap'.
> If a section needs to be mmaped and appear in the skeleton, the user can
> do SEC("my_section.mmap").
>
> What you're proposing is the other way around:
> SEC("enqueue.nommap")
> SEC("dequeue.nommap")
> in the above example.
> I guess it's fine, but more verbose.
> My gut feeling is that the use case for naming sections will be
> specifically lock+rbtree. Everything else will go into the common global
> .data or .rodata.
> The same thinking applies to compiler-generated special sections with
> constants. They shouldn't be mmaped by default, but we're not going to
> hack llvm to add a ".nommap" suffix to such sections.
> Hence the proposal to avoid mmap by default for all non-standard sections.