Re: [PATCH v2 bpf-next 01/12] bpf: Introduce any context BPF specific memory allocator.

On Thu, 18 Aug 2022 at 02:40, Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Thu, Aug 18, 2022 at 01:51:39AM +0200, Kumar Kartikeya Dwivedi wrote:
> > > +
> > > +/* notrace is necessary here and in other functions to make sure
> > > + * bpf programs cannot attach to them and cause llist corruptions.
> > > + */
> > > +static void notrace *unit_alloc(struct bpf_mem_cache *c)
> > > +{
> > > +       bool in_nmi = bpf_in_nmi();
> > > +       struct llist_node *llnode;
> > > +       unsigned long flags;
> > > +       int cnt = 0;
> > > +
> > > +       if (unlikely(in_nmi)) {
> > > +               llnode = llist_del_first(&c->free_llist_nmi);
> > > +               if (llnode)
> > > +                       cnt = atomic_dec_return(&c->free_cnt_nmi);
> >
> > I am trying to understand which case this
> > atomic_dec_return/atomic_inc_return protects against in the
> > unit_alloc/unit_free for in_nmi branch. Is it protecting nested NMI
> > BPF prog interrupting NMI prog?
> >
> > In case of perf it seems we use bpf_prog_active,
>
> yes, but bpf_prog_active has plenty of downsides and hopefully
> will be replaced eventually with cleaner mechanism.
> Separate topic.

I see.

>
> > so nested NMI prog
> > won't be invoked while we are interrupted inside a BPF program in NMI
> > context. Which are the other cases that might cause reentrancy in this
> > branch such that we need atomics instead of c->free_cnt_nmi--? Or are
> > you anticipating you might allow this in the future even if it is
> > disallowed for now?
> >
> > If programs are allowed to stack like this, and we try to reason about
> > the safety of llist_del_first operation, the code is:
> >
> > struct llist_node *llist_del_first(struct llist_head *head)
> > {
> >      struct llist_node *entry, *old_entry, *next;
> >
> >      entry = smp_load_acquire(&head->first);
> >      for (;;) {
> >          if (entry == NULL)
> >              return NULL;
> >          old_entry = entry;
> >          next = READ_ONCE(entry->next);
> > >>>>>>>> Suppose nested NMI comes at this point and BPF prog is invoked.
>
> llist_del_first is notrace.
> unit_alloc() above is also notrace. See comment before it.
> perf event overflow NMI can happen here, but for some other llist.
> Hence we're talking about NMI issues only here. fentry progs do not apply here.

I was not thinking about fentry progs either. I saw that perf
overflow progs can't nest, so I was wondering whether there were any
other cases. But as you mention below, bpf_mem_refill in irq_work and
a perf prog in NMI can touch the same lockless list, so the scenario
is still possible.

>
> > Assume the current nmi free llist is HEAD -> A -> B -> C -> D -> ...
> > For our cmpxchg, parameters are going to be cmpxchg(&head->first, A, B);
> >
> > Now, nested NMI prog does unit_alloc thrice. this does llist_del_first thrice
>
> Even double llist_del_first on the same llist is bad. That's a known fact.

Well, if you think about it (correct me if I'm wrong), at least in
this kind of nesting scenario on the same CPU, an llist_del_first in
the NMI prog interrupting the llist_del_first of bpf_mem_refill isn't
a problem by itself: the interrupted cmpxchg will fail because
head->first has changed. The problem only appears when you combine it
with an llist_add between the READ_ONCE(entry->next) and the cmpxchg
of the interrupted llist_del_first. The main invariant of
llist_del_first is that entry->next must not change between the
READ_ONCE and the cmpxchg, so only if we construct an ABA scenario
like the one in my previous reply do we have a problem. Otherwise the
interrupted llist_del_first just goes around its retry loop, e.g. when
the NMI prog only does llist_del_first and kptr_xchg's the ptr away
(which doesn't do an llist_add).
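
To spell out the interleaving I mean (same CPU; the list contents
HEAD -> A -> B -> ... are just for illustration):

  free_bulk_nmi (irq_work)                 NMI prog
  ------------------------                 --------
  entry = A
  next  = READ_ONCE(A->next)    /* B */
                                           llist_del_first() -> A
                                           kptr_xchg(A), no llist_add
  cmpxchg(&head->first, A, B)
    -> fails, head->first is B now,
       so we go around the retry loop

It is only when llist_adds in the NMI prog bring A back to the head
while A->next is no longer B (the ABA case from my previous reply)
that the resumed cmpxchg succeeds and installs a stale next pointer,
corrupting the list.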

>
> > This makes nmi free llist HEAD -> D -> ...
> > A, B, C are allocated in prog.
> > Now it does unit_free of all three, but in order of B, C, A.
> > unit_free does llist_add, nmi free llist becomes HEAD -> A -> C -> B -> D -> ...
> >
> > Nested NMI prog exits.
> > We continue with our cmpxchg(&head->first, A, B); It succeeds, A is
> > returned, but C will be leaked.
>
> This exact scenario cannot happen for bpf_mem_cache's freelist.
> unit_alloc is doing llist_del_first on per-cpu freelist.
> We can have two perf_event bpf progs. Both progs would
> share the same hash map and use the same struct bpf_mem_alloc,
> and both call unit_alloc() on the same cpu freelist,
> but as you noticed bpf_prog_active covers that case.
> bpf_prog_active is too coarse as we discussed in the other thread a
> month or so ago. It prevents valid and safe execution of bpf progs, lost
> events, etc. We will surely come up with a better mechanism.
>
> Going back to your earlier question:
>
> > Which are the other cases that might cause reentrancy in this
> > branch such that we need atomics instead of c->free_cnt_nmi--?
>
> It's the case where perf_event bpf prog happened inside bpf_mem_refill in irq_work.
> bpf_mem_refill manipulates free_cnt_nmi and nmi bpf prog too through unit_alloc.
> Which got me thinking that there is indeed a missing check here.

Aaah, ok, so this is what you wanted to prevent. Makes sense: even
though NMI nesting won't happen in progs (at least for now), this
irq_work refilling can be interrupted by some perf NMI prog, or by a
raw_tp tracing prog running in NMI context.

> We need to protect free_bulk_nmi's llist_del_first from unit_alloc's llist_del_first.
> bpf_prog_active could be used for that, but let's come up with a cleaner way.
> Probably going to add atomic_t flag to bpf_mem_cache and cmpxchg it,
> or lock and spin_trylock it. tbd.

Hm, can you explain why an atomic flag or lock would be needed,
rather than a small busy counter like bpf_prog_active just for the NMI
free llist? bpf_mem_cache is already per-CPU, so it can just be an int
alongside the llist. You inc it before llist_del_first, and then,
assuming the inc is atomic across the interrupt boundary (which
this_cpu_inc_return for bpf_prog_active already assumes), an NMI prog
will see the llist as busy and fail its llist_del_first.
llist_add should still be fine to allow.

Technically we could fail llist_add instead, since multiple
llist_del_first on their own aren't a problem, but bpf_mem_free cannot
be allowed to fail while bpf_mem_alloc can, so it makes sense to
protect only llist_del_first with the counter.
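
Roughly what I have in mind, just a sketch and not a real patch: the
alloc_active field and the unit_alloc_nmi() name are made up, and I'm
leaving out the low watermark / refill logic that unit_alloc does. I
wrote it with a local_t so the inc/dec are interrupt-safe on the local
CPU, which is the same assumption this_cpu_inc_return for
bpf_prog_active relies on.

static void notrace *unit_alloc_nmi(struct bpf_mem_cache *c)
{
	struct llist_node *llnode = NULL;

	if (local_inc_return(&c->alloc_active) == 1) {
		/* Nobody else on this CPU is inside llist_del_first()
		 * on free_llist_nmi, so popping is safe.
		 */
		llnode = llist_del_first(&c->free_llist_nmi);
		if (llnode)
			atomic_dec(&c->free_cnt_nmi);
	}
	/* else: we interrupted free_bulk_nmi (or another prog) in the
	 * middle of llist_del_first on this CPU; fail the allocation
	 * instead of racing with it.
	 */
	local_dec(&c->alloc_active);
	return llnode;
}

free_bulk_nmi's llist_del_first would be wrapped in the same
local_inc_return() check, while unit_free's llist_add stays unguarded.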

PS: Also, it would be great if you could add a few words about this
around this code, i.e. which contexts can nest here, and which
problems the atomic_inc/dec and the llist_del_first protection are
meant to prevent.


