On Thu, 18 Aug 2022 at 02:40, Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Thu, Aug 18, 2022 at 01:51:39AM +0200, Kumar Kartikeya Dwivedi wrote:
> > > +
> > > +/* notrace is necessary here and in other functions to make sure
> > > + * bpf programs cannot attach to them and cause llist corruptions.
> > > + */
> > > +static void notrace *unit_alloc(struct bpf_mem_cache *c)
> > > +{
> > > +	bool in_nmi = bpf_in_nmi();
> > > +	struct llist_node *llnode;
> > > +	unsigned long flags;
> > > +	int cnt = 0;
> > > +
> > > +	if (unlikely(in_nmi)) {
> > > +		llnode = llist_del_first(&c->free_llist_nmi);
> > > +		if (llnode)
> > > +			cnt = atomic_dec_return(&c->free_cnt_nmi);
> > I am trying to understand which case this
> > atomic_dec_return/atomic_inc_return protects against in the
> > unit_alloc/unit_free for in_nmi branch. Is it protecting nested NMI
> > BPF prog interrupting NMI prog?
> >
> > In case of perf it seems we use bpf_prog_active,
> yes, but bpf_prog_active has plenty of downsides and hopefully
> will be replaced eventually with cleaner mechanism.
> Separate topic.

I see.

> > so nested NMI prog
> > won't be invoked while we are interrupted inside a BPF program in NMI
> > context. Which are the other cases that might cause reentrancy in this
> > branch such that we need atomics instead of c->free_cnt_nmi--? Or are
> > you anticipating you might allow this in the future even if it is
> > disallowed for now?
> >
> > If programs are allowed to stack like this, and we try to reason about
> > the safety of llist_del_first operation, the code is:
> >
> > struct llist_node *llist_del_first(struct llist_head *head)
> > {
> > 	struct llist_node *entry, *old_entry, *next;
> >
> > 	entry = smp_load_acquire(&head->first);
> > 	for (;;) {
> > 		if (entry == NULL)
> > 			return NULL;
> > 		old_entry = entry;
> > 		next = READ_ONCE(entry->next);
> > >>>>>>>> Suppose nested NMI comes at this point and BPF prog is invoked.
>
> llist_del_first is notrace.
> unit_alloc() above is also notrace. See comment before it.
> perf event overflow NMI can happen here, but for some other llist.
> Hence we're talking about NMI issues only here. fentry progs do not apply here.

I was not thinking about fentry progs either. I saw that perf overflow
progs can't nest, so I was wondering whether there were any other cases.
But you mention below that bpf_mem_refill in irq_work and a perf prog in
NMI can touch the same lockless list, so the scenario is still possible.

> > Assume the current nmi free llist is HEAD -> A -> B -> C -> D -> ...
> > For our cmpxchg, parameters are going to be cmpxchg(&head->first, A, B);
> >
> > Now, nested NMI prog does unit_alloc thrice. this does llist_del_first thrice
>
> Even double llist_del_first on the same llist is bad. That's a known fact.

Well, if you think about it (correct me if I'm wrong), at least in this
kind of nesting scenario on the same CPU, just doing llist_del_first in
the NMI prog that interrupts the llist_del_first of bpf_mem_refill isn't
a problem by itself: the cmpxchg will fail because head->first changed.
The problem occurs when you combine it with an llist_add between the
READ_ONCE(entry->next) and the cmpxchg of the interrupted
llist_del_first. The main invariant of llist_del_first is that
entry->next must not change between the READ_ONCE and the cmpxchg, but
if we construct an ABA scenario like I did in my previous reply, _then_
we have a problem. Otherwise it will just go through the retry loop when
the nested prog exits, e.g. if we llist_del_first and kptr_xchg the ptr
(which won't do llist_add).
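
Just to convince myself, I replayed that exact interleaving (the
scenario quoted above and continued right below) with a tiny userspace
mock-up. This is obviously not the kernel llist and has no cmpxchg, all
the names are made up; it is a single-threaded toy that performs the
same steps by hand, but it ends the same way, with C falling off the
list:

#include <stdio.h>

struct node { struct node *next; const char *name; };
struct lhead { struct node *first; };

/* toy stand-ins for llist_add() / llist_del_first(); no atomics, since
 * this replay is single-threaded and interleaves the steps by hand
 */
static void push(struct lhead *h, struct node *n)
{
	n->next = h->first;
	h->first = n;
}

static struct node *pop(struct lhead *h)
{
	struct node *n = h->first;

	if (n)
		h->first = n->next;
	return n;
}

static void dump(struct lhead *h)
{
	struct node *n;

	for (n = h->first; n; n = n->next)
		printf(" -> %s", n->name);
	printf("\n");
}

int main(void)
{
	struct node D = { NULL, "D" }, C = { &D, "C" }, B = { &C, "B" }, A = { &B, "A" };
	struct lhead h = { &A };

	/* interrupted llist_del_first(): entry = A, next = READ_ONCE(A->next) = B,
	 * then the nested NMI prog runs before the cmpxchg
	 */
	struct node *entry = h.first;
	struct node *next = entry->next;

	/* nested prog: unit_alloc() thrice, then unit_free() in order B, C, A */
	pop(&h); pop(&h); pop(&h);	/* HEAD -> D */
	push(&h, &B);			/* HEAD -> B -> D */
	push(&h, &C);			/* HEAD -> C -> B -> D */
	push(&h, &A);			/* HEAD -> A -> C -> B -> D */

	/* resumed cmpxchg(&head->first, A, B): head->first is A again, so it
	 * succeeds and installs the stale next, dropping C from the list
	 */
	if (h.first == entry)
		h.first = next;

	printf("alloc returned %s, list is now HEAD", entry->name);
	dump(&h);			/* alloc returned A, list is now HEAD -> B -> D */
	return 0;
}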
> > This makes nmi free llist HEAD -> D -> ...
> > A, B, C are allocated in prog.
> > Now it does unit_free of all three, but in order of B, C, A.
> > unit_free does llist_add, nmi free llist becomes HEAD -> A -> C -> B -> D -> ...
> >
> > Nested NMI prog exits.
> > We continue with our cmpxchg(&head->first, A, B); It succeeds, A is
> > returned, but C will be leaked.
> This exact scenario cannot happen for bpf_mem_cache's freelist.
> unit_alloc is doing llist_del_first on per-cpu freelist.
> We can have two perf_event bpf progs. Both progs would
> share the same hash map and use the same struct bpf_mem_alloc,
> and both call unit_alloc() on the same cpu freelist,
> but as you noticed bpf_prog_active covers that case.
> bpf_prog_active is too coarse as we discussed in the other thread a
> month or so ago. It prevents valid and safe execution of bpf progs, lost
> events, etc. We will surely come up with a better mechanism.
>
> Going back to your earlier question:
>
> > Which are the other cases that might cause reentrancy in this
> > branch such that we need atomics instead of c->free_cnt_nmi--?
>
> It's the case where perf_event bpf prog happened inside bpf_mem_refill in irq_work.
> bpf_mem_refill manipulates free_cnt_nmi and nmi bpf prog too through unit_alloc.
> Which got me thinking that there is indeed a missing check here.

Aaah, ok, so this is what you wanted to prevent. Makes sense: even
though NMI nesting won't happen in progs (at least for now), this
irq_work refilling can be interrupted by some perf NMI prog, or by a
raw_tp tracing prog in NMI context.

> We need to protect free_bulk_nmi's llist_del_first from unit_alloc's llist_del_first.
> bpf_prog_active could be used for that, but let's come up with a cleaner way.
> Probably going to add atomic_t flag to bpf_mem_cache and cmpxchg it,
> or lock and spin_trylock it. tbd.

Hm, can you explain why an atomic flag or lock would be needed, and why
a small busy counter like bpf_prog_active, but for the NMI free llist,
would not work? bpf_mem_cache is already per-CPU, so it can just be an
int sitting alongside the llist. You inc it before llist_del_first, and
then, assuming the inc is atomic across the interrupt boundary (which I
think this_cpu_inc_return for bpf_prog_active is already assuming), an
NMI prog will see the llist as busy and fail its llist_del_first.
llist_add should still be fine to allow. Technically we could fail
llist_add instead, since multiple llist_del_first by themselves aren't
an issue, but bpf_mem_free cannot be allowed to fail while bpf_mem_alloc
can, so it makes sense to protect only llist_del_first with the counter
(rough sketch at the end of this mail).

PS: Also, it would be great if you could add a few words about this
around this code, i.e. which contexts can nest, since those are what
this atomic_inc/dec and the llist_del_first protection are guarding
against.
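
Something like this is what I have in mind for the busy counter
(completely untested sketch just to show the idea; the 'active' field
and the helper below are made up, they are not in your patch):

struct bpf_mem_cache {
	/* ... existing fields ... */
	struct llist_head free_llist_nmi;
	atomic_t free_cnt_nmi;
	local_t active;	/* busy counter; bpf_mem_cache is per-cpu already */
};

/* would be used by both unit_alloc() and free_bulk_nmi() in place of a
 * bare llist_del_first() on the NMI freelist
 */
static struct llist_node *nmi_del_first(struct bpf_mem_cache *c)
{
	struct llist_node *llnode = NULL;

	/* If an NMI prog lands between the inc and the dec below, its own
	 * local_inc_return() sees a value > 1, so it skips llist_del_first()
	 * and that allocation simply fails. llist_add() from unit_free() is
	 * still allowed at any point.
	 */
	if (local_inc_return(&c->active) == 1)
		llnode = llist_del_first(&c->free_llist_nmi);
	local_dec(&c->active);
	return llnode;
}

I used local_t here so that the inc/dec is a single interrupt-safe RMW
on the owning CPU; a plain int would rely on the increment being a
single instruction, which is the same assumption we already make for
bpf_prog_active's this_cpu_inc_return.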