Re: [PATCH bpf-next v2 00/26] Resilient Queued Spin Lock

On Thu, Feb 13, 2025 at 1:59 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Tue, Feb 11, 2025 at 10:33:00AM -0800, Alexei Starovoitov wrote:
>
> > Ohh. No unpriv here.
> > Since spectre was discovered unpriv bpf died.
> > BPF_UNPRIV_DEFAULT_OFF=y was the default for distros and
> > all hyperscalers for quite some time.
>
> Ah, okay. Time to remove the option then?

Good point. Indeed.
Will accept the patch if anyone has cycles to prep and test it.

> > > So much details not clear to me and not explained either :/
> >
> > Yes. The plan is to "kill" bpf prog when it misbehaves.
> > But this is orthogonal to this res_spin_lock set which is
> > a building block.
> >
> > > Right, but it might have already modified things, how are you going to
> > > recover from that?
> >
> > Tracking resources acquisition and release by the bpf prog
> > is a normal verifier job.
> > When bpf prog does bpf_rcu_read_lock() the verifier makes sure
> > that all execution paths from there on have bpf_rcu_read_unlock()
> > before program reaches the exit.
> > Same thing with locks.
>
> Ah, okay, this wasn't stated anywhere. This is rather crucial
> information.

This is kinda verifier 101. I don't think it needs to be in the log.
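
To spell it out with a trivial sketch (hand-written here, not taken
from the selftests): the verifier walks every path and rejects a
prog that can reach exit while still inside the critical section.

/* sketch only; bpf_rcu_read_lock/unlock are the existing kfuncs */
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

extern void bpf_rcu_read_lock(void) __ksym;
extern void bpf_rcu_read_unlock(void) __ksym;

SEC("tc")
int rcu_leak(struct __sk_buff *skb)
{
        bpf_rcu_read_lock();
        if (skb->len < 64)
                return 0;       /* rejected: this path exits inside the RCU CS */
        bpf_rcu_read_unlock();
        return 0;
}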

> > We definitely don't want to bpf core to keep track of acquired resources.
> > That just doesn't scale.
> > There could be rcu_read_locks, all kinds of refcounted objects,
> > locks taken, and so on.
> > The verifier makes sure that the program does the release no matter
> > what the execution path.
> > That's how it scales.
> > On my devserver I have 152 bpf programs running.
> > All of them keep acquiring and releasing resources (locks, sockets,
> > memory) million times a second.
> > The verifier checks that each prog is doing its job individually.
>
> Well, this patch set tracks the held lock stack -- which is required in
> order to do the deadlock thing after all.

Right, but the held lock set is a per-cpu global table and it is
not exhaustive. It cannot detect 3-lock circles _by design_.
We rely on a timeout for such extreme cases.
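
To illustrate what the per-cpu table can and cannot see (a rough
sketch, not the detection code from the series): AA and ABBA are
pairwise properties, a 3-lock circle is not.

#define RES_NR_HELD 32          /* placeholder; the real constant is in the patch */

struct rqspinlock_held {
        int cnt;
        void *locks[RES_NR_HELD];
};

/* AA: this cpu already holds the lock it is about to spin on */
static bool held_by_this_cpu(struct rqspinlock_held *h, void *lock)
{
        for (int i = 0; i < h->cnt && i < RES_NR_HELD; i++)
                if (h->locks[i] == lock)
                        return true;
        return false;
}

/* ABBA is a pairwise scan: the owner of the lock I want is waiting
 * on a lock that sits in my table.  An A->B->C->A circle spread
 * across three cpus never produces such a pair, hence the timeout.
 */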

> > The bpf infra does static checks only.
> > The core doesn't track objects at run-time.
> > The only exceptions are map elements.
> > bpf prog might store an acquired object in a map.
> > Only in that case bpf infra will free that object when it frees
> > the whole map.
> > But that doesn't apply to short lived things like RCU CS and
> > locks. Those cannot last long. They must complete within single
> > execution of the prog.
>
> Right. Held lock stack is like that.

They're not equivalent and are not used for correctness.
See patch 26 and the res_spin_lock_test_held_lock_max() selftest
that was added specifically to overwhelm this table:

+struct rqspinlock_held {
+   int cnt;
+   void *locks[RES_NR_HELD];
+};

It's an impossible case in reality, but the res_spin_lock
code should be prepared for extreme cases like that.

Just like the existing qspinlock has 4 percpu qnodes and a
test-and-set fallback for the "if (unlikely(idx >= MAX_NODES))"
case at qspinlock.c:413.
Can it happen in practice? Probably never.
But the code has to be ready to handle it.
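
Continuing the sketch above (illustrative only, not the patch's
exact code), the overflow handling can be as simple as: stop
recording, keep counting, and let the timeout be the safety net.

/* Same spirit as the qspinlock test-and-set fallback: the lock is
 * still taken, deadlock detection just loses visibility of it.
 */
static void record_held_lock(struct rqspinlock_held *h, void *lock)
{
        if (h->cnt < RES_NR_HELD)
                h->locks[h->cnt] = lock;
        h->cnt++;               /* past RES_NR_HELD only the count grows */
}

static void forget_held_lock(struct rqspinlock_held *h)
{
        h->cnt--;               /* sketch assumes release in reverse order */
}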

> > > > That was a conscious trade-off. Deadlocks are not normal.
> > >
> > > I really do think you should assume they are normal, unpriv and all
> > > that.
> >
> > No unpriv and no, we don't want deadlocks to be considered normal
> > by bpf users. They need to hear "fix your broken prog" message loud
> > and clear. Patch 14 splat is a step in that direction.
> > Currently it's only for in-kernel res_spin_lock() usage
> > (like in bpf hashtab). Eventually we will deliver the message to users
> > without polluting dmesg. Still debating the actual mechanism.
>
> OK; how is the user supposed to handle locking two hash buckets? Does
> the BPF prog create some global lock to serialize the multi bucket case?

Not following.
Are you talking about patch 19 where we convert the per-bucket
raw_spinlock_t in the bpf hashmap to rqspinlock_t?
Only one bucket lock is held at a time by the map update code,
but due to reentrancy and crazy kprobes in the wrong places
two bucket locks of a single map can be held on the same cpu.

bpf_prog_A -> bpf_map_update -> res_spin_lock(bucket_A)
  -> kprobe or tracepoint
    -> bpf_prog_B -> bpf_map_update -> res_spin_lock(bucket_B)

and that's why we currently have:
if (__this_cpu_inc_return(*(htab->map_locked[hash])) ...
    return -EBUSY;

... a workaround to prevent the most obvious AA deadlock,
but it's not enough.
People were able to hit ABBA.
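
For reference, the shape of that guard, heavily simplified (not
verbatim hashtab.c): each cpu counts the bucket locks it thinks it
holds for this map, and a nested attempt on the same cpu bails out
instead of spinning.

        preempt_disable();
        if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
                __this_cpu_dec(*(htab->map_locked[hash]));
                preempt_enable();
                return -EBUSY;  /* bpf_prog_B above sees a failed update */
        }
        raw_spin_lock_irqsave(&b->raw_lock, flags);
        /* ... update the bucket ... */

It only catches re-entry that hits the same per-cpu counter, which
is why the ABBA cases mentioned above could still slip through.
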
Note, raw_spin_lock today (and res_spin_lock after patch 19) is
used by proper kernel code in kernel/bpf/hashtab.c.
bpf prog just calls bpf_map_update() which is a normal
helper call from the verifier point of view.
It doesn't know whether there are locks inside or not.

The bpf_ktime_get_ns() helper is similar.
The verifier knows that it's NMI-safe,
but it doesn't care what kind of locks are inside.

> Anyway, I wonder. Since the verifier tracks all this, it can determine
> lock order for the prog. Can't it do what lockdep does and maintain lock
> order graph of all loaded BPF programs?
>
> This is load-time overhead, rather than runtime.

I wish it were possible. Locks are dynamic. They protect
dynamically allocated objects, so the order cannot be statically
verified. We pushed the limits of static analysis a lot.
Maybe too much.
For example,
the verifier can statically validate the following code:
        struct node_data *n, *m, *o;
        struct bpf_rb_node *res, *res2;

        // here we allocate an object of type known to the verifier
        n = bpf_obj_new(typeof(*n));
        if (!n)
                return 1;
        n->key = 41;
        n->data = 42;

        // here the verifier knows that the glock spin_lock
        // protects the rbtree groot
        bpf_spin_lock(&glock);

        // here it checks that the lock is held and that the type of
        // objects in the rbtree matches the type of 'n'
        bpf_rbtree_add(&groot, &n->node, less);
        bpf_spin_unlock(&glock);

and all kinds of other more complex stuff,
but it is not enough to cover necessary algorithms.

Here is an example from real code that shows
why we cannot verify two held locks:

struct bpf_vqueue {
        struct bpf_spin_lock lock;
        int credit;
        unsigned long long lasttime;
        unsigned int rate;
};

struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, ...);
        __type(key, int);
        __type(value, struct bpf_vqueue);
} vqueue SEC(".maps");

        q = bpf_map_lookup_elem(&vqueue, &key);
        if (!q)
                goto err;
        curtime = bpf_ktime_get_ns();
        bpf_spin_lock(&q->lock);
        q->lasttime = curtime;
        q->credit -= ...;
        credit = q->credit;
        bpf_spin_unlock(&q->lock);

the above is safe, but if there are two lookups:

q1 = bpf_map_lookup_elem(&vqueue, &key1);
q2 = bpf_map_lookup_elem(&vqueue, &key2);

both q1 and q2 will point to two different locks,
and since the keys are dynamic there is no way to know
the order of q1->lock vs q2->lock.
So we allow only one lock at a time, with
bare minimal operations while holding it,
but that's not enough to do any meaningful work.

The top feature request is to allow calls
while holding locks (currently they're disallowed;
e.g. the bpf_ktime_get_ns() above cannot be called
while holding the lock)
and to allow grabbing more than one lock.
That's what res_spin_lock() is achieving.
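
To make the end goal concrete, reusing q1 and q2 from the example
above (a sketch; the exact prog-facing API and error codes are the
subject of the later patches): a failed acquisition returns an
error instead of hanging, so nesting becomes survivable.

        if (res_spin_lock(&q1->lock))
                goto err;                       /* would-be deadlock or timeout */
        if (res_spin_lock(&q2->lock)) {
                res_spin_unlock(&q1->lock);
                goto err;                       /* e.g. another cpu holds q2 and wants q1 */
        }
        /* meaningful work under both locks, helper calls included */
        q1->credit -= 10;
        q2->credit += 10;
        res_spin_unlock(&q2->lock);
        res_spin_unlock(&q1->lock);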

Having said all that, I think the discussion is diverging into
all-things-bpf instead of focusing on res_spin_lock.

Just to make it clear... there is a patch 18:

 F: kernel/bpf/
 F: kernel/trace/bpf_trace.c
 F: lib/buildid.c
+F: arch/*/include/asm/rqspinlock.h
+F: include/asm-generic/rqspinlock.h
+F: kernel/locking/rqspinlock.c
 F: lib/test_bpf.c
 F: net/bpf/

that adds maintainer entries to BPF scope.

We're not asking locking experts to maintain this new res_spin_lock.
It's not a generic kernel infra.
It will only be used by bpf infra and by bpf progs.
We will maintain it and we will fix whatever bugs
we introduce.

We can place it in kernel/bpf/rqspinlock.c
to make things more obvious,
but kernel/locking/ feels a bit cleaner.

We're not asking to review patches 14 and higher.
They are presented for completeness.
(Patch 17 was out of order; it will be moved earlier in the series. Sorry about that.)

But feedback on patches 1-13 is welcome,
like the way you spotted the broken smp_cond_load_acquire()
on arm64 due to WFE.
That was a great catch. We really appreciate it.




