On Sat, Apr 29, 2023 at 06:12:12PM +0800, Hou Tao wrote:
> +
> +static void notrace wait_gp_reuse_free(struct bpf_mem_cache *c, struct llist_node *llnode)
> +{
> +	unsigned long flags;
> +
> +	local_irq_save(flags);
> +	/* In case a NMI-context bpf program is also freeing object. */
> +	if (local_inc_return(&c->active) == 1) {
> +		bool try_queue_work = false;
> +
> +		/* kworker may remove elements from prepare_reuse_head */
> +		raw_spin_lock(&c->reuse_lock);
> +		if (llist_empty(&c->prepare_reuse_head))
> +			c->prepare_reuse_tail = llnode;
> +		__llist_add(llnode, &c->prepare_reuse_head);
> +		if (++c->prepare_reuse_cnt > c->high_watermark) {
> +			/* Zero out prepare_reuse_cnt early to prevent
> +			 * unnecessary queue_work().
> +			 */
> +			c->prepare_reuse_cnt = 0;
> +			try_queue_work = true;
> +		}
> +		raw_spin_unlock(&c->reuse_lock);
> +
> +		if (try_queue_work && !work_pending(&c->reuse_work)) {
> +			/* Use reuse_cb_in_progress to indicate there is
> +			 * inflight reuse kworker or reuse RCU callback.
> +			 */
> +			atomic_inc(&c->reuse_cb_in_progress);
> +			/* Already queued */
> +			if (!queue_work(bpf_ma_wq, &c->reuse_work))

As Martin pointed out, queue_work() is not safe here.
The earlier raw_spin_lock(&c->reuse_lock) is not safe either.

For the next version please drop the workers and the spin_lock from the
unit_free/alloc paths. If a lock has to be taken, it should be taken from
irq_work. Under no circumstances can we use alloc_workqueue(). No new
kthreads.

We can avoid adding a new flag to bpf_mem_alloc, to reduce the complexity,
and do the rough equivalent of REUSE_AFTER_RCU_GP unconditionally in the
following way:

- alloc_bulk() won't be trying to steal from c->free_by_rcu.
- do_call_rcu() does call_rcu(&c->rcu, __free_rcu) instead of the
  tasks-trace version.
- rcu_trace_implies_rcu_gp() is never used.
- after the RCU GP, __free_rcu() moves all waiting_for_gp elements into a
  size-specific linked list per bpf_mem_alloc (not per bpf_mem_cache, which
  is per-cpu) and does call_rcu_tasks_trace.
- Let's call this list ma->free_by_rcu_tasks_trace (only one list for a
  bpf_mem_alloc with a known size, or NUM_CACHES such lists when size == 0
  at init).
- alloc_bulk() on any cpu can steal from the size-specific
  ma->free_by_rcu_tasks_trace list, which is protected by ma->spin_lock
  (1 or NUM_CACHES such locks).
- ma->waiting_for_gp_tasks_trace will be freeing elements into slab.

(A rough sketch of this hand-off is at the end of this mail.)

What this means is that sleepable progs using a hashmap will be able to
avoid uaf with bpf_rcu_read_lock(). Without an explicit bpf_rcu_read_lock()
it's still safe and equivalent to the existing behavior of bpf_mem_alloc.
(Your proposed BPF_MA_FREE_AFTER_RCU_GP flavor, on the other hand, is not
safe to use in hashtab with sleepable progs.)

After that we can unconditionally remove rcu_head/call_rcu from bpf_cpumask
and improve the usability of bpf_obj_drop. The usage of bpf_mem_alloc in
local storage can probably be simplified as well.

Martin, wdyt?

I think this approach adds minimal complexity to bpf_mem_alloc while
solving all existing pain points, including the needs of qp-trie.
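
Roughly, the hand-off could look like the sketch below. It assumes a
bpf_mem_alloc with a known size (so a single list and lock). The
free_by_rcu_tasks_trace, waiting_for_gp_tasks_trace and spin_lock fields
come from the list above; the c->ma back pointer, the rcu_ttrace head, the
ttrace_in_progress flag and the plain kfree() are placeholders for
illustration, not a final implementation:

/* After both the regular and the tasks-trace GP: return what was not
 * stolen by alloc_bulk() back to slab.  Real code would go through a
 * free_one()-style helper to handle percpu objects; kfree() is enough
 * for the sketch.
 */
static void __free_rcu_tasks_trace(struct rcu_head *head)
{
	struct bpf_mem_alloc *ma = container_of(head, struct bpf_mem_alloc, rcu_ttrace);
	struct llist_node *llnode, *pos, *t;
	unsigned long flags;

	raw_spin_lock_irqsave(&ma->spin_lock, flags);
	llnode = __llist_del_all(&ma->waiting_for_gp_tasks_trace);
	ma->ttrace_in_progress = false;
	raw_spin_unlock_irqrestore(&ma->spin_lock, flags);

	llist_for_each_safe(pos, t, llnode)
		kfree(pos);
}

/* After the regular RCU GP: move the per-cpu waiting_for_gp elements to
 * the per-bpf_mem_alloc list where alloc_bulk() on any cpu can steal
 * them under ma->spin_lock, then kick off the tasks-trace GP for
 * whatever has accumulated there.
 */
static void __free_rcu(struct rcu_head *head)
{
	struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu);
	struct bpf_mem_alloc *ma = c->ma;	/* assumed back pointer */
	struct llist_node *llnode = llist_del_all(&c->waiting_for_gp);
	struct llist_node *pos, *t;
	bool do_call_ttrace = false;
	unsigned long flags;

	raw_spin_lock_irqsave(&ma->spin_lock, flags);
	llist_for_each_safe(pos, t, llnode)
		__llist_add(pos, &ma->free_by_rcu_tasks_trace);
	if (!ma->ttrace_in_progress) {
		/* Only one tasks-trace callback in flight per ma. */
		llnode = __llist_del_all(&ma->free_by_rcu_tasks_trace);
		llist_for_each_safe(pos, t, llnode)
			__llist_add(pos, &ma->waiting_for_gp_tasks_trace);
		ma->ttrace_in_progress = do_call_ttrace = true;
	}
	raw_spin_unlock_irqrestore(&ma->spin_lock, flags);

	if (do_call_ttrace)
		call_rcu_tasks_trace(&ma->rcu_ttrace, __free_rcu_tasks_trace);
}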