Re: [RFC bpf-next v3 3/6] bpf: Introduce BPF_MA_REUSE_AFTER_RCU_GP

On Thu, May 04, 2023 at 09:35:17AM +0800, Hou Tao wrote:
> Hi,
> 
> On 5/4/2023 2:48 AM, Alexei Starovoitov wrote:
> > On Sat, Apr 29, 2023 at 06:12:12PM +0800, Hou Tao wrote:
> >> +
> >> +static void notrace wait_gp_reuse_free(struct bpf_mem_cache *c, struct llist_node *llnode)
> >> +{
> >> +	unsigned long flags;
> >> +
> >> +	local_irq_save(flags);
> >> +	/* In case a NMI-context bpf program is also freeing object. */
> >> +	if (local_inc_return(&c->active) == 1) {
> >> +		bool try_queue_work = false;
> >> +
> >> +		/* kworker may remove elements from prepare_reuse_head */
> >> +		raw_spin_lock(&c->reuse_lock);
> >> +		if (llist_empty(&c->prepare_reuse_head))
> >> +			c->prepare_reuse_tail = llnode;
> >> +		__llist_add(llnode, &c->prepare_reuse_head);
> >> +		if (++c->prepare_reuse_cnt > c->high_watermark) {
> >> +			/* Zero out prepare_reuse_cnt early to prevent
> >> +			 * unnecessary queue_work().
> >> +			 */
> >> +			c->prepare_reuse_cnt = 0;
> >> +			try_queue_work = true;
> >> +		}
> >> +		raw_spin_unlock(&c->reuse_lock);
> >> +
> >> +		if (try_queue_work && !work_pending(&c->reuse_work)) {
> >> +			/* Use reuse_cb_in_progress to indicate there is
> >> +			 * inflight reuse kworker or reuse RCU callback.
> >> +			 */
> >> +			atomic_inc(&c->reuse_cb_in_progress);
> >> +			/* Already queued */
> >> +			if (!queue_work(bpf_ma_wq, &c->reuse_work))
> > As Martin pointed out queue_work() is not safe here.
> > The raw_spin_lock(&c->reuse_lock); earlier is not safe either.
> I see. I hadn't noticed these problems.
> > For the next version please drop workers and spin_lock from unit_free/alloc paths.
> > If lock has to be taken it should be done from irq_work.
> > Under no circumstances we can use alloc_workqueue(). No new kthreads.
> Is there any reason to prohibit the use of new kthread in irq_work ?

Because:
1. there is a workable solution without kthreads.
2. even if there were no solution, we would have to come up with one;
   a kthread is not the answer.
It's hard to reason about a setup where kthreads are in the critical path,
because of the scheduler. Assume the system is 100% cpu loaded: kthread
delays and behavior become unpredictable. We cannot subject memory
alloc/free to that.
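
To illustrate the "do it from irq_work" suggestion: a minimal sketch, with
made-up names (reuse_irq_work, wait_for_gp_head, unit_free_defer) that are
NOT from the patch. The free path only does an NMI-safe llist_add() plus
irq_work_queue(); the raw spinlock is taken in the irq_work handler, which
runs in hard-irq context on the same cpu. No kworker, no kthread.

static void bpf_ma_reuse_irq_workfn(struct irq_work *work)
{
	struct bpf_mem_cache *c = container_of(work, struct bpf_mem_cache,
					       reuse_irq_work);
	struct llist_node *llnode, *pos, *t;

	/* Grab everything queued so far; producers only did llist_add(). */
	llnode = llist_del_all(&c->prepare_reuse_head);

	/* irq_work runs in hard-irq context, never from NMI, so taking a
	 * raw spinlock here is fine (unlike in unit_free() itself).
	 */
	raw_spin_lock(&c->reuse_lock);
	llist_for_each_safe(pos, t, llnode)
		__llist_add(pos, &c->wait_for_gp_head); /* assumed field */
	raw_spin_unlock(&c->reuse_lock);
	/* ... then kick call_rcu() on this batch; omitted in this sketch ... */
}

static void unit_free_defer(struct bpf_mem_cache *c, struct llist_node *llnode)
{
	/* NMI-safe path: lockless list add plus irq_work_queue();
	 * no locks, no queue_work().
	 */
	llist_add(llnode, &c->prepare_reuse_head);
	irq_work_queue(&c->reuse_irq_work);
}

/* setup (sketch): init_irq_work(&c->reuse_irq_work, bpf_ma_reuse_irq_workfn); */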

> >
> > We can avoid adding a new flag to bpf_mem_alloc to reduce the complexity
> > and do the rough equivalent of REUSE_AFTER_RCU_GP unconditionally in the following way:
> >
> > - alloc_bulk() won't be trying to steal from c->free_by_rcu.
> >
> > - do_call_rcu() does call_rcu(&c->rcu, __free_rcu) instead of the tasks-trace version.
> Not sure whether or not one inflight RCU callback is enough. Will check.
> If one is not enough, I may use kmalloc(__GFP_NOWAIT) in the irq_work to
> allocate multiple RCU callbacks.

Pls don't. Just assume it will work, implement the proposal (if you agree),
come back with the numbers, and then we will discuss again.
We cannot keep arguing about the merits of a complicated patch set that was based on partial data.
Just like the whole thing with kthreads.
I requested early on: "pls no kthreads", and weeks later we're still arguing.

> > - rcu_trace_implies_rcu_gp() is never used.
> >
> > - after the RCU GP, __free_rcu() moves all waiting_for_gp elements into
> >   a size-specific linked list per bpf_mem_alloc (not per bpf_mem_cache, which is per-cpu)
> >   and does call_rcu_tasks_trace
> >
> > - Let's call this list ma->free_by_rcu_tasks_trace
> >   (only one list for bpf_mem_alloc with known size or NUM_CACHES such lists when size == 0 at init)
> >
> > - any cpu alloc_bulk() can steal from size specific ma->free_by_rcu_tasks_trace list that
> >   is protected by ma->spin_lock (1 or NUM_CACHES such locks)
> To reduce the lock contention, alloc_bulk() can steal from the global
> list in batches.

Pls no special batches. The simplest implementation possible.
alloc_bulk() has an 'int cnt' argument. It will try to steal 'cnt' elements from ma->free_by_rcu_tasks_trace.
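
A minimal sketch of that stealing path, assuming the field names used above
(ma->free_by_rcu_tasks_trace, ma->spin_lock) plus a helper name and the
c->free_llist refill that are made up here; free_cnt accounting and the
fallback allocation are omitted:

static int steal_from_free_by_rcu_tasks_trace(struct bpf_mem_alloc *ma,
					      struct bpf_mem_cache *c, int cnt)
{
	struct llist_node *llnode;
	unsigned long flags;
	int i;

	raw_spin_lock_irqsave(&ma->spin_lock, flags);
	for (i = 0; i < cnt; i++) {
		llnode = __llist_del_first(&ma->free_by_rcu_tasks_trace);
		if (!llnode)
			break;
		/* Refill this cpu's cache; free_cnt bookkeeping omitted. */
		__llist_add(llnode, &c->free_llist);
	}
	raw_spin_unlock_irqrestore(&ma->spin_lock, flags);
	return i;
}

In such a sketch alloc_bulk() would call this first and fall back to the
regular allocation only for whatever remains of 'cnt'.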

> I had tried the global list before, but without the concurrent freeing.
> I think it could reduce the risk of OOM for
> add_del_on_diff_cpu.

Maybe you've tried, but we didn't see the patches, and we cannot take for granted
anyone saying: "I've tried *foo*. It didn't work. That's why I'm doing *bar* here."
Everything in mm is tricky. Little details matter a lot.
It's also questionable whether we should make any design decisions based on this benchmark,
and in particular based on the add_del_on_diff_cpu part of it.
I'm not saying we shouldn't consider it, but all numbers have a "decision weight"
associated with them.
For example, there is the existing samples/bpf/map_perf_test benchmark.
So far we haven't seen the numbers from it.
Is it more important than your new bench? Yes and no. All numbers matter.

> >
> > - ma->waiting_for_gp_tasks_trace will be freeing elements into slab
> >
> > What it means is that sleepable progs using a hashmap will be able to avoid UAF with bpf_rcu_read_lock().
> > Without an explicit bpf_rcu_read_lock() it's still safe and equivalent to the existing behavior of bpf_mem_alloc.
> > (while your proposed BPF_MA_FREE_AFTER_RCU_GP flavor is not safe to use in hashtab with sleepable progs)
> >
> > After that we can unconditionally remove rcu_head/call_rcu from bpf_cpumask and improve usability of bpf_obj_drop.
> > Probably usage of bpf_mem_alloc in local storage can be simplified as well.
> > Martin wdyt?
> >
> > I think this approach adds minimal complexity to bpf_mem_alloc while solving all existing pain points,
> > including the needs of qp-trie.
> Thanks for these great suggestions. Will try to do it in v4.

Thanks.
Also, for the benchmark, pls don't hack htab and the benchmark as 'non-landable patches' (as in this series).
Construct the patch series as:
- prep patches
- benchmark
- unconditional conversion of bpf_ma to REUSE_AFTER_RCU_GP_and_free_after_rcu_tasks_trace,
  with numbers from the bench(es) before and after this patch.
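
For readers following the thread, one possible reading of the proposed
two-stage scheme, heavily simplified: elements become stealable by any cpu
after a vanilla RCU GP and go back to slab only after an additional RCU
tasks trace GP. The names ma->free_by_rcu_tasks_trace, ma->rcu_tt,
ma->tt_cb_in_flight, c->ma and the single-callback guard are assumptions of
this sketch, not the v4 patch:

static void free_after_tasks_trace_gp(struct rcu_head *head)
{
	struct bpf_mem_alloc *ma = container_of(head, struct bpf_mem_alloc, rcu_tt);
	struct llist_node *llnode, *pos, *t;
	unsigned long flags;

	/* Both grace periods have elapsed: whatever was not stolen for
	 * reuse in the meantime goes back to slab.
	 */
	raw_spin_lock_irqsave(&ma->spin_lock, flags);
	llnode = __llist_del_all(&ma->free_by_rcu_tasks_trace);
	raw_spin_unlock_irqrestore(&ma->spin_lock, flags);
	llist_for_each_safe(pos, t, llnode)
		kfree(pos);
	atomic_set(&ma->tt_cb_in_flight, 0);
}

static void reuse_after_rcu_gp(struct rcu_head *head)
{
	struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu);
	struct bpf_mem_alloc *ma = c->ma; /* assumed back-pointer */
	struct llist_node *llnode, *pos, *t;
	unsigned long flags;

	/* A vanilla RCU GP has passed: any cpu's alloc_bulk() may now steal
	 * these elements from the per-ma list under ma->spin_lock.
	 */
	llnode = llist_del_all(&c->waiting_for_gp);
	raw_spin_lock_irqsave(&ma->spin_lock, flags);
	llist_for_each_safe(pos, t, llnode)
		__llist_add(pos, &ma->free_by_rcu_tasks_trace);
	raw_spin_unlock_irqrestore(&ma->spin_lock, flags);

	/* Keep at most one tasks-trace callback in flight per bpf_mem_alloc. */
	if (!atomic_xchg(&ma->tt_cb_in_flight, 1))
		call_rcu_tasks_trace(&ma->rcu_tt, free_after_tasks_trace_gp);
}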


