Hi,

On 5/4/2023 10:00 AM, Alexei Starovoitov wrote:
> On Thu, May 04, 2023 at 09:35:17AM +0800, Hou Tao wrote:
>> Hi,
>>
>> On 5/4/2023 2:48 AM, Alexei Starovoitov wrote:
>>> On Sat, Apr 29, 2023 at 06:12:12PM +0800, Hou Tao wrote:

SNIP

>>> + /* Already queued */
>>> + if (!queue_work(bpf_ma_wq, &c->reuse_work))
>>>
>>> As Martin pointed out queue_work() is not safe here.
>>> The raw_spin_lock(&c->reuse_lock); earlier is not safe either.
>> I see. Didn't recognize these problems.
>>> For the next version please drop workers and spin_lock from unit_free/alloc paths.
>>> If lock has to be taken it should be done from irq_work.
>>> Under no circumstances we can use alloc_workqueue(). No new kthreads.
>> Is there any reason to prohibit the use of new kthread in irq_work ?
> Because:
> 1. there is a workable solution without kthreads.
> 2. if there was no solution we would have to come up with one.
> kthread is not an answer. It's hard to reason about a setup when kthreads
> are in critical path due to scheduler. Assume the system is 100% cpu loaded.
> kthreads delays and behavior is unpredictable. We cannot subject memory alloc/free to it.

I see. Thanks for the explanation.

>
>>> We can avoid adding new flag to bpf_mem_alloc to reduce the complexity
>>> and do roughly equivalent of REUSE_AFTER_RCU_GP unconditionally in the following way:
>>>
>>> - alloc_bulk() won't be trying to steal from c->free_by_rcu.
>>>
>>> - do_call_rcu() does call_rcu(&c->rcu, __free_rcu) instead of task-trace version.
>> No sure whether or not one inflight RCU callback is enough. Will check.
>> If one is not enough, I may use kmalloc(__GFP_NOWAIT) in irq work to
>> allocate multiple RCU callbacks.
> Pls dont. Just assume it will work, implement the proposal (if you agree),
> come back with the numbers and then we will discuss again.
> We cannot keep arguing about merits of complicated patch set that was done on partial data.

OK. Will do.

> Just like the whole thing with kthreads.
> I requested early on: "pls no kthreads" and weeks later we're still arguing.

Sorry about missing that part.

>
>>> - rcu_trace_implies_rcu_gp() is never used.
>>>
>>> - after RCU_GP __free_rcu() moves all waiting_for_gp elements into
>>> a size specific link list per bpf_mem_alloc (not per bpf_mem_cache which is per-cpu)
>>> and does call_rcu_tasks_trace
>>>
>>> - Let's call this list ma->free_by_rcu_tasks_trace
>>> (only one list for bpf_mem_alloc with known size or NUM_CACHES such lists when size == 0 at init)
>>>
>>> - any cpu alloc_bulk() can steal from size specific ma->free_by_rcu_tasks_trace list that
>>> is protected by ma->spin_lock (1 or NUM_CACHES such locks)
>> To reduce the lock contention, alloc_bulk() can steal from the global
>> list in batch.
> Pls no special batches. The simplest implementation possible.
> alloc_bulk() has 'int cnt' argument. It will try to steal 'cnt' from ma->free_by_rcu_tasks_trace.

I see. Will do.

>
>> Had tried the global list before but I didn't do the
>> concurrent freeing, I think it could reduce the risk of OOM for
>> add_del_on_diff_cpu.
> Maybe you've tried, but we didn't see the patches and we cannot take for granted
> anyone saying: "I've tried *foo*. It didn't work. That's why I'm doing *bar* here".
> Everything mm is tricky. Little details matter a lot.

OK. I think it will work. The reason I didn't post it is that I was obsessed
with lock-less bpf ma at that moment.
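To make sure I get the stealing part right, alloc_bulk() would first try to
grab up to 'cnt' elements from the per-ma list before falling back to the
allocator, roughly like the sketch below (ma->spin_lock,
ma->free_by_rcu_tasks_trace and the helper name are all tentative and don't
exist in the current memalloc.c):

/* Sketch only: the ma-level members and the helper name are tentative. */
static struct llist_node *steal_from_free_by_rcu_tasks_trace(struct bpf_mem_alloc *ma,
                                                              int cnt)
{
        struct llist_node *head = NULL, *obj;
        unsigned long flags;

        if (llist_empty(&ma->free_by_rcu_tasks_trace))
                return NULL;

        raw_spin_lock_irqsave(&ma->spin_lock, flags);
        while (cnt--) {
                obj = __llist_del_first(&ma->free_by_rcu_tasks_trace);
                if (!obj)
                        break;
                /* build a private chain; no other CPU can see these nodes now */
                obj->next = head;
                head = obj;
        }
        raw_spin_unlock_irqrestore(&ma->spin_lock, flags);
        return head;
}

alloc_bulk() would then add the stolen elements to its per-cpu free list as it
does today and only call the allocator for whatever is still missing from 'cnt'.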
> It's also questionable whether we should make any design decisions based on this benchmark
> and in particular based on add_del_on_diff_cpu part of it.
> I'm not saying we shouldn't consider it, but all numbers have a "decision weight"
> associated with them.

I see. The reason for adding add_del_on_diff_cpu is just to cover more of the
possible use cases of the bpf memory allocator.

> For example: there is existing samples/bpf/map_perf_test benchmark.
> So far we haven't seen the numbers from it.
> Is it more important than your new bench? Yes and no. All numbers matter.

Will post the benchmark results for map_perf_test in v4. I had planned to
migrate map_perf_test to selftests/bpf/benchs, but couldn't find enough time
to do that.

>
>>> - ma->waiting_for_gp_tasks_trace will be freeing elements into slab
>>>
>>> What it means that sleepable progs using hashmap will be able to avoid uaf with bpf_rcu_read_lock().
>>> Without explicit bpf_rcu_read_lock() it's still safe and equivalent to existing behavior of bpf_mem_alloc.
>>> (while your proposed BPF_MA_FREE_AFTER_RCU_GP flavor is not safe to use in hashtab with sleepable progs)
>>>
>>> After that we can unconditionally remove rcu_head/call_rcu from bpf_cpumask and improve usability of bpf_obj_drop.
>>> Probably usage of bpf_mem_alloc in local storage can be simplified as well.
>>> Martin wdyt?
>>>
>>> I think this approach adds minimal complexity to bpf_mem_alloc while solving all existing pain points
>>> including needs of qp-trie.
>> Thanks for these great suggestions. Will try to do it in v4.
> Thanks.
> Also for benchmark, pls don't hack htab and benchmark as 'non-landable patches' (as in this series).
> Construct the patch series as:
> - prep patches
> - benchmark
> - unconditional convert of bpf_ma to REUSE_AFTER_rcu_GP_and_free_after_rcu_tasks_trace
> with numbers from bench(s) before and after this patch.

Thanks again for the suggestion. Will do in v4.
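Also, to confirm the overall direction before posting v4, below is a rough
sketch of the reuse flow as I understand it. All of the ma-level members
(spin_lock, free_by_rcu_tasks_trace, waiting_for_gp_tasks_trace,
rcu_tasks_trace, tasks_trace_in_progress) and the c->ma back pointer are
tentative and don't exist in the current memalloc.c:

/* Sketch only: the ma-level members and the c->ma back pointer are tentative. */
static void __free_rcu_tasks_trace(struct rcu_head *head)
{
        struct bpf_mem_alloc *ma = container_of(head, struct bpf_mem_alloc, rcu_tasks_trace);

        /* Both a RCU GP and a RCU tasks trace GP have passed: back to slab.
         * The percpu flavour is ignored in this sketch.
         */
        free_all(llist_del_all(&ma->waiting_for_gp_tasks_trace), false);
        atomic_set(&ma->tasks_trace_in_progress, 0);
}

static void __free_rcu(struct rcu_head *head)
{
        struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu);
        struct bpf_mem_alloc *ma = c->ma;
        struct llist_node *llnode, *pos, *t;
        unsigned long flags;

        /* One RCU GP has passed: the elements become stealable by any CPU */
        llnode = llist_del_all(&c->waiting_for_gp);
        raw_spin_lock_irqsave(&ma->spin_lock, flags);
        llist_for_each_safe(pos, t, llnode)
                __llist_add(pos, &ma->free_by_rcu_tasks_trace);
        if (!atomic_xchg(&ma->tasks_trace_in_progress, 1)) {
                /* Only one tasks-trace callback in flight per bpf_mem_alloc:
                 * move the currently accumulated elements towards slab.
                 */
                llist_for_each_safe(pos, t, __llist_del_all(&ma->free_by_rcu_tasks_trace))
                        __llist_add(pos, &ma->waiting_for_gp_tasks_trace);
                call_rcu_tasks_trace(&ma->rcu_tasks_trace, __free_rcu_tasks_trace);
        }
        raw_spin_unlock_irqrestore(&ma->spin_lock, flags);
        atomic_set(&c->call_rcu_in_progress, 0);
}

So elements are stealable by any CPU between the two grace periods and only go
back to slab after both the RCU GP and the RCU tasks trace GP have passed.
Please let me know if this is not what you had in mind.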