Hi,

On 2/15/2023 9:54 AM, Martin KaFai Lau wrote:
> On 2/11/23 8:34 AM, Alexei Starovoitov wrote:
>> On Sat, Feb 11, 2023 at 8:33 AM Alexei Starovoitov
>> <alexei.starovoitov@xxxxxxxxx> wrote:
>>>
>>> On Fri, Feb 10, 2023 at 5:10 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>>>>>> Hou, are you planning to resubmit this change? I also hit this while
>>>>>> testing my changes on bpf-next.
>>>>> Are you talking about the whole patch set or just GFP_ZERO in mem_alloc?
>>>>> The former will take a long time to settle.
>>>>> The latter is trivial.
>>>>> To unblock yourself just add GFP_ZERO in an extra patch?
>>>> Sorry for the long delay. I just found time to do some tests to compare
>>>> the performance of bzero and ctor. After that is done, I will resubmit
>>>> next week.
>>>
>>> I still don't like ctor as a concept. In general, callbacks in the
>>> critical path are guaranteed to be slow due to retpoline overhead.
>>> Please send a patch to add GFP_ZERO.
>>>
>>> Also I realized that we can make the BPF_REUSE_AFTER_RCU_GP flag usable
>>> without risking OOM by only waiting for the normal RCU GP and not
>>> rcu_tasks_trace. This approach will work for the inner nodes of qp-trie,
>>> since bpf progs never see pointers to them. It will work for local
>>> storage converted to bpf_mem_alloc too. It wouldn't need to use its own
>>> call_rcu. It's also safe without uaf caveat in sleepable progs and
>>> sleepable progs
>>
>> I meant 'safe with uaf caveat'.
>> Safe because we wait for rcu_tasks_trace later, before returning the
>> memory to the kernel.

For qp-trie, I had added reuse checking for the inner nodes by embedding a
version in both the child pointer and the child node itself, and it seemed to
work, but using BPF_REUSE_AFTER_RCU_GP for inner nodes will be much simpler to
implement. It also seems that BPF_REUSE_AFTER_RCU_GP is needed for the qp-trie
leaf nodes as well; otherwise the value returned to the caller in a bpf
program or through the syscall may be reused during its usage, just as in the
hash table.
We can change qp-trie to act as a set only (i.e., no value), but that would
limit its usage scenarios. Maybe requiring the caller to use
bpf_rcu_read_lock() is a solution as well. What do you think?

> For local storage, when its owner (sk/task/inode/cgrp) is going away, the
> memory can be reused immediately. No rcu gp is needed.

Now it seems it will wait for an RCU GP, and I think that is still necessary,
because when the process exits, other processes may still access its local
storage through a pidfd or the task_struct of the exited process.

> The local storage delete case (eg. bpf_sk_storage_delete) is the only one
> that needs to be freed by tasks_trace gp because another bpf prog (reader)
> may be under the rcu_read_lock_trace(). I think the idea
> (BPF_REUSE_AFTER_RCU_GP) on allowing reuse after vanilla rcu gp and free
> (if needed) after tasks_trace gp can be extended to the local storage
> delete case. I think we can extend the assumption that "sleepable progs
> (reader) can use explicit bpf_rcu_read_lock() when they want to avoid uaf"
> to bpf_{sk,task,inode,cgrp}_storage_get() also.

It seems bpf_rcu_read_lock() & bpf_rcu_read_unlock() will be used to protect
not only bpf_task_storage_get(), but also the dereferences of the returned
local storage pointer, right? I think qp-trie may also need this.

> I also need the GFP_ZERO in bpf_mem_alloc, so will work on the GFP_ZERO and
> the BPF_REUSE_AFTER_RCU_GP idea. Probably will get the GFP_ZERO out first.

I will continue working on this patchset for GFP_ZERO and the reuse flag. Do
you mean that you want to work together on implementing
BPF_REUSE_AFTER_RCU_GP? How should we cooperate to accomplish that?

>
>>
>>> can use explicit bpf_rcu_read_lock() when they want to avoid uaf.
>>> So please respin the set with rcu gp only and that new flag.