On Mon, Aug 9, 2021 at 6:04 PM Yonghong Song <yhs@xxxxxx> wrote:
>
> Commit b910eaaaa4b8 ("bpf: Fix NULL pointer dereference in
> bpf_get_local_storage() helper") fixed a bug in the bpf_get_local_storage()
> helper so that different tasks don't interfere with each other's
> percpu local storage.
>
> The percpu data contains 8 slots, so it can hold up to 8 contexts
> (same or different tasks) for 8 different program runs at the same
> time. This is sufficient in general. But our internal testing showed
> the following warning multiple times:
>
>   warning: WARNING: CPU: 13 PID: 41661 at include/linux/bpf-cgroup.h:193
>      __cgroup_bpf_run_filter_sock_ops+0x13e/0x180
>   RIP: 0010:__cgroup_bpf_run_filter_sock_ops+0x13e/0x180
>   <IRQ>
>    tcp_call_bpf.constprop.99+0x93/0xc0
>    tcp_conn_request+0x41e/0xa50
>    ? tcp_rcv_state_process+0x203/0xe00
>    tcp_rcv_state_process+0x203/0xe00
>    ? sk_filter_trim_cap+0xbc/0x210
>    ? tcp_v6_inbound_md5_hash.constprop.41+0x44/0x160
>    tcp_v6_do_rcv+0x181/0x3e0
>    tcp_v6_rcv+0xc65/0xcb0
>    ip6_protocol_deliver_rcu+0xbd/0x450
>    ip6_input_finish+0x11/0x20
>    ip6_input+0xb5/0xc0
>    ip6_sublist_rcv_finish+0x37/0x50
>    ip6_sublist_rcv+0x1dc/0x270
>    ipv6_list_rcv+0x113/0x140
>    __netif_receive_skb_list_core+0x1a0/0x210
>    netif_receive_skb_list_internal+0x186/0x2a0
>    gro_normal_list.part.170+0x19/0x40
>    napi_complete_done+0x65/0x150
>    mlx5e_napi_poll+0x1ae/0x680
>    __napi_poll+0x25/0x120
>    net_rx_action+0x11e/0x280
>    __do_softirq+0xbb/0x271
>    irq_exit_rcu+0x97/0xa0
>    common_interrupt+0x7f/0xa0
>   </IRQ>
>    asm_common_interrupt+0x1e/0x40
>   RIP: 0010:bpf_prog_1835a9241238291a_tw_egress+0x5/0xbac
>    ? __cgroup_bpf_run_filter_skb+0x378/0x4e0
>    ? do_softirq+0x34/0x70
>    ? ip6_finish_output2+0x266/0x590
>    ? ip6_finish_output+0x66/0xa0
>    ? ip6_output+0x6c/0x130
>    ? ip6_xmit+0x279/0x550
>    ? ip6_dst_check+0x61/0xd0
>   ...
>
> Using drgn to dump the percpu buffer contents showed that

It's worth putting a reference to drgn here for people not familiar
with what it is:

  [0] https://github.com/osandov/drgn

> on this cpu slot 0 is still available but
> slots 1-7 are occupied, and the tasks in slots 1-7 mostly don't exist
> any more. So we might have issues in bpf_cgroup_storage_unset().
>
> Further debugging confirmed that there is a bug in bpf_cgroup_storage_unset().
> Currently, it tries to unset the "current" slot by searching from the
> start, so the following sequence is possible:
>   1. a task is running and claims slot 0
>   2. the running bpf program is done; its unset() found that slot 0
>      holds the task and is about to reset it to NULL (but hasn't yet)
>   3. an interrupt happens; another bpf program runs and claims slot 1
>      with the *same* task
>   4. the unset() in interrupt context releases slot 0, since it matches "task"
>   5. the interrupt is done; the task in process context resets slot 0
>
> In the end, slot 1 is never reset, and the same process can go on to
> occupy slots 2-7 the same way. Finally, when steps 1-5 repeat once
> more, the step 3 bpf program won't be able to claim an empty slot and
> a warning will be issued.
>
> To fix the issue, the unset() function should traverse from the last
> slot to the first. This way, the above issue can be avoided.
>
> The same reverse traversal should also be done in the
> bpf_get_local_storage() helper itself. Otherwise, incorrect local
> storage may be returned to the bpf program.
>
> Cc: Roman Gushchin <guro@xxxxxx>
> Fixes: b910eaaaa4b8 ("bpf: Fix NULL pointer dereference in bpf_get_local_storage() helper")
> Signed-off-by: Yonghong Song <yhs@xxxxxx>
> ---

LGTM.
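For reviewers less familiar with this code: the claiming side,
bpf_cgroup_storage_set(), scans forward and takes the first free slot,
so when runs of the same task nest (process context interrupted by
softirq, as in the trace above) the most recent claim always sits at
the highest occupied index for that task. A rough sketch of the
claiming logic, simplified from what b910eaaaa4b8 added (the real
function also copies the per-type storage pointers into the slot and
runs with preemption disabled):

  static inline int bpf_cgroup_storage_set(struct bpf_cgroup_storage
					   *storage[MAX_BPF_CGROUP_STORAGE_TYPE])
  {
	int i;

	for (i = 0; i < BPF_CGROUP_STORAGE_NEST_MAX; i++) {
		/* take the first slot not yet owned by any task */
		if (this_cpu_read(bpf_cgroup_storage_info[i].task) != NULL)
			continue;

		this_cpu_write(bpf_cgroup_storage_info[i].task, current);
		/* ... copy storage[] pointers into slot i here ... */
		return 0;
	}
	/* all BPF_CGROUP_STORAGE_NEST_MAX slots busy: nesting too deep */
	return -EBUSY;
  }

Given that, scanning backward in unset() releases the innermost (most
recently claimed) slot for the task first, restoring the LIFO pairing
that nesting requires.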
Acked-by: Andrii Nakryiko <andrii@xxxxxxxxxx>

>  include/linux/bpf-cgroup.h | 4 ++--
>  kernel/bpf/helpers.c       | 4 ++--
>  2 files changed, 4 insertions(+), 4 deletions(-)
>
> This patch targets the bpf tree. In bpf-next,
> Andrii's c7603cfa04e7 ("bpf: Add ambient BPF runtime context stored in current")
> should have fixed the issue too. I am also okay with backporting
> Andrii's patch to the bpf tree if it is viable.

I don't have preferences, but my patch might be a bit harder to
backport, so doing this as a fix is fine with me.

> diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
> index 8b77d08d4b47..6c9b10d82c80 100644
> --- a/include/linux/bpf-cgroup.h
> +++ b/include/linux/bpf-cgroup.h
> @@ -201,8 +201,8 @@ static inline void bpf_cgroup_storage_unset(void)
>  {
>  	int i;
>
> -	for (i = 0; i < BPF_CGROUP_STORAGE_NEST_MAX; i++) {
> -		if (unlikely(this_cpu_read(bpf_cgroup_storage_info[i].task) != current))
> +	for (i = BPF_CGROUP_STORAGE_NEST_MAX - 1; i >= 0; i--) {
> +		if (likely(this_cpu_read(bpf_cgroup_storage_info[i].task) != current))
>  			continue;
>
>  		this_cpu_write(bpf_cgroup_storage_info[i].task, NULL);
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index 62cf00383910..9b3f16eee21f 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -397,8 +397,8 @@ BPF_CALL_2(bpf_get_local_storage, struct bpf_map *, map, u64, flags)
>  	void *ptr;
>  	int i;
>
> -	for (i = 0; i < BPF_CGROUP_STORAGE_NEST_MAX; i++) {
> -		if (unlikely(this_cpu_read(bpf_cgroup_storage_info[i].task) != current))
> +	for (i = BPF_CGROUP_STORAGE_NEST_MAX - 1; i >= 0; i--) {
> +		if (likely(this_cpu_read(bpf_cgroup_storage_info[i].task) != current))
>  			continue;
>
>  		storage = this_cpu_read(bpf_cgroup_storage_info[i].storage[stype]);
> --
> 2.30.2
>
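If it helps to see the slot accounting fail concretely, here is a
small userspace mock of the interleaving in steps 1-5 above. This is a
hypothetical illustration only: the names mirror the kernel's, but
none of it is kernel code.

  #include <stdio.h>

  #define NEST_MAX 8

  static void *slot[NEST_MAX];	/* mock of bpf_cgroup_storage_info[i].task */

  /* mock set(): claim the first free slot, -1 if none (kernel WARNs here) */
  static int set_slot(void *task)
  {
	int i;

	for (i = 0; i < NEST_MAX; i++) {
		if (!slot[i]) {
			slot[i] = task;
			return i;
		}
	}
	return -1;
  }

  /* buggy forward scan: the slot the old unset() would pick for @task */
  static int find_forward(void *task)
  {
	int i;

	for (i = 0; i < NEST_MAX; i++)
		if (slot[i] == task)
			return i;
	return -1;
  }

  int main(void)
  {
	void *task = (void *)1;		/* one task, nesting via an interrupt */
	int round, victim, inner;

	for (round = 0; ; round++) {
		set_slot(task);			 /* step 1: process claims a slot */
		victim = find_forward(task);	 /* step 2: unset picks it, no write yet */
		inner = set_slot(task);		 /* step 3: irq claims the next slot */
		if (inner < 0) {
			printf("no free slot in round %d -> WARN\n", round);
			return 0;
		}
		slot[find_forward(task)] = NULL; /* step 4: irq unset frees the same slot */
		slot[victim] = NULL;		 /* step 5: process clears it again */
	}					 /* the irq's slot leaks every round */
  }

Each round leaks the interrupt's slot, so this prints "no free slot in
round 7 -> WARN". With the reverse scan from this patch, step 4 finds
the irq's own slot (the highest matching index) instead of the
process's, nothing leaks, and the loop runs indefinitely.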