On Thu, Feb 10, 2022 at 12:05 AM Yonghong Song <yhs@xxxxxx> wrote: > > > On 2/9/22 11:52 AM, Kumar Kartikeya Dwivedi wrote: > > On Thu, Feb 10, 2022 at 12:36:08AM IST, Yonghong Song wrote: > >> > >> > >> On 2/8/22 11:03 PM, Kumar Kartikeya Dwivedi wrote: > >>> When both bpf_spin_lock and bpf_timer are present in a BPF map value, > >>> copy_map_value needs to skirt both objects when copying a value into and > >>> out of the map. However, the current code does not set both s_off and > >>> t_off in copy_map_value, which leads to a crash when e.g. bpf_spin_lock > >>> is placed in map value with bpf_timer, as bpf_map_update_elem call will > >>> be able to overwrite the other timer object. > >>> > >>> When the issue is not fixed, an overwriting can produce the following > >>> splat: > >>> > >>> [root@(none) bpf]# ./test_progs -t timer_crash > >>> [ 15.930339] bpf_testmod: loading out-of-tree module taints kernel. > >>> [ 16.037849] ================================================================== > >>> [ 16.038458] BUG: KASAN: user-memory-access in __pv_queued_spin_lock_slowpath+0x32b/0x520 > >>> [ 16.038944] Write of size 8 at addr 0000000000043ec0 by task test_progs/325 > >>> [ 16.039399] > >>> [ 16.039514] CPU: 0 PID: 325 Comm: test_progs Tainted: G OE 5.16.0+ #278 > >>> [ 16.039983] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.15.0-1 04/01/2014 > >>> [ 16.040485] Call Trace: > >>> [ 16.040645] <TASK> > >>> [ 16.040805] dump_stack_lvl+0x59/0x73 > >>> [ 16.041069] ? __pv_queued_spin_lock_slowpath+0x32b/0x520 > >>> [ 16.041427] kasan_report.cold+0x116/0x11b > >>> [ 16.041673] ? __pv_queued_spin_lock_slowpath+0x32b/0x520 > >>> [ 16.042040] __pv_queued_spin_lock_slowpath+0x32b/0x520 > >>> [ 16.042328] ? memcpy+0x39/0x60 > >>> [ 16.042552] ? pv_hash+0xd0/0xd0 > >>> [ 16.042785] ? lockdep_hardirqs_off+0x95/0xd0 > >>> [ 16.043079] __bpf_spin_lock_irqsave+0xdf/0xf0 > >>> [ 16.043366] ? bpf_get_current_comm+0x50/0x50 > >>> [ 16.043608] ? jhash+0x11a/0x270 > >>> [ 16.043848] bpf_timer_cancel+0x34/0xe0 > >>> [ 16.044119] bpf_prog_c4ea1c0f7449940d_sys_enter+0x7c/0x81 > >>> [ 16.044500] bpf_trampoline_6442477838_0+0x36/0x1000 > >>> [ 16.044836] __x64_sys_nanosleep+0x5/0x140 > >>> [ 16.045119] do_syscall_64+0x59/0x80 > >>> [ 16.045377] ? lock_is_held_type+0xe4/0x140 > >>> [ 16.045670] ? irqentry_exit_to_user_mode+0xa/0x40 > >>> [ 16.046001] ? mark_held_locks+0x24/0x90 > >>> [ 16.046287] ? asm_exc_page_fault+0x1e/0x30 > >>> [ 16.046569] ? asm_exc_page_fault+0x8/0x30 > >>> [ 16.046851] ? lockdep_hardirqs_on+0x7e/0x100 > >>> [ 16.047137] entry_SYSCALL_64_after_hwframe+0x44/0xae > >>> [ 16.047405] RIP: 0033:0x7f9e4831718d > >>> [ 16.047602] Code: b4 0c 00 0f 05 eb a9 66 0f 1f 44 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b3 6c 0c 00 f7 d8 64 89 01 48 > >>> [ 16.048764] RSP: 002b:00007fff488086b8 EFLAGS: 00000206 ORIG_RAX: 0000000000000023 > >>> [ 16.049275] RAX: ffffffffffffffda RBX: 00007f9e48683740 RCX: 00007f9e4831718d > >>> [ 16.049747] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00007fff488086d0 > >>> [ 16.050225] RBP: 00007fff488086f0 R08: 00007fff488085d7 R09: 00007f9e4cb594a0 > >>> [ 16.050648] R10: 0000000000000000 R11: 0000000000000206 R12: 00007f9e484cde30 > >>> [ 16.051124] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 > >>> [ 16.051608] </TASK> > >>> [ 16.051762] ================================================================== > >>> > >>> Fixes: 68134668c17f ("bpf: Add map side support for bpf timers.") > >>> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@xxxxxxxxx> > >>> --- > >>> include/linux/bpf.h | 3 ++- > >>> 1 file changed, 2 insertions(+), 1 deletion(-) > >>> > >>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h > >>> index fa517ae604ad..31a83449808b 100644 > >>> --- a/include/linux/bpf.h > >>> +++ b/include/linux/bpf.h > >>> @@ -224,7 +224,8 @@ static inline void copy_map_value(struct bpf_map *map, void *dst, void *src) > >>> if (unlikely(map_value_has_spin_lock(map))) { > >>> s_off = map->spin_lock_off; > >>> s_sz = sizeof(struct bpf_spin_lock); > >>> - } else if (unlikely(map_value_has_timer(map))) { > >>> + } > >>> + if (unlikely(map_value_has_timer(map))) { > >>> t_off = map->timer_off; > >>> t_sz = sizeof(struct bpf_timer); > >>> } > >> > >> Thanks for the patch. I think we have a bigger problem here with the patch. > >> It actually exposed a few kernel bugs. If you run current selftests, esp. > >> ./test_progs -j which is what I tried, you will observe > >> various testing failures. The reason is due to we preserved the timer or > >> spin lock information incorrectly for a map value. > >> > >> For example, the selftest #179 (timer) will fail with this patch and > >> the following change can fix it. > >> > > > > I actually only saw the same failures (on bpf/master) as in CI, and it seems > > they are there even when I do a run without my patch (related to uprobes). The > > bpftool patch PR in GitHub also has the same error, so I'm guessing it is > > unrelated to this. I also didn't see any difference when running on bpf-next. > > > > As far as others are concerned, I didn't see the failure for timer test, or any > > other ones, for me all timer tests pass properly after applying it. It could be > > that my test VM is not triggering it, because it may depend on the runtime > > system/memory values, etc. > > > > Can you share what error you see? Does it crash or does it just fail? > > For test #179 (timer), most time I saw a hung. But I also see > the oops in bpf_timer_set_callback(). > > > > >> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c > >> index d29af9988f37..3336d76cc5a6 100644 > >> --- a/kernel/bpf/hashtab.c > >> +++ b/kernel/bpf/hashtab.c > >> @@ -961,10 +961,11 @@ static struct htab_elem *alloc_htab_elem(struct > >> bpf_htab *htab, void *key, > >> l_new = ERR_PTR(-ENOMEM); > >> goto dec_count; > >> } > >> - check_and_init_map_value(&htab->map, > >> - l_new->key + round_up(key_size, > >> 8)); > >> } > >> > >> + check_and_init_map_value(&htab->map, > >> + l_new->key + round_up(key_size, 8)); > >> + > > > > Makes sense, but trying to understand why it would fail: > > So this is needed because the reused element from per-CPU region might have > > garbage in the bpf_spin_lock/bpf_timer fields? But I think atleast for timer > > case, we reset timer->timer to NULL in bpf_timer_cancel_and_free. > > > > Earlier copy_map_value further below in this code would also overwrite the timer > > part (which usually may be zero), but that would also not happen anymore. > > That is correct. The preallocated hash tables have a free list. Look > like when an element is put into a free list, its value is not reset. I don't follow. How do you think it can happen? htab_delete/update are calling free_htab_elem() which calls check_and_free_timer(). For pre-alloc htab_update calls check_and_free_timer() directly. There should be never a case when timer is active in the free list.