On Fri, Feb 11, 2022 at 05:24:55AM IST, Yonghong Song wrote: > > > On 2/10/22 2:49 PM, Alexei Starovoitov wrote: > > On Thu, Feb 10, 2022 at 12:05 AM Yonghong Song <yhs@xxxxxx> wrote: > > > > > > > > > On 2/9/22 11:52 AM, Kumar Kartikeya Dwivedi wrote: > > > > On Thu, Feb 10, 2022 at 12:36:08AM IST, Yonghong Song wrote: > > > > > > > > > > > > > > > On 2/8/22 11:03 PM, Kumar Kartikeya Dwivedi wrote: > > > > > > When both bpf_spin_lock and bpf_timer are present in a BPF map value, > > > > > > copy_map_value needs to skirt both objects when copying a value into and > > > > > > out of the map. However, the current code does not set both s_off and > > > > > > t_off in copy_map_value, which leads to a crash when e.g. bpf_spin_lock > > > > > > is placed in map value with bpf_timer, as bpf_map_update_elem call will > > > > > > be able to overwrite the other timer object. > > > > > > > > > > > > When the issue is not fixed, an overwriting can produce the following > > > > > > splat: > > > > > > > > > > > > [root@(none) bpf]# ./test_progs -t timer_crash > > > > > > [ 15.930339] bpf_testmod: loading out-of-tree module taints kernel. > > > > > > [ 16.037849] ================================================================== > > > > > > [ 16.038458] BUG: KASAN: user-memory-access in __pv_queued_spin_lock_slowpath+0x32b/0x520 > > > > > > [ 16.038944] Write of size 8 at addr 0000000000043ec0 by task test_progs/325 > > > > > > [ 16.039399] > > > > > > [ 16.039514] CPU: 0 PID: 325 Comm: test_progs Tainted: G OE 5.16.0+ #278 > > > > > > [ 16.039983] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.15.0-1 04/01/2014 > > > > > > [ 16.040485] Call Trace: > > > > > > [ 16.040645] <TASK> > > > > > > [ 16.040805] dump_stack_lvl+0x59/0x73 > > > > > > [ 16.041069] ? __pv_queued_spin_lock_slowpath+0x32b/0x520 > > > > > > [ 16.041427] kasan_report.cold+0x116/0x11b > > > > > > [ 16.041673] ? __pv_queued_spin_lock_slowpath+0x32b/0x520 > > > > > > [ 16.042040] __pv_queued_spin_lock_slowpath+0x32b/0x520 > > > > > > [ 16.042328] ? memcpy+0x39/0x60 > > > > > > [ 16.042552] ? pv_hash+0xd0/0xd0 > > > > > > [ 16.042785] ? lockdep_hardirqs_off+0x95/0xd0 > > > > > > [ 16.043079] __bpf_spin_lock_irqsave+0xdf/0xf0 > > > > > > [ 16.043366] ? bpf_get_current_comm+0x50/0x50 > > > > > > [ 16.043608] ? jhash+0x11a/0x270 > > > > > > [ 16.043848] bpf_timer_cancel+0x34/0xe0 > > > > > > [ 16.044119] bpf_prog_c4ea1c0f7449940d_sys_enter+0x7c/0x81 > > > > > > [ 16.044500] bpf_trampoline_6442477838_0+0x36/0x1000 > > > > > > [ 16.044836] __x64_sys_nanosleep+0x5/0x140 > > > > > > [ 16.045119] do_syscall_64+0x59/0x80 > > > > > > [ 16.045377] ? lock_is_held_type+0xe4/0x140 > > > > > > [ 16.045670] ? irqentry_exit_to_user_mode+0xa/0x40 > > > > > > [ 16.046001] ? mark_held_locks+0x24/0x90 > > > > > > [ 16.046287] ? asm_exc_page_fault+0x1e/0x30 > > > > > > [ 16.046569] ? asm_exc_page_fault+0x8/0x30 > > > > > > [ 16.046851] ? lockdep_hardirqs_on+0x7e/0x100 > > > > > > [ 16.047137] entry_SYSCALL_64_after_hwframe+0x44/0xae > > > > > > [ 16.047405] RIP: 0033:0x7f9e4831718d > > > > > > [ 16.047602] Code: b4 0c 00 0f 05 eb a9 66 0f 1f 44 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b3 6c 0c 00 f7 d8 64 89 01 48 > > > > > > [ 16.048764] RSP: 002b:00007fff488086b8 EFLAGS: 00000206 ORIG_RAX: 0000000000000023 > > > > > > [ 16.049275] RAX: ffffffffffffffda RBX: 00007f9e48683740 RCX: 00007f9e4831718d > > > > > > [ 16.049747] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00007fff488086d0 > > > > > > [ 16.050225] RBP: 00007fff488086f0 R08: 00007fff488085d7 R09: 00007f9e4cb594a0 > > > > > > [ 16.050648] R10: 0000000000000000 R11: 0000000000000206 R12: 00007f9e484cde30 > > > > > > [ 16.051124] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 > > > > > > [ 16.051608] </TASK> > > > > > > [ 16.051762] ================================================================== > > > > > > > > > > > > Fixes: 68134668c17f ("bpf: Add map side support for bpf timers.") > > > > > > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@xxxxxxxxx> > > > > > > --- > > > > > > include/linux/bpf.h | 3 ++- > > > > > > 1 file changed, 2 insertions(+), 1 deletion(-) > > > > > > > > > > > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h > > > > > > index fa517ae604ad..31a83449808b 100644 > > > > > > --- a/include/linux/bpf.h > > > > > > +++ b/include/linux/bpf.h > > > > > > @@ -224,7 +224,8 @@ static inline void copy_map_value(struct bpf_map *map, void *dst, void *src) > > > > > > if (unlikely(map_value_has_spin_lock(map))) { > > > > > > s_off = map->spin_lock_off; > > > > > > s_sz = sizeof(struct bpf_spin_lock); > > > > > > - } else if (unlikely(map_value_has_timer(map))) { > > > > > > + } > > > > > > + if (unlikely(map_value_has_timer(map))) { > > > > > > t_off = map->timer_off; > > > > > > t_sz = sizeof(struct bpf_timer); > > > > > > } > > > > > > > > > > Thanks for the patch. I think we have a bigger problem here with the patch. > > > > > It actually exposed a few kernel bugs. If you run current selftests, esp. > > > > > ./test_progs -j which is what I tried, you will observe > > > > > various testing failures. The reason is due to we preserved the timer or > > > > > spin lock information incorrectly for a map value. > > > > > > > > > > For example, the selftest #179 (timer) will fail with this patch and > > > > > the following change can fix it. > > > > > > > > > > > > > I actually only saw the same failures (on bpf/master) as in CI, and it seems > > > > they are there even when I do a run without my patch (related to uprobes). The > > > > bpftool patch PR in GitHub also has the same error, so I'm guessing it is > > > > unrelated to this. I also didn't see any difference when running on bpf-next. > > > > > > > > As far as others are concerned, I didn't see the failure for timer test, or any > > > > other ones, for me all timer tests pass properly after applying it. It could be > > > > that my test VM is not triggering it, because it may depend on the runtime > > > > system/memory values, etc. > > > > > > > > Can you share what error you see? Does it crash or does it just fail? > > > > > > For test #179 (timer), most time I saw a hung. But I also see > > > the oops in bpf_timer_set_callback(). > > > > > > > > > > > > diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c > > > > > index d29af9988f37..3336d76cc5a6 100644 > > > > > --- a/kernel/bpf/hashtab.c > > > > > +++ b/kernel/bpf/hashtab.c > > > > > @@ -961,10 +961,11 @@ static struct htab_elem *alloc_htab_elem(struct > > > > > bpf_htab *htab, void *key, > > > > > l_new = ERR_PTR(-ENOMEM); > > > > > goto dec_count; > > > > > } > > > > > - check_and_init_map_value(&htab->map, > > > > > - l_new->key + round_up(key_size, > > > > > 8)); > > > > > } > > > > > > > > > > + check_and_init_map_value(&htab->map, > > > > > + l_new->key + round_up(key_size, 8)); > > > > > + > > > > > > > > Makes sense, but trying to understand why it would fail: > > > > So this is needed because the reused element from per-CPU region might have > > > > garbage in the bpf_spin_lock/bpf_timer fields? But I think atleast for timer > > > > case, we reset timer->timer to NULL in bpf_timer_cancel_and_free. > > > > > > > > Earlier copy_map_value further below in this code would also overwrite the timer > > > > part (which usually may be zero), but that would also not happen anymore. > > > > > > That is correct. The preallocated hash tables have a free list. Look > > > like when an element is put into a free list, its value is not reset. > > > > I don't follow. How do you think it can happen? > > htab_delete/update are calling free_htab_elem() > > which calls check_and_free_timer(). > > For pre-alloc htab_update calls check_and_free_timer() directly. > > There should be never a case when timer is active in the free list. > > The issue is not a timer active in the free list. It is the timer value > is not reset to 0 in the free list. > For example, > 1. value->timer... is set properly (non zero) > 2. value is deleted through update or delete, value->timer > is cancelled and freed, and the hash_elem is put into > free list. But the hash_elem value->timer is not zero. But in all cases, check_and_free_timer was called right? Which then calls bpf_timer_cancel_and_free which does this: 1336 void bpf_timer_cancel_and_free(void *val) 1337 { 1338 struct bpf_timer_kern *timer = val; 1339 struct bpf_hrtimer *t; 1340 1341 /* Performance optimization: read timer->timer without lock first. */ 1342 if (!READ_ONCE(timer->timer)) 1343 return; 1344 1345 __bpf_spin_lock_irqsave(&timer->lock); 1346 /* re-read it under lock */ 1347 t = timer->timer; 1348 if (!t) 1349 goto out; 1350 drop_prog_refcnt(t); 1351 /* The subsequent bpf_timer_start/cancel() helpers won't be able to use 1352 * this timer, since it won't be initialized. 1353 */ 1354 timer->timer = NULL; ... So the timer->timer was set to NULL in the map value. > 3. one hash_elem is picked up from the free list, > and proper value is copied to the value except value->timer > and value->spinlock (if they exist). This happens with this patch. > 4. some later kernel functions may see value->timer is set and > do something bad ... -- Kartikeya