On 2/10/22 4:02 PM, Kumar Kartikeya Dwivedi wrote:
On Fri, Feb 11, 2022 at 05:24:55AM IST, Yonghong Song wrote:
On 2/10/22 2:49 PM, Alexei Starovoitov wrote:
On Thu, Feb 10, 2022 at 12:05 AM Yonghong Song <yhs@xxxxxx> wrote:
On 2/9/22 11:52 AM, Kumar Kartikeya Dwivedi wrote:
On Thu, Feb 10, 2022 at 12:36:08AM IST, Yonghong Song wrote:
On 2/8/22 11:03 PM, Kumar Kartikeya Dwivedi wrote:
When both bpf_spin_lock and bpf_timer are present in a BPF map value,
copy_map_value needs to skirt both objects when copying a value into and
out of the map. However, the current code does not set both s_off and
t_off in copy_map_value, which leads to a crash when e.g. bpf_spin_lock
is placed in map value with bpf_timer, as bpf_map_update_elem call will
be able to overwrite the other timer object.
When the issue is not fixed, an overwriting can produce the following
splat:
[root@(none) bpf]# ./test_progs -t timer_crash
[ 15.930339] bpf_testmod: loading out-of-tree module taints kernel.
[ 16.037849] ==================================================================
[ 16.038458] BUG: KASAN: user-memory-access in __pv_queued_spin_lock_slowpath+0x32b/0x520
[ 16.038944] Write of size 8 at addr 0000000000043ec0 by task test_progs/325
[ 16.039399]
[ 16.039514] CPU: 0 PID: 325 Comm: test_progs Tainted: G OE 5.16.0+ #278
[ 16.039983] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.15.0-1 04/01/2014
[ 16.040485] Call Trace:
[ 16.040645] <TASK>
[ 16.040805] dump_stack_lvl+0x59/0x73
[ 16.041069] ? __pv_queued_spin_lock_slowpath+0x32b/0x520
[ 16.041427] kasan_report.cold+0x116/0x11b
[ 16.041673] ? __pv_queued_spin_lock_slowpath+0x32b/0x520
[ 16.042040] __pv_queued_spin_lock_slowpath+0x32b/0x520
[ 16.042328] ? memcpy+0x39/0x60
[ 16.042552] ? pv_hash+0xd0/0xd0
[ 16.042785] ? lockdep_hardirqs_off+0x95/0xd0
[ 16.043079] __bpf_spin_lock_irqsave+0xdf/0xf0
[ 16.043366] ? bpf_get_current_comm+0x50/0x50
[ 16.043608] ? jhash+0x11a/0x270
[ 16.043848] bpf_timer_cancel+0x34/0xe0
[ 16.044119] bpf_prog_c4ea1c0f7449940d_sys_enter+0x7c/0x81
[ 16.044500] bpf_trampoline_6442477838_0+0x36/0x1000
[ 16.044836] __x64_sys_nanosleep+0x5/0x140
[ 16.045119] do_syscall_64+0x59/0x80
[ 16.045377] ? lock_is_held_type+0xe4/0x140
[ 16.045670] ? irqentry_exit_to_user_mode+0xa/0x40
[ 16.046001] ? mark_held_locks+0x24/0x90
[ 16.046287] ? asm_exc_page_fault+0x1e/0x30
[ 16.046569] ? asm_exc_page_fault+0x8/0x30
[ 16.046851] ? lockdep_hardirqs_on+0x7e/0x100
[ 16.047137] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 16.047405] RIP: 0033:0x7f9e4831718d
[ 16.047602] Code: b4 0c 00 0f 05 eb a9 66 0f 1f 44 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b3 6c 0c 00 f7 d8 64 89 01 48
[ 16.048764] RSP: 002b:00007fff488086b8 EFLAGS: 00000206 ORIG_RAX: 0000000000000023
[ 16.049275] RAX: ffffffffffffffda RBX: 00007f9e48683740 RCX: 00007f9e4831718d
[ 16.049747] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00007fff488086d0
[ 16.050225] RBP: 00007fff488086f0 R08: 00007fff488085d7 R09: 00007f9e4cb594a0
[ 16.050648] R10: 0000000000000000 R11: 0000000000000206 R12: 00007f9e484cde30
[ 16.051124] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 16.051608] </TASK>
[ 16.051762] ==================================================================
Fixes: 68134668c17f ("bpf: Add map side support for bpf timers.")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@xxxxxxxxx>
---
include/linux/bpf.h | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index fa517ae604ad..31a83449808b 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -224,7 +224,8 @@ static inline void copy_map_value(struct bpf_map *map, void *dst, void *src)
if (unlikely(map_value_has_spin_lock(map))) {
s_off = map->spin_lock_off;
s_sz = sizeof(struct bpf_spin_lock);
- } else if (unlikely(map_value_has_timer(map))) {
+ }
+ if (unlikely(map_value_has_timer(map))) {
t_off = map->timer_off;
t_sz = sizeof(struct bpf_timer);
}
Thanks for the patch. I think we have a bigger problem here with the patch.
It actually exposed a few kernel bugs. If you run current selftests, esp.
./test_progs -j which is what I tried, you will observe
various testing failures. The reason is due to we preserved the timer or
spin lock information incorrectly for a map value.
For example, the selftest #179 (timer) will fail with this patch and
the following change can fix it.
I actually only saw the same failures (on bpf/master) as in CI, and it seems
they are there even when I do a run without my patch (related to uprobes). The
bpftool patch PR in GitHub also has the same error, so I'm guessing it is
unrelated to this. I also didn't see any difference when running on bpf-next.
As far as others are concerned, I didn't see the failure for timer test, or any
other ones, for me all timer tests pass properly after applying it. It could be
that my test VM is not triggering it, because it may depend on the runtime
system/memory values, etc.
Can you share what error you see? Does it crash or does it just fail?
For test #179 (timer), most time I saw a hung. But I also see
the oops in bpf_timer_set_callback().
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index d29af9988f37..3336d76cc5a6 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -961,10 +961,11 @@ static struct htab_elem *alloc_htab_elem(struct
bpf_htab *htab, void *key,
l_new = ERR_PTR(-ENOMEM);
goto dec_count;
}
- check_and_init_map_value(&htab->map,
- l_new->key + round_up(key_size,
8));
}
+ check_and_init_map_value(&htab->map,
+ l_new->key + round_up(key_size, 8));
+
Makes sense, but trying to understand why it would fail:
So this is needed because the reused element from per-CPU region might have
garbage in the bpf_spin_lock/bpf_timer fields? But I think atleast for timer
case, we reset timer->timer to NULL in bpf_timer_cancel_and_free.
Earlier copy_map_value further below in this code would also overwrite the timer
part (which usually may be zero), but that would also not happen anymore.
That is correct. The preallocated hash tables have a free list. Look
like when an element is put into a free list, its value is not reset.
I don't follow. How do you think it can happen?
htab_delete/update are calling free_htab_elem()
which calls check_and_free_timer().
For pre-alloc htab_update calls check_and_free_timer() directly.
There should be never a case when timer is active in the free list.
The issue is not a timer active in the free list. It is the timer value
is not reset to 0 in the free list.
For example,
1. value->timer... is set properly (non zero)
2. value is deleted through update or delete, value->timer
is cancelled and freed, and the hash_elem is put into
free list. But the hash_elem value->timer is not zero.
But in all cases, check_and_free_timer was called right? Which then calls
bpf_timer_cancel_and_free which does this:
1336 void bpf_timer_cancel_and_free(void *val)
1337 {
1338 struct bpf_timer_kern *timer = val;
1339 struct bpf_hrtimer *t;
1340
1341 /* Performance optimization: read timer->timer without lock first. */
1342 if (!READ_ONCE(timer->timer))
1343 return;
1344
1345 __bpf_spin_lock_irqsave(&timer->lock);
1346 /* re-read it under lock */
1347 t = timer->timer;
1348 if (!t)
1349 goto out;
1350 drop_prog_refcnt(t);
1351 /* The subsequent bpf_timer_start/cancel() helpers won't be able to use
1352 * this timer, since it won't be initialized.
1353 */
1354 timer->timer = NULL;
...
So the timer->timer was set to NULL in the map value.
Looking at one call site:
bpf_timer_cancel_and_free(elem->key +
round_up(htab->map.key_size, 8) +
htab->map.timer_off);
So I am talking about to have
*(struct bpf_timer *)value = (struct bpf_timer){};
The timer->timer is for bpf_hrtimer.
3. one hash_elem is picked up from the free list,
and proper value is copied to the value except value->timer
and value->spinlock (if they exist). This happens with this patch.
4. some later kernel functions may see value->timer is set and
do something bad ...
--
Kartikeya