Re: [PATCH bpf v2 1/2] bpf: Fix crash due to incorrect copy_map_value

On 2/9/22 11:52 AM, Kumar Kartikeya Dwivedi wrote:
On Thu, Feb 10, 2022 at 12:36:08AM IST, Yonghong Song wrote:


On 2/8/22 11:03 PM, Kumar Kartikeya Dwivedi wrote:
When both bpf_spin_lock and bpf_timer are present in a BPF map value,
copy_map_value needs to skirt both objects when copying a value into and
out of the map. However, the current code does not set both s_off and
t_off in copy_map_value, which leads to a crash when e.g. bpf_spin_lock
is placed in a map value together with bpf_timer, as a bpf_map_update_elem
call will be able to overwrite the other object (the timer).

Without the fix, such an overwrite can produce the following splat:

[root@(none) bpf]# ./test_progs -t timer_crash
[   15.930339] bpf_testmod: loading out-of-tree module taints kernel.
[   16.037849] ==================================================================
[   16.038458] BUG: KASAN: user-memory-access in __pv_queued_spin_lock_slowpath+0x32b/0x520
[   16.038944] Write of size 8 at addr 0000000000043ec0 by task test_progs/325
[   16.039399]
[   16.039514] CPU: 0 PID: 325 Comm: test_progs Tainted: G           OE     5.16.0+ #278
[   16.039983] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.15.0-1 04/01/2014
[   16.040485] Call Trace:
[   16.040645]  <TASK>
[   16.040805]  dump_stack_lvl+0x59/0x73
[   16.041069]  ? __pv_queued_spin_lock_slowpath+0x32b/0x520
[   16.041427]  kasan_report.cold+0x116/0x11b
[   16.041673]  ? __pv_queued_spin_lock_slowpath+0x32b/0x520
[   16.042040]  __pv_queued_spin_lock_slowpath+0x32b/0x520
[   16.042328]  ? memcpy+0x39/0x60
[   16.042552]  ? pv_hash+0xd0/0xd0
[   16.042785]  ? lockdep_hardirqs_off+0x95/0xd0
[   16.043079]  __bpf_spin_lock_irqsave+0xdf/0xf0
[   16.043366]  ? bpf_get_current_comm+0x50/0x50
[   16.043608]  ? jhash+0x11a/0x270
[   16.043848]  bpf_timer_cancel+0x34/0xe0
[   16.044119]  bpf_prog_c4ea1c0f7449940d_sys_enter+0x7c/0x81
[   16.044500]  bpf_trampoline_6442477838_0+0x36/0x1000
[   16.044836]  __x64_sys_nanosleep+0x5/0x140
[   16.045119]  do_syscall_64+0x59/0x80
[   16.045377]  ? lock_is_held_type+0xe4/0x140
[   16.045670]  ? irqentry_exit_to_user_mode+0xa/0x40
[   16.046001]  ? mark_held_locks+0x24/0x90
[   16.046287]  ? asm_exc_page_fault+0x1e/0x30
[   16.046569]  ? asm_exc_page_fault+0x8/0x30
[   16.046851]  ? lockdep_hardirqs_on+0x7e/0x100
[   16.047137]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[   16.047405] RIP: 0033:0x7f9e4831718d
[   16.047602] Code: b4 0c 00 0f 05 eb a9 66 0f 1f 44 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b3 6c 0c 00 f7 d8 64 89 01 48
[   16.048764] RSP: 002b:00007fff488086b8 EFLAGS: 00000206 ORIG_RAX: 0000000000000023
[   16.049275] RAX: ffffffffffffffda RBX: 00007f9e48683740 RCX: 00007f9e4831718d
[   16.049747] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00007fff488086d0
[   16.050225] RBP: 00007fff488086f0 R08: 00007fff488085d7 R09: 00007f9e4cb594a0
[   16.050648] R10: 0000000000000000 R11: 0000000000000206 R12: 00007f9e484cde30
[   16.051124] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[   16.051608]  </TASK>
[   16.051762] ==================================================================

Fixes: 68134668c17f ("bpf: Add map side support for bpf timers.")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@xxxxxxxxx>
---
   include/linux/bpf.h | 3 ++-
   1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index fa517ae604ad..31a83449808b 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -224,7 +224,8 @@ static inline void copy_map_value(struct bpf_map *map, void *dst, void *src)
   	if (unlikely(map_value_has_spin_lock(map))) {
   		s_off = map->spin_lock_off;
   		s_sz = sizeof(struct bpf_spin_lock);
-	} else if (unlikely(map_value_has_timer(map))) {
+	}
+	if (unlikely(map_value_has_timer(map))) {
   		t_off = map->timer_off;
   		t_sz = sizeof(struct bpf_timer);
   	}
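
For readers following the fix, here is a minimal userspace sketch of the
"copy around two reserved regions" pattern that the corrected copy_map_value()
has to perform. Everything below (the struct layout, field names, and the
copy_skipping_two() helper) is invented for illustration and is not kernel
code; it only mirrors the idea that the spin_lock and timer offsets must be
handled independently.

/*
 * Illustration only: fake types and offsets, not kernel code. The point is
 * that a value copy must jump over *both* reserved regions (spin_lock and
 * timer), which the fixed copy_map_value() achieves by setting s_off and
 * t_off independently instead of in an else-if.
 */
#include <stddef.h>
#include <stdio.h>
#include <string.h>

struct fake_map_value {
	long counter;		/* ordinary data, must be copied     */
	int fake_spin_lock;	/* stand-in for struct bpf_spin_lock */
	long more_data;		/* ordinary data, must be copied     */
	long fake_timer[2];	/* stand-in for struct bpf_timer     */
	long tail_data;		/* ordinary data, must be copied     */
};

/* Copy everything except the two [off, off+sz) regions; assumes off1 < off2. */
static void copy_skipping_two(void *dst, const void *src, size_t total,
			      size_t off1, size_t sz1, size_t off2, size_t sz2)
{
	char *d = dst;
	const char *s = src;

	memcpy(d, s, off1);
	memcpy(d + off1 + sz1, s + off1 + sz1, off2 - off1 - sz1);
	memcpy(d + off2 + sz2, s + off2 + sz2, total - off2 - sz2);
}

int main(void)
{
	struct fake_map_value src = { .counter = 1, .more_data = 2, .tail_data = 3 };
	struct fake_map_value dst = { .fake_spin_lock = 42, .fake_timer = { 7, 7 } };

	copy_skipping_two(&dst, &src, sizeof(dst),
			  offsetof(struct fake_map_value, fake_spin_lock),
			  sizeof(dst.fake_spin_lock),
			  offsetof(struct fake_map_value, fake_timer),
			  sizeof(dst.fake_timer));

	/* dst keeps its lock/timer stand-ins (42 and 7); the data fields follow src. */
	printf("%ld %d %ld %ld %ld\n", dst.counter, dst.fake_spin_lock,
	       dst.more_data, dst.fake_timer[0], dst.tail_data);
	return 0;
}

With the pre-fix else-if, only one of the two skips is applied, so a map
update scribbles over whichever object was not selected; that is the
overwrite behind the splat above.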

Thanks for the patch. I think we have a bigger problem here, though: the patch
actually exposed a few kernel bugs. If you run the current selftests, esp.
./test_progs -j which is what I tried, you will observe various test failures.
The reason is that we preserve the timer or spin_lock information of a map
value incorrectly.

For example, selftest #179 (timer) will fail with this patch, and the following
change can fix it.


I actually only saw the same failures (on bpf/master) as in CI, and it seems
they are there even when I do a run without my patch (they are related to
uprobes). The bpftool patch PR on GitHub also has the same error, so I'm
guessing it is unrelated to this. I also didn't see any difference when
running on bpf-next.

As for the other tests, I didn't see the failure for the timer test, or any
other ones; for me all timer tests pass properly after applying the patch. It
could be that my test VM is not triggering it, because it may depend on the
runtime system/memory values, etc.

Can you share what error you see? Does it crash or does it just fail?

For test #179 (timer), most of the time I saw a hang, but I also saw an oops
in bpf_timer_set_callback().


diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index d29af9988f37..3336d76cc5a6 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -961,10 +961,11 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
                         l_new = ERR_PTR(-ENOMEM);
                         goto dec_count;
                 }
-               check_and_init_map_value(&htab->map,
-                                        l_new->key + round_up(key_size, 8));
         }

+       check_and_init_map_value(&htab->map,
+                                l_new->key + round_up(key_size, 8));
+

Makes sense, but I'm trying to understand why it would fail:
so this is needed because the reused element from the per-CPU region might have
garbage in the bpf_spin_lock/bpf_timer fields? But I think at least for the
timer case, we reset timer->timer to NULL in bpf_timer_cancel_and_free.

Previously, the copy_map_value further below in this code would also overwrite
the timer part (which is usually zero), but that would no longer happen.

That is correct. The preallocated hash tables have a free list, and it looks
like when an element is put into the free list, its value is not reset.
In general this is fine, as later on a new value will be copied over the old
one, but that excludes the spin_lock and timer fields. spin_lock is probably
fine as it just contains a value like the unlocked state, but timer is a
problem: the timer has been released, and reusing the stale timer could cause
all kinds of issues.

Without your patch, if a spin_lock exists the timer is always reset to 0, and
hence we won't have problems. If the spin_lock doesn't exist, it looks like
we could still have issues.
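
To make the free-list reuse hazard concrete, here is a small standalone sketch.
The types and helpers below (fake_elem, fake_timer, elem_alloc/elem_free) are
invented for illustration and are not the htab or bpf_timer code; they only
show why a recycled element's timer field has to be cleared before the element
is handed out again.

#include <stdio.h>
#include <stdlib.h>

struct fake_timer { void (*cb)(void); };

struct fake_elem {
	struct fake_elem *freelist_next;
	struct fake_timer *timer;	/* like the kernel-side timer pointer */
	long value;
};

static struct fake_elem *freelist;

static void elem_free(struct fake_elem *e)
{
	free(e->timer);			/* the timer object is gone...        */
	/* Hazard: e->timer is left dangling instead of being cleared.       */
	e->freelist_next = freelist;	/* ...but the element is recycled.    */
	freelist = e;
}

static struct fake_elem *elem_alloc(void)
{
	struct fake_elem *e = freelist;

	freelist = e->freelist_next;
	/* The fix discussed below amounts to doing this when the element
	 * leaves the free list (comment it out to see the stale pointer):
	 */
	e->timer = NULL;
	return e;
}

int main(void)
{
	struct fake_elem *e = calloc(1, sizeof(*e));

	e->timer = calloc(1, sizeof(*e->timer));
	elem_free(e);			/* element goes back on the free list */

	e = elem_alloc();		/* ...and is handed out again         */
	if (e->timer)
		printf("stale timer %p would be used after free\n", (void *)e->timer);
	else
		printf("timer field cleared on reuse, no stale pointer\n");
	return 0;
}

Yonghong's patch below does the clearing on the "alloc" side, when the element
comes off the free list; as he notes at the end, clearing right before the
element enters the free list would be the other option.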

My above change is really just a hack.

The following is (sort of) a proper patch, and with your patch the
selftests do pass in my environment.

commit cf9d4a47d95dc992504a717e4c39bbf2bac8b41c (HEAD -> tmp10)
Author: Yonghong Song <yhs@xxxxxx>
Date:   Wed Feb 9 19:53:42 2022 -0800

    bpf: clear timer field for new elements in preallocated hash tables

    Currently, when a hash element is freed and released to the freelist
    pool, its value is not reset. This may cause a problem if the value
    contains a timer. Since the timer has been cancelled and freed,
    reusing the old pointer may cause various kernel issues including
    a hang or oops. This can happen with the below command:
      ./test_progs -t timer

    There are two approaches to resolve this. One is to reset the
    timer value when the element is put on the freelist, and the second
    is to clear the timer value when the element from the freelist
    is ready to be used. This patch implements the second approach.

    Signed-off-by: Yonghong Song <yhs@xxxxxx>

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index d29af9988f37..558f3c40307a 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -916,6 +916,12 @@ static bool fd_htab_map_needs_adjust(const struct bpf_htab *htab)
               BITS_PER_LONG == 64;
 }

+static inline void check_and_init_timer_value(struct bpf_map *map, void *dst)
+{
+       if (unlikely(map_value_has_timer(map)))
+               memset(dst + map->timer_off, 0, sizeof(struct bpf_timer));
+}
+
 static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
                                         void *value, u32 key_size, u32 hash,
                                         bool percpu, bool onallcpus,
@@ -925,6 +931,7 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
        bool prealloc = htab_is_prealloc(htab);
        struct htab_elem *l_new, **pl_new;
        void __percpu *pptr;
+       void *new_value;

        if (prealloc) {
                if (old_elem) {
@@ -989,9 +996,9 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
                size = round_up(size, 8);
                memcpy(l_new->key + round_up(key_size, 8), value, size);
        } else {
-               copy_map_value(&htab->map,
-                              l_new->key + round_up(key_size, 8),
-                              value);
+               new_value = l_new->key + round_up(key_size, 8);
+               check_and_init_timer_value(&htab->map, new_value);
+               copy_map_value(&htab->map, new_value, value);
        }

        l_new->hash = hash;
@@ -1127,6 +1134,7 @@ static int htab_lru_map_update_elem(struct bpf_map *map, void *key, void *value,
        unsigned long flags;
        struct bucket *b;
        u32 key_size, hash;
+       void *new_value;
        int ret;

        if (unlikely(map_flags > BPF_EXIST))
@@ -1151,8 +1159,10 @@ static int htab_lru_map_update_elem(struct bpf_map *map, void *key, void *value,
        l_new = prealloc_lru_pop(htab, key, hash);
        if (!l_new)
                return -ENOMEM;
-       copy_map_value(&htab->map,
-                      l_new->key + round_up(map->key_size, 8), value);
+
+       new_value = l_new->key + round_up(map->key_size, 8);
+       check_and_init_timer_value(&htab->map, new_value);
+       copy_map_value(&htab->map, new_value, value);

        ret = htab_lock_bucket(htab, b, hash, &flags);


         memcpy(l_new->key, key, key_size);
         if (percpu) {
                 size = round_up(size, 8);

Basically, the above clears the new map value's timer/spin_lock fields for
*all* new items.

There are still some selftest failures due to this patch.
Could you run the selftests, with any necessary additional fixes, to ensure
all selftests pass?

Assuming that the uprobe failures are unrelated, should I add this diff and
respin? Or are you seeing other errors as well, apart from the timer (#179) test?

Yes, please incorporate my above change into your patch and resubmit.
Folks may have different opinions, e.g., clearing the timer field right
before the element enters the free list.


--
Kartikeya


