Re: yet another approach Was: [PATCH bpf-next v3 4/5] bpf, x86: Add jit support for private stack

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




On 10/8/24 3:10 PM, Alexei Starovoitov wrote:
On Fri, Oct 4, 2024 at 7:03 PM Yonghong Song <yonghong.song@xxxxxxxxx> wrote:

On 10/4/24 12:52 PM, Alexei Starovoitov wrote:
On Fri, Oct 4, 2024 at 12:28 PM Yonghong Song <yonghong.song@xxxxxxxxx> wrote:
On 10/3/24 10:22 PM, Yonghong Song wrote:
On 10/3/24 3:32 PM, Alexei Starovoitov wrote:
On Thu, Oct 3, 2024 at 1:44 PM Yonghong Song
<yonghong.song@xxxxxxxxx> wrote:
Looks like the idea needs more thought.

in_task_stack() won't recognize the private stack,
so it will look like stack overflow and double fault.

do you have CONFIG_VMAP_STACK ?
Yes, my above test runs fine withCONFIG_VMAP_STACK. Let me guard
private stack support with
CONFIG_VMAP_STACK for now. Not sure whether distributions enable
CONFIG_VMAP_STACK or not.
Good! but I'm surprised it makes a difference.
That only for the test case I tried. Now I tried the whole bpf selftests
with CONFIG_VMAP_STACK on. There are still some failures. Some of them
due to stack protector. I disabled stack protector and then those stack
protector error gone. But some other errors show up like below:

[   27.186581] kernel tried to execute NX-protected page - exploit
attempt? (uid: 0)
[   27.187480] BUG: unable to handle page fault for address:
ffff888109572800
[   27.188299] #PF: supervisor instruction fetch in kernel mode
[   27.189085] #PF: error_code(0x0011) - permissions violation

or

[   27.736844] BUG: unable to handle page fault for address:
0000000080000000
[   27.737759] #PF: supervisor instruction fetch in kernel mode
[   27.738631] #PF: error_code(0x0010) - not-present page
[   27.739455] PGD 0 P4D 0
[   27.739818] Oops: Oops: 0010 [#1] PREEMPT SMP PTI

...

Some further investigations are needed.
I found one failure case (with stackprotector disabled):

[   20.032611] traps: PANIC: double fault, error_code: 0x0
[   20.032615] Oops: double fault: 0000 [#1] PREEMPT SMP PTI
[   20.032619] CPU: 0 UID: 0 PID: 1959 Comm: test_progs Tainted: G           OE      6.11.0-10576-g17baa0096769-dirty #1006
[   20.032623] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[   20.032624] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[   20.032626] RIP: 0010:error_entry+0x17/0x140
[   20.032633] Code: ff 0f 01 f8 e9 56 fe ff ff 90 90 90 90 90 90 90 90 90 90 56 48 8b 74 24 08 48 89 7c 24 08 52 51 50 41 50 41 51 41 52 49
[   20.032635] RSP: 0018:ffffe8ffff400000 EFLAGS: 00010093
[   20.032637] RAX: ffffe8ffff4000a8 RBX: ffffe8ffff4000a8 RCX: ffffffff82201737
[   20.032639] RDX: 0000000000000000 RSI: ffffffff8220128d RDI: ffffe8ffff4000a8
[   20.032640] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[   20.032641] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[   20.032642] R13: 0000000000000000 R14: 000000000002ed80 R15: 0000000000000000
[   20.032643] FS:  00007f8a3a2006c0(0000) GS:ffff888237c00000(0000) knlGS:ffff888237c00000
[   20.032645] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   20.032646] CR2: ffffe8ffff3ffff8 CR3: 0000000103580002 CR4: 0000000000370ef0
[   20.032649] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   20.032650] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   20.032651] Call Trace:
[   20.032660]  <#DF>
[   20.032664]  ? __die_body+0xaf/0xc0
[   20.032667]  ? die+0x2f/0x50
[   20.032670]  ? exc_double_fault+0xbf/0xd0
[   20.032674]  ? asm_exc_double_fault+0x23/0x30
[   20.032678]  ? restore_regs_and_return_to_kernel+0x1b/0x1b
[   20.032681]  ? asm_exc_page_fault+0xd/0x30
[   20.032684]  ? error_entry+0x17/0x140
[   20.032687]  </#DF>

The private stack for cpu 0:
     priv_stack_ptr cpu 0 = [ffffe8ffff434000, ffffe8ffff438000] (total 16KB)
That is, the top stack is ffffe8ffff438000 and the bottom stack is ffffe8ffff434000.

During bpf execution, a softirq may happen, at that point,
stack pointer becomes:
      RSP: 0018:ffffe8ffff400000 (see above)
and there is a read/write (mostly write) to address
      CR2: ffffe8ffff3ffff8
And this may cause a fault.
After this fault, there are some further access and probably because
of invalid stack, double fault happens.

So the quesiton is why RSP is reset to ffffe8ffff400000?
0x38000 bytes consumed by stack or rounded down?
That's unlikely.

I have not figured out which code changed this? Maybe somebody can help?
As Kumar said earlier pls share the patch. Link to github? or whichever.

Double check that any kind of tail-call logic is not mixed with priv stack.
Here is the reproducer. Two attached files:
priv_stack.config: the config file to build the kernel
0001-bpf-implement-private-stack.patch: the patch to apply to the top of bpf-next.

The top bpf-next commit in my test:
commit 9502a7de5a61bec3bda841a830560c5d6d40ecac (origin/master, origin/HEAD, master)
Author: Mykyta Yatsenko <yatsenko@xxxxxxxx>
Date:   Tue Oct 1 00:15:22 2024 +0100

      selftests/bpf: Emit top frequent code lines in veristat


I am using clang18 to build the kernel and selftests.
The build command line:
    make LLVM=1 -j
    make -C tools/testing/selftests/bpf LLVM=1 -j

In qemu vm, tools/testing/selftests/bpf directory, run the following script:

$ cat run.sh
cat /proc/sys/net/core/bpf_jit_limit
echo 796917760 > /proc/sys/net/core/bpf_jit_limit
# ./test_progs -n 339/4
./test_progs -t task_local_storage/nodeadlock

With private stack on by default, in my environment, booting will failure.
So by default, private stack is off.
In the above
    echo 796917760 > /proc/sys/net/core/bpf_jit_limit
intends to enable private stack.
well, the patch is written in a way that
   cat /proc/sys/net/core/bpf_jit_limit
is enough to enable priv stack.
Ok. Good to know.


+       if (yhs && !is_subprog && !tail_call_reachable && bpf_prog->aux->priv_stack_ptr) {
+               EMIT3_off32(0x48, 0x81, 0xEC, round_up(stack_depth, 8) + 16);
this part is unnecessary.
Just setting rsp to the top is enough to start doing things

This is a hack. I run out of ideas so added some hack above for debugging purpose.


in a new stack, but I'm afraid this is a dead end.
I've played with orc and frame pointer unwders.
Both are not happy with our hack.
get_stack_info() returns unknown, so any logic that needs to collect
the stack gets into the loop that exhausts our broken stack and
it eventually dies with NULL deref or double fault.
The simplest way to repro the brokeness is with:

@@ -255,6 +255,8 @@ BPF_CALL_5(bpf_task_storage_get, struct bpf_map *,
map, struct task_struct *,
         if (flags & ~BPF_LOCAL_STORAGE_GET_F_CREATE || !task)
                 return (unsigned long)NULL;

+       dump_stack();

Indeed this won't work and the stack trace cannot be properly
presented due to private stack range is outside the default
kernel range.

Without dump_stack(), there are some other unknown errors.

+       return 0;

(in addition to your patch that calls it from
task_local_storage/nodeadlock test).
It will loop like:
[   17.708612] bad stack ffffe8fffd620000
[   17.714759] CPU: 0 UID: 0 PID: 2186 Comm: test_progs Not tainted
6.12.0-rc1-00162-g3fe52dc7912d-dirty #6365
[   17.715692] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[   17.715692] Call Trace:
[   17.715692] bad stack ffffe8fffd620000
[   17.721814] CPU: 0 UID: 0 PID: 2182 Comm: test_progs Not tainted
6.12.0-rc1-00162-g3fe52dc7912d-dirty #6365
[   17.722763] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[   17.722763] Call Trace:
[   17.722763] bad stack ffffe8fffd620000
[   17.728865] CPU: 0 UID: 0 PID: 2156 Comm: test_progs Not tainted
6.12.0-rc1-00162-g3fe52dc7912d-dirty #6365
[   17.729826] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014

To make it less broken it requires a bunch of work in
arch/x86/kernel/dumpstack.c
I tried to hack things up, but wasn't successful.
Even if we do succeed eventually this is too risky.
We don't have a strong reason to introduce another stack type
along with all the delicate logic to recognize it based on rsp
value alone. If our percpu stack is per prog the get_stack_info()
logic gets complicated and that's not acceptable.
We need to scrap this idea.
Let's go back to push/pop r11 around calls :(




[Index of Archives]     [Linux Samsung SoC]     [Linux Rockchip SoC]     [Linux Actions SoC]     [Linux for Synopsys ARC Processors]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]


  Powered by Linux