Re: [PATCH bpf-next 1/4] bpf, x64: Fix tailcall infinite loop caused by freplace

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Sun, 2024-08-25 at 21:09 +0800, Leon Hwang wrote:
> This patch fixes a tailcall infinite loop issue caused by freplace.
> 
> Since commit 1c123c567fb1 ("bpf: Resolve fext program type when checking map compatibility"),
> freplace prog is allowed to tail call its target prog. Then, when a
> freplace prog attaches to its target prog's subprog and tail calls its
> target prog, kernel will panic.
> 
> For example:
> 
> tc_bpf2bpf.c:
> 
> // SPDX-License-Identifier: GPL-2.0
> \#include <linux/bpf.h>
> \#include <bpf/bpf_helpers.h>
> 
> __noinline
> int subprog_tc(struct __sk_buff *skb)
> {
> 	return skb->len * 2;
> }
> 
> SEC("tc")
> int entry_tc(struct __sk_buff *skb)
> {
> 	return subprog_tc(skb);
> }
> 
> char __license[] SEC("license") = "GPL";
> 
> tailcall_bpf2bpf_hierarchy_freplace.c:
> 
> // SPDX-License-Identifier: GPL-2.0
> \#include <linux/bpf.h>
> \#include <bpf/bpf_helpers.h>
> 
> struct {
> 	__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
> 	__uint(max_entries, 1);
> 	__uint(key_size, sizeof(__u32));
> 	__uint(value_size, sizeof(__u32));
> } jmp_table SEC(".maps");
> 
> int count = 0;
> 
> static __noinline
> int subprog_tail(struct __sk_buff *skb)
> {
> 	bpf_tail_call_static(skb, &jmp_table, 0);
> 	return 0;
> }
> 
> SEC("freplace")
> int entry_freplace(struct __sk_buff *skb)
> {
> 	count++;
> 	subprog_tail(skb);
> 	subprog_tail(skb);
> 	return count;
> }
> 
> char __license[] SEC("license") = "GPL";
> 
> The attach target of entry_freplace is subprog_tc, and the tail callee
> in subprog_tail is entry_tc.
> 
> Then, the infinite loop will be entry_tc -> subprog_tc -> entry_freplace
> -> subprog_tail --tailcall-> entry_tc, because tail_call_cnt in
> entry_freplace will count from zero for every time of entry_freplace
> execution. Kernel will panic:
> 
> [   15.310490] BUG: TASK stack guard page was hit at (____ptrval____)
> (stack is (____ptrval____)..(____ptrval____))
> [   15.310490] Oops: stack guard page: 0000 [#1] PREEMPT SMP NOPTI
> [   15.310490] CPU: 1 PID: 89 Comm: test_progs Tainted: G           OE
>    6.10.0-rc6-g026dcdae8d3e-dirty #72
> [   15.310490] Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX,
> 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
> [   15.310490] RIP: 0010:bpf_prog_3a140cef239a4b4f_subprog_tail+0x14/0x53
> [   15.310490] Code: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
> cc cc cc cc cc f3 0f 1e fa 0f 1f 44 00 00 0f 1f 00 55 48 89 e5 f3 0f 1e
> fa <50> 50 53 41 55 48 89 fb 49 bd 00 2a 46 82 98 9c ff ff 48 89 df 4c
> [   15.310490] RSP: 0018:ffffb500c0aa0000 EFLAGS: 00000202
> [   15.310490] RAX: ffffb500c0aa0028 RBX: ffff9c98808b7e00 RCX:
> 0000000000008cb5
> [   15.310490] RDX: 0000000000000000 RSI: ffff9c9882462a00 RDI:
> ffff9c98808b7e00
> [   15.310490] RBP: ffffb500c0aa0000 R08: 0000000000000000 R09:
> 0000000000000000
> [   15.310490] R10: 0000000000000001 R11: 0000000000000000 R12:
> ffffb500c01af000
> [   15.310490] R13: ffffb500c01cd000 R14: 0000000000000000 R15:
> 0000000000000000
> [   15.310490] FS:  00007f133b665140(0000) GS:ffff9c98bbd00000(0000)
> knlGS:0000000000000000
> [   15.310490] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   15.310490] CR2: ffffb500c0a9fff8 CR3: 0000000102478000 CR4:
> 00000000000006f0
> [   15.310490] Call Trace:
> [   15.310490]  <#DF>
> [   15.310490]  ? die+0x36/0x90
> [   15.310490]  ? handle_stack_overflow+0x4d/0x60
> [   15.310490]  ? exc_double_fault+0x117/0x1a0
> [   15.310490]  ? asm_exc_double_fault+0x23/0x30
> [   15.310490]  ? bpf_prog_3a140cef239a4b4f_subprog_tail+0x14/0x53
> [   15.310490]  </#DF>
> [   15.310490]  <TASK>
> [   15.310490]  bpf_prog_85781a698094722f_entry+0x4c/0x64
> [   15.310490]  bpf_prog_1c515f389a9059b4_entry2+0x19/0x1b
> [   15.310490]  ...
> [   15.310490]  bpf_prog_85781a698094722f_entry+0x4c/0x64
> [   15.310490]  bpf_prog_1c515f389a9059b4_entry2+0x19/0x1b
> [   15.310490]  bpf_test_run+0x210/0x370
> [   15.310490]  ? bpf_test_run+0x128/0x370
> [   15.310490]  bpf_prog_test_run_skb+0x388/0x7a0
> [   15.310490]  __sys_bpf+0xdbf/0x2c40
> [   15.310490]  ? clockevents_program_event+0x52/0xf0
> [   15.310490]  ? lock_release+0xbf/0x290
> [   15.310490]  __x64_sys_bpf+0x1e/0x30
> [   15.310490]  do_syscall_64+0x68/0x140
> [   15.310490]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [   15.310490] RIP: 0033:0x7f133b52725d
> [   15.310490] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa
> 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f
> 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 8b bb 0d 00 f7 d8 64 89 01 48
> [   15.310490] RSP: 002b:00007ffddbc10258 EFLAGS: 00000206 ORIG_RAX:
> 0000000000000141
> [   15.310490] RAX: ffffffffffffffda RBX: 00007ffddbc10828 RCX:
> 00007f133b52725d
> [   15.310490] RDX: 0000000000000050 RSI: 00007ffddbc102a0 RDI:
> 000000000000000a
> [   15.310490] RBP: 00007ffddbc10270 R08: 0000000000000000 R09:
> 00007ffddbc102a0
> [   15.310490] R10: 0000000000000064 R11: 0000000000000206 R12:
> 0000000000000004
> [   15.310490] R13: 0000000000000000 R14: 0000558ec4c24890 R15:
> 00007f133b6ed000
> [   15.310490]  </TASK>
> [   15.310490] Modules linked in: bpf_testmod(OE)
> [   15.310490] ---[ end trace 0000000000000000 ]---
> [   15.310490] RIP: 0010:bpf_prog_3a140cef239a4b4f_subprog_tail+0x14/0x53
> [   15.310490] Code: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
> cc cc cc cc cc f3 0f 1e fa 0f 1f 44 00 00 0f 1f 00 55 48 89 e5 f3 0f 1e
> fa <50> 50 53 41 55 48 89 fb 49 bd 00 2a 46 82 98 9c ff ff 48 89 df 4c
> [   15.310490] RSP: 0018:ffffb500c0aa0000 EFLAGS: 00000202
> [   15.310490] RAX: ffffb500c0aa0028 RBX: ffff9c98808b7e00 RCX:
> 0000000000008cb5
> [   15.310490] RDX: 0000000000000000 RSI: ffff9c9882462a00 RDI:
> ffff9c98808b7e00
> [   15.310490] RBP: ffffb500c0aa0000 R08: 0000000000000000 R09:
> 0000000000000000
> [   15.310490] R10: 0000000000000001 R11: 0000000000000000 R12:
> ffffb500c01af000
> [   15.310490] R13: ffffb500c01cd000 R14: 0000000000000000 R15:
> 0000000000000000
> [   15.310490] FS:  00007f133b665140(0000) GS:ffff9c98bbd00000(0000)
> knlGS:0000000000000000
> [   15.310490] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   15.310490] CR2: ffffb500c0a9fff8 CR3: 0000000102478000 CR4:
> 00000000000006f0
> [   15.310490] Kernel panic - not syncing: Fatal exception in interrupt
> [   15.310490] Kernel Offset: 0x30000000 from 0xffffffff81000000
> (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> 
> This patch fixes the issue by initializing tail_call_cnt at the prologue
> of entry_tc.
> 
> Next, when call subprog_tc, the tail_call_cnt_ptr is propagated to
> subprog_tc by rax.
> Next, when jump to entry_freplace, the tail_call_cnt_ptr will be reused to
> count tailcall in freplace prog.
> Next, when call subprog_tail, the tail_call_cnt_ptr is propagated to
> subprog_tail by rax.
> Next, while tail calling to entry_tc, the tail_call_cnt on the stack of
> entry_tc increments via the tail_call_cnt_ptr.
> 
> The whole procedure shows as the following JITed prog dumping.
> 
> bpftool p d j n entry_tc:
> 
> int entry_tc(struct __sk_buff * skb):
> bpf_prog_1c515f389a9059b4_entry_tc:
> ; return subprog_tc(skb);
>    0:	endbr64
>    4:	xorq	%rax, %rax		;; rax = 0 (tail_call_cnt)
>    7:	nopl	(%rax,%rax)
>    c:	pushq	%rbp
>    d:	movq	%rsp, %rbp
>   10:	endbr64
>   14:	cmpq	$33, %rax		;; if rax > 33, rax = tcc_ptr
>   18:	ja	0x20			;; if rax > 33 goto 0x20 ---+
>   1a:	pushq	%rax			;; [rbp - 8] = rax = 0      |
>   1b:	movq	%rsp, %rax		;; rax = rbp - 8            |
>   1e:	jmp	0x21			;; ---------+               |
>   20:	pushq	%rax			;; <--------|---------------+
>   21:	pushq	%rax			;; <--------+ [rbp - 16] = rax
>   22:	movq	-16(%rbp), %rax		;; rax = tcc_ptr
>   29:	callq	0x70			;; call subprog_tc()
> ; return subprog_tc(skb);
>   2e:	leave
>   2f:	retq
> 
> int subprog_tc(struct __sk_buff * skb):
> bpf_prog_1e8f76e2374a0607_subprog_tc:
> ; return skb->len * 2;
>    0:	endbr64
>    4:	nopl	(%rax)			;; do not touch tail_call_cnt
>    7:	jmp	0x108			;; jump to entry_freplace()
>    c:	pushq	%rbp
>    d:	movq	%rsp, %rbp
>   10:	endbr64
>   14:	pushq	%rax
>   15:	pushq	%rax
>   16:	movl	112(%rdi), %eax
> ; return skb->len * 2;
>   19:	shll	%eax
> ; return skb->len * 2;
>   1b:	leave
>   1c:	retq
> 
> bpftool p d j n entry_freplace:
> 
> int entry_freplace(struct __sk_buff * skb):
> bpf_prog_85781a698094722f_entry_freplace:
> ; int entry_freplace(struct __sk_buff *skb)
>    0:	endbr64
>    4:	nopl	(%rax)			;; do not touch tail_call_cnt
>    7:	nopl	(%rax,%rax)
>    c:	pushq	%rbp
>    d:	movq	%rsp, %rbp
>   10:	endbr64
>   14:	cmpq	$33, %rax		;; if rax > 33, rax = tcc_ptr
>   18:	ja	0x20			;; if rax > 33 goto 0x20 ---+
>   1a:	pushq	%rax			;; [rbp - 8] = rax = 0      |
>   1b:	movq	%rsp, %rax		;; rax = rbp - 8            |
>   1e:	jmp	0x21			;; ---------+               |
>   20:	pushq	%rax			;; <--------|---------------+
>   21:	pushq	%rax			;; <--------+ [rbp - 16] = rax
>   22:	pushq	%rbx			;; callee saved
>   23:	pushq	%r13			;; callee saved
>   25:	movq	%rdi, %rbx		;; rbx = skb (callee saved)
> ; count++;
>   28:	movabsq	$-123406219759616, %r13
>   32:	movl	(%r13), %edi
>   36:	addl	$1, %edi
>   39:	movl	%edi, (%r13)
> ; subprog_tail(skb);
>   3d:	movq	%rbx, %rdi		;; rdi = skb
>   40:	movq	-16(%rbp), %rax		;; rax = tcc_ptr
>   47:	callq	0x80			;; call subprog_tail()
> ; subprog_tail(skb);
>   4c:	movq	%rbx, %rdi		;; rdi = skb
>   4f:	movq	-16(%rbp), %rax		;; rax = tcc_ptr
>   56:	callq	0x80			;; call subprog_tail()
> ; return count;
>   5b:	movl	(%r13), %eax
> ; return count;
>   5f:	popq	%r13
>   61:	popq	%rbx
>   62:	leave
>   63:	retq
> 
> int subprog_tail(struct __sk_buff * skb):
> bpf_prog_3a140cef239a4b4f_subprog_tail:
> ; int subprog_tail(struct __sk_buff *skb)
>    0:	endbr64
>    4:	nopl	(%rax)			;; do not touch tail_call_cnt
>    7:	nopl	(%rax,%rax)
>    c:	pushq	%rbp
>    d:	movq	%rsp, %rbp
>   10:	endbr64
>   14:	pushq	%rax			;; [rbp - 8]  = rax (tcc_ptr)
>   15:	pushq	%rax			;; [rbp - 16] = rax (tcc_ptr)
>   16:	pushq	%rbx			;; callee saved
>   17:	pushq	%r13			;; callee saved
>   19:	movq	%rdi, %rbx		;; rbx = skb
> ; asm volatile("r1 = %[ctx]\n\t"
>   1c:	movabsq	$-128494642337280, %r13	;; r13 = jmp_table
>   26:	movq	%rbx, %rdi		;; 1st arg, skb
>   29:	movq	%r13, %rsi		;; 2nd arg, jmp_table
>   2c:	xorl	%edx, %edx		;; 3rd arg, index = 0
>   2e:	movq	-16(%rbp), %rax		;; rax = [rbp - 16] (tcc_ptr)
>   35:	cmpq	$33, (%rax)
>   39:	jae	0x4e			;; if *tcc_ptr >= 33 goto 0x4e --------+
>   3b:	nopl	(%rax,%rax)		;; jmp bypass, toggled by poking       |
>   40:	addq	$1, (%rax)		;; (*tcc_ptr)++                        |
>   44:	popq	%r13			;; callee saved                        |
>   46:	popq	%rbx			;; callee saved                        |
>   47:	popq	%rax			;; undo rbp-16 push                    |
>   48:	popq	%rax			;; undo rbp-8  push                    |
>   49:	jmp	0xfffffffffffffe18	;; tail call target, toggled by poking |
> ; return 0;				;;                                     |
>   4e:	popq	%r13			;; restore callee saved <--------------+
>   50:	popq	%rbx			;; restore callee saved
>   51:	leave
>   52:	retq
> 
> As a result, the tail_call_cnt is stored on the stack of entry_tc. And
> the tail_call_cnt_ptr is propagated between subprog_tc, entry_freplace,
> subprog_tail and entry_tc.
> 
> Furthermore, trampoline is required to propagate
> tail_call_cnt/tail_call_cnt_ptr always, no matter whether there is
> tailcall at run time.
> So, it reuses trampoline flag "BIT(7)" to tell trampoline to propagate
> the tail_call_cnt/tail_call_cnt_ptr, as BPF_TRAMP_F_TAIL_CALL_CTX is not
> used by any other arch BPF JIT.

This change seem to be correct.
Could you please add an explanation somewhere why nop3/xor and nop5
are swapped in the prologue?

As far as I understand, this is done so that freplace program
would reuse xor generated for replaced program, is that right?
E.g. for tailcall_bpf2bpf_freplace test case disasm looks as follows:

--------------- entry_tc --------------
func #0:
0:	f3 0f 1e fa                         	endbr64
4:	48 31 c0                            	xorq	%rax, %rax
7:	0f 1f 44 00 00                      	nopl	(%rax,%rax)
c:	55                                  	pushq	%rbp
d:	48 89 e5                            	movq	%rsp, %rbp
10:	f3 0f 1e fa                         	endbr64
...

------------ entry_freplace -----------
func #0:
0:	f3 0f 1e fa                         	endbr64
4:	0f 1f 00                            	nopl	(%rax)
7:	0f 1f 44 00 00                      	nopl	(%rax,%rax)
c:	55                                  	pushq	%rbp
d:	48 89 e5                            	movq	%rsp, %rbp
...

So, if entry_freplace would be used to replace entry_tc instead
of subprog_tc, the disasm would change to: 

--------------- entry_tc --------------
func #0:
0:	f3 0f 1e fa                         	endbr64
4:	48 31 c0                            	xorq	%rax, %rax
7:	0f 1f 44 00 00                      	jmp <entry_freplace>

Thus reusing %rax initialization from entry_tc.

> However, the bad effect is that it requires initializing tail_call_cnt at
> prologue always no matter whether the prog is tail_call_reachable, because
> it is unable to confirm itself or its subprog[s] whether to be attached by
> freplace prog.
> And, when call subprog, tail_call_cnt_ptr is required to be propagated
> to subprog always.

This seems unfortunate.
I wonder if disallowing to freplace programs when
replacement.tail_call_reachable != replaced.tail_call_reachable
would be a better option?

[...]





[Index of Archives]     [Linux Samsung SoC]     [Linux Rockchip SoC]     [Linux Actions SoC]     [Linux for Synopsys ARC Processors]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]


  Powered by Linux