On 09/05/2023 22:21, Alexei Starovoitov wrote: > On Tue, May 9, 2023 at 11:44 AM Alan Maguire <alan.maguire@xxxxxxxxxx> wrote: >> >> On 02/05/2023 04:40, Alexei Starovoitov wrote: >>> Alan, >>> >>> wdyt on below? >>> >> >> apologies, missed this; see below.. >> >>> On Thu, Apr 27, 2023 at 4:35 AM Yafang Shao <laoar.shao@xxxxxxxxx> wrote: >>>> >>>> On Tue, Apr 25, 2023 at 5:13 AM Alexei Starovoitov >>>> <alexei.starovoitov@xxxxxxxxx> wrote: >>>>> >>>>> On Mon, Apr 24, 2023 at 9:12 AM Yafang Shao <laoar.shao@xxxxxxxxx> wrote: >>>>>> >>>>>> The kernel will panic as follows when attaching fexit to mm_init, >>>>>> >>>>>> [ 86.549700] ------------[ cut here ]------------ >>>>>> [ 86.549712] BUG: kernel NULL pointer dereference, address: 0000000000000078 >>>>>> [ 86.549713] #PF: supervisor read access in kernel mode >>>>>> [ 86.549715] #PF: error_code(0x0000) - not-present page >>>>>> [ 86.549716] PGD 10308f067 P4D 10308f067 PUD 11754e067 PMD 0 >>>>>> [ 86.549719] Oops: 0000 [#1] PREEMPT SMP NOPTI >>>>>> [ 86.549722] CPU: 9 PID: 9829 Comm: main_amd64 Kdump: loaded Not tainted 6.3.0-rc6+ #12 >>>>>> [ 86.549725] RIP: 0010:check_preempt_wakeup+0xd1/0x310 >>>>>> [ 86.549754] Call Trace: >>>>>> [ 86.549755] <TASK> >>>>>> [ 86.549757] check_preempt_curr+0x5e/0x70 >>>>>> [ 86.549761] ttwu_do_activate+0xab/0x350 >>>>>> [ 86.549763] try_to_wake_up+0x314/0x680 >>>>>> [ 86.549765] wake_up_process+0x15/0x20 >>>>>> [ 86.549767] insert_work+0xb2/0xd0 >>>>>> [ 86.549772] __queue_work+0x20a/0x400 >>>>>> [ 86.549774] queue_work_on+0x7b/0x90 >>>>>> [ 86.549778] drm_fb_helper_sys_imageblit+0xd7/0xf0 [drm_kms_helper] >>>>>> [ 86.549801] drm_fbdev_fb_imageblit+0x5b/0xb0 [drm_kms_helper] >>>>>> [ 86.549813] soft_cursor+0x1cb/0x250 >>>>>> [ 86.549816] bit_cursor+0x3ce/0x630 >>>>>> [ 86.549818] fbcon_cursor+0x139/0x1c0 >>>>>> [ 86.549821] ? __pfx_bit_cursor+0x10/0x10 >>>>>> [ 86.549822] hide_cursor+0x31/0xd0 >>>>>> [ 86.549825] vt_console_print+0x477/0x4e0 >>>>>> [ 86.549828] console_flush_all+0x182/0x440 >>>>>> [ 86.549832] console_unlock+0x58/0xf0 >>>>>> [ 86.549834] vprintk_emit+0x1ae/0x200 >>>>>> [ 86.549837] vprintk_default+0x1d/0x30 >>>>>> [ 86.549839] vprintk+0x5c/0x90 >>>>>> [ 86.549841] _printk+0x58/0x80 >>>>>> [ 86.549843] __warn_printk+0x7e/0x1a0 >>>>>> [ 86.549845] ? trace_preempt_off+0x1b/0x70 >>>>>> [ 86.549848] ? trace_preempt_on+0x1b/0x70 >>>>>> [ 86.549849] ? __percpu_counter_init+0x8e/0xb0 >>>>>> [ 86.549853] refcount_warn_saturate+0x9f/0x150 >>>>>> [ 86.549855] mm_init+0x379/0x390 >>>>>> [ 86.549859] bpf_trampoline_6442453440_0+0x23/0x1000 >>>>>> [ 86.549862] mm_init+0x5/0x390 >>>>>> [ 86.549865] ? mm_alloc+0x4e/0x60 >>>>>> [ 86.549866] alloc_bprm+0x8a/0x2e0 >>>>>> [ 86.549869] do_execveat_common.isra.0+0x67/0x240 >>>>>> [ 86.549872] __x64_sys_execve+0x37/0x50 >>>>>> [ 86.549874] do_syscall_64+0x38/0x90 >>>>>> [ 86.549877] entry_SYSCALL_64_after_hwframe+0x72/0xdc >>>>>> >>>>>> The reason is that when we attach the btf id of the function mm_init we >>>>>> actually attach the mm_init defined in init/main.c rather than the >>>>>> function defined in kernel/fork.c. That can be proved by parsing >>>>>> /sys/kernel/btf/vmlinux: >>>>>> >>>>>> [2493] FUNC 'initcall_blacklist' type_id=2477 linkage=static >>>>>> [2494] FUNC_PROTO '(anon)' ret_type_id=21 vlen=1 >>>>>> 'buf' type_id=57 >>>>>> [2495] FUNC 'early_randomize_kstack_offset' type_id=2494 linkage=static >>>>>> [2496] FUNC 'mm_init' type_id=118 linkage=static >>>>>> [2497] FUNC 'trap_init' type_id=118 linkage=static >>>>>> [2498] FUNC 'thread_stack_cache_init' type_id=118 linkage=static >>>>>> >>>>>> From the above information we can find that the FUNCs above and below >>>>>> mm_init are all defined in init/main.c. So there's no doubt that the >>>>>> mm_init is also the function defined in init/main.c. >>>>>> >>>>>> So when a task calls mm_init and thus the bpf trampoline is triggered it >>>>>> will use the information of the mm_init defined in init/main.c. Then the >>>>>> panic will occur. >>>>>> >>>>>> It seems that there're issues in btf, for example it is unnecessary to >>>>>> generate btf for the functions annonated with __init. We need to improve >>>>>> btf. However we also need to change the function defined in >>>>>> kernel/fork.c to task_mm_init to better distinguish them. After it is >>>>>> renamed to task_mm_init, the /sys/kernel/btf/vmlinux will be: >>>>>> >>>>>> [13970] FUNC 'mm_alloc' type_id=13969 linkage=static >>>>>> [13971] FUNC_PROTO '(anon)' ret_type_id=204 vlen=3 >>>>>> 'mm' type_id=204 >>>>>> 'p' type_id=197 >>>>>> 'user_ns' type_id=452 >>>>>> [13972] FUNC 'task_mm_init' type_id=13971 linkage=static >>>>>> [13973] FUNC 'coredump_filter_setup' type_id=3804 linkage=static >>>>>> [13974] FUNC_PROTO '(anon)' ret_type_id=197 vlen=2 >>>>>> 'orig' type_id=197 >>>>>> 'node' type_id=21 >>>>>> [13975] FUNC 'dup_task_struct' type_id=13974 linkage=static >>>>>> >>>>>> And then attaching task_mm_init won't panic. Improving the btf will be >>>>>> handled later. >>>>> >>>>> We're not going to hack the kernel to workaround pahole issue. >>>>> Let's fix pahole instead. >>>>> cc-ing Alan for ideas. >>>> >>>> Any comment on it, Alan ? >>>> I think we can just skip generating BTF for the functions in >>>> __section(".init.text"), as these functions will be freed after >>>> kernel init. There won't be use cases for them. >>>> >> >> won't the pahole v1.25 changes help here; can you try applying >> >> https://lore.kernel.org/bpf/1675949331-27935-1-git-send-email-alan.maguire@xxxxxxxxxx/ >> >> ...and build using pahole; this should eliminate any functions >> with inconsistent prototypes via >> >> --skip_encoding_btf_inconsistent_proto >> Do not encode functions with multiple inconsistent prototypes or >> unexpected register use for their parameters, where the regis‐ >> ters used do not match calling conventions. >> >> >> I'll check this at my end too. >> >> Alexei, if this works should we look at applying the above >> again to bpf-next? If so I'll resend the patch. > > I've lost the track with pahole fixes. > Did Arnaldo re-tag pahole 1.25 or released 1.26 with the fixes? > My understanding is the pahole repo is prepped for v1.25, meaning that it will build with version 1.25 but it is not officially released yet; Arnaldo will correct me if I've got that wrong. > Alan, > please submit a fresh patch for bpf-next to enable > --skip_encoding_btf_inconsistent_proto, so it can go through CI. > I cannot test all combinations manually. > Done; see [1]. If CI picks up the latest version from the dwarves repo, it will see version 1.25. I've tested the above specific case along with general testing using latest pahole. When running pahole with --verbose and --skip_encoding_btf_inconsistent_proto we see: skipping addition of 'mm_init'(mm_init) due to multiple inconsistent function prototypes ...and bpftool btf dump file vmlinux |grep mm_init shows the function is not encoded in BTF due to these inconsistencies. Thanks! Alan [1] https://lore.kernel.org/bpf/20230510130241.1696561-1-alan.maguire@xxxxxxxxxx/