On Fri, Jan 24, 2025 at 11:11 AM Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
>
> On Thu, Jan 23, 2025 at 11:55 PM Petr Mladek <pmladek@xxxxxxxx> wrote:
> >
> > On Wed 2025-01-22 16:56:31, Petr Mladek wrote:
> > > Anyway, I am working on a POC which would allow tracking of
> > > to-be-released processes. It would finish the transition only
> > > when all the to-be-released processes already use the new code.
> > > It won't allow removing the disabled livepatch prematurely.
> >
> > Here is the POC. I am not sure whether it is the right path; see
> > the warning and the open questions in the commit message.
> >
> > I am going to wait for some feedback before investing more time
> > into this.
> >
> > The patch is against current Linus' master, aka v6.13-rc7.
> >
> > From ac7287d99aaeca7a4536e8ade61b9bd0c8ec7fdc Mon Sep 17 00:00:00 2001
> > From: Petr Mladek <pmladek@xxxxxxxx>
> > Date: Thu, 23 Jan 2025 09:04:09 +0100
> > Subject: [PATCH] livepatching: Block transition until
> >  delayed_put_task_struct()
> >
> > WARNING: This patch is just a POC. It will blow up the system because
> >          RCU callbacks are handled in softirq context, which by
> >          default runs on exit from IRQ handlers, and it is not
> >          allowed to take sleeping locks there; see the backtrace
> >          at the end of the commit message.
> >
> >          We would need to synchronize the counting of the exiting
> >          processes with the livepatch transition another way.
> >
> >          Hmm, I guess that a spin_lock is legal in softirq context.
> >          It might be the easiest approach.
> >
> >          In the worst case, we would need to use a lockless
> >          algorithm, which might make things even more complicated.
> >
> > Here is the description of the problem and the solution:
> >
> > The livepatching core code uses a for_each_process_thread() loop for
> > setting and checking the state of processes on the system. It works
> > well as long as the livepatch touches only code which is used only
> > by processes in the task list.
> >
> > The problem starts when the livepatch replaces code which might be
> > used by a process which is no longer in the task list. It is mostly
> > about processes which are being removed. They disappear from the
> > list here:
> >
> >   + release_task()
> >     + __exit_signal()
> >       + __unhash_process()
> >
> > There are basically two groups of problems:
> >
> > 1. The livepatching core no longer updates TIF_PATCH_PENDING
> >    and p->patch_state. In this case, the ftrace handler
> >    klp_check_stack_func() might make a wrong decision and
> >    use an incompatible variant of the code.
>
> I believe I was able to reproduce the issue while attempting to
> trigger the panic. The warning occurred at the following location:
>
>     klp_ftrace_handler
>         if (unlikely(func->transition)) {
>             WARN_ON_ONCE(patch_state == KLP_UNDEFINED);
>         }
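For reference, here is a simplified sketch of the decision logic around
that WARN_ON_ONCE() in klp_ftrace_handler() (kernel/livepatch/patch.c).
It is my own abridged illustration, not the exact code: the real handler
receives (ip, parent_ip, fops, fregs), looks up ops and func via ftrace,
and runs under RCU. The quoted warning log continues below.

    /* Sketch only: pick which variant of a patched function to run. */
    static void notrace klp_ftrace_handler_sketch(struct klp_ops *ops,
                                                  struct klp_func *func,
                                                  struct ftrace_regs *fregs)
    {
            /* func is the newest variant on ops->func_stack. */
            if (unlikely(func->transition)) {
                    int patch_state = current->patch_state;

                    /*
                     * A task that already went through __unhash_process()
                     * is never visited by the transition code again, so
                     * its patch_state may still be KLP_UNDEFINED here.
                     */
                    WARN_ON_ONCE(patch_state == KLP_UNDEFINED);

                    if (patch_state == KLP_UNPATCHED) {
                            /* Fall back to the previous variant. */
                            func = list_next_entry(func, stack_node);
                            if (&func->stack_node == &ops->func_stack)
                                    return; /* run the original function */
                    }
            }

            /* Redirect execution to the chosen variant. */
            ftrace_regs_set_instruction_pointer(fregs,
                                        (unsigned long)func->new_func);
    }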
> The warning message is as follows:
>
> [58063.291589] livepatch: enabling patch 'livepatch_61_release12'
> [58063.297580] livepatch: 'livepatch_61_release12': starting patching transition
> [58063.323340] ------------[ cut here ]------------
> [58063.323343] WARNING: CPU: 58 PID: 3851051 at kernel/livepatch/patch.c:98 klp_ftrace_handler+0x136/0x150
> [58063.323349] Modules linked in: livepatch_61_release12(OK) livepatch_61_release6(OK) ebtable_filter ebtables af_packet_diag netlink_diag xt_DSCP xt_owner iptable_mangle raw_diag unix_diag udp_diag iptable_raw xt_CT tcp_diag inet_diag cls_bpf sch_ingress bpf_preload binfmt_misc iptable_filter bpfilter xt_conntrack nf_nat nf_conntrack_netlink nfnetlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 overlay af_packet bonding tls intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common isst_if_common skx_edac nfit x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass rapl vfat intel_cstate fat iTCO_wdt xfs intel_uncore pcspkr ses enclosure input_leds i2c_i801 i2c_smbus mei_me lpc_ich ioatdma mei mfd_core intel_pch_thermal dca acpi_ipmi wmi ipmi_si ipmi_devintf ipmi_msghandler acpi_cpufreq acpi_pad acpi_power_meter ip_tables ext4 mbcache jbd2 sd_mod sg mpt3sas raid_class scsi_transport_sas megaraid_sas crct10dif_pclmul crc32_pclmul crc32c_intel
> [58063.323402]  polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 aesni_intel nvme crypto_simd cryptd nvme_core t10_pi i40e ptp pps_core ahci libahci libata deflate zlib_deflate
> [58063.323413] Unloaded tainted modules: livepatch_61_release6(OK):3369 livepatch_61_release12(OK):3370 [last unloaded: livepatch_61_release12(OK)]
> [58063.323418] CPU: 58 PID: 3851051 Comm: docker Kdump: loaded Tainted: G S O K 6.1.52-3
> [58063.323421] Hardware name: Inspur SA5212M5/SA5212M5, BIOS 4.1.20 05/05/2021
> [58063.323423] RIP: 0010:klp_ftrace_handler+0x136/0x150
> [58063.323425] Code: eb b3 0f 1f 44 00 00 eb b5 8b 89 f4 23 00 00 83 f9 ff 74 16 85 c9 75 89 48 8b 00 48 8d 50 90 48 39 c6 0f 85 79 ff ff ff eb 8b <0f> 0b e9 70 ff ff ff e8 ae 24 9c 00 66 66 2e 0f 1f 84 00 00 00 00
> [58063.323428] RSP: 0018:ffffa87b2367fbb8 EFLAGS: 00010046
> [58063.323429] RAX: ffff8b0ee59229a0 RBX: ffff8b26fa47b000 RCX: 00000000ffffffff
> [58063.323432] RDX: ffff8b0ee5922930 RSI: ffff8b2d41e53f10 RDI: ffffa87b2367fbd8
> [58063.323433] RBP: ffffa87b2367fbc8 R08: 0000000000000000 R09: fffffffffffffff7
> [58063.323434] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8b2520eaf240
> [58063.323435] R13: ffff8b26fa47b000 R14: 000000000002f240 R15: 0000000000000000
> [58063.323436] FS:  0000000000000000(0000) GS:ffff8b2520e80000(0000) knlGS:0000000000000000
> [58063.323438] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [58063.323440] CR2: 00007f19d7eaf000 CR3: 000000187a676004 CR4: 00000000007706e0
> [58063.323441] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [58063.323442] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [58063.323444] PKRU: 55555554
> [58063.323445] Call Trace:
> [58063.323445]  <TASK>
> [58063.323449]  ? show_regs.cold+0x1a/0x1f
> [58063.323454]  ? klp_ftrace_handler+0x136/0x150
> [58063.323455]  ? __warn+0x84/0xd0
> [58063.323457]  ? klp_ftrace_handler+0x136/0x150
> [58063.323459]  ? report_bug+0x105/0x180
> [58063.323463]  ? handle_bug+0x40/0x70
> [58063.323467]  ? exc_invalid_op+0x19/0x70
> [58063.323469]  ? asm_exc_invalid_op+0x1b/0x20
> [58063.323474]  ? klp_ftrace_handler+0x136/0x150
> [58063.323476]  ? kmem_cache_free+0x155/0x470
> [58063.323479]  0xffffffffc0876099
> [58063.323495]  ? update_rq_clock+0x5/0x250
> [58063.323498]  update_rq_clock+0x5/0x250
> [58063.323500]  __schedule+0xed/0x8f0
> [58063.323504]  ? update_rq_clock+0x5/0x250
> [58063.323506]  ? __schedule+0xed/0x8f0
> [58063.323508]  ? trace_hardirqs_off+0x36/0xf0
> [58063.323512]  do_task_dead+0x44/0x50
> [58063.323515]  do_exit+0x7cd/0xb90 [livepatch_61_release6]
> [58063.323525]  ? xfs_inode_mark_reclaimable+0x320/0x320 [livepatch_61_release6]
> [58063.323533]  do_group_exit+0x35/0x90
> [58063.323536]  get_signal+0x909/0x950
> [58063.323539]  ? get_futex_key+0xa4/0x4f0
> [58063.323543]  arch_do_signal_or_restart+0x34/0x2a0
> [58063.323547]  exit_to_user_mode_prepare+0x149/0x1b0
> [58063.323551]  syscall_exit_to_user_mode+0x1e/0x50
> [58063.323555]  do_syscall_64+0x48/0x90
> [58063.323557]  entry_SYSCALL_64_after_hwframe+0x64/0xce
> [58063.323560] RIP: 0033:0x5601df8122f3
> [58063.323563] Code: Unable to access opcode bytes at 0x5601df8122c9.
> [58063.323564] RSP: 002b:00007f9ce6ffccc0 EFLAGS: 00000286 ORIG_RAX: 00000000000000ca
> [58063.323566] RAX: fffffffffffffe00 RBX: 000000c42053e000 RCX: 00005601df8122f3
> [58063.323567] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000c42053e148
> [58063.323568] RBP: 00007f9ce6ffcd08 R08: 0000000000000000 R09: 0000000000000000
> [58063.323569] R10: 0000000000000000 R11: 0000000000000286 R12: 0000000000000000
> [58063.323570] R13: 0000000000801000 R14: 0000000000000000 R15: 00007f9ce6ffd700
> [58063.323573]  </TASK>
> [58063.323574] ---[ end trace 0000000000000000 ]---
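If I read the livepatch core correctly, patch_state can still be
KLP_UNDEFINED at this point because the transition bookkeeping only
walks the task list, and the exiting task has already been unhashed.
A rough illustration (my own sketch; the real logic lives in
klp_update_patch_state() and klp_try_complete_transition(), and the
function name below is made up):

    /*
     * Sketch: the completion check visits only tasks still hashed in
     * the task list. A task that already went through release_task()
     *   -> __exit_signal() -> __unhash_process()
     * is skipped, even though it may still be running patched code.
     */
    static bool klp_all_visible_tasks_patched(void)
    {
            struct task_struct *g, *task;
            bool complete = true;

            read_lock(&tasklist_lock);
            for_each_process_thread(g, task) {
                    if (task->patch_state != klp_target_state) {
                            complete = false;
                            goto unlock;
                    }
            }
    unlock:
            read_unlock(&tasklist_lock);

            /* May be true even though an unhashed task lags behind. */
            return complete;
    }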
> > This might become a real problem only when the code modifies
> > the semantics.
> >
> > 2. The livepatching core no longer checks the stack and could
> >    finish the livepatch transition even when these to-be-removed
> >    processes have not been transitioned yet.
> >
> >    This might even cause an Oops when the to-be-removed processes
> >    are running code from a disabled livepatch which might be
> >    removed in the meantime.
> >
> > This patch tries to address the 2nd problem, which most likely
> > caused the following crash:
> >
> > [...]
> >
> > This patch tries to avoid the crash by tracking the number of
> > to-be-released processes. They block the current transition
> > until delayed_put_task_struct() is called.
> >
> > It is just a POC. There are many open questions:
> >
> > 1. Does it help at all?
> >
> >    It looks to me that release_task() is always called from another
> >    task. For example, exit_notify() seems to call it for dead
> >    children. It is not clear to me whether the released task is
> >    still running do_exit() at this point.
>
> If the task is a thread (but not the thread group leader), it should
> call release_task() on itself, correct? Below is the trace from our
> production server:
>
> $ bpftrace -e 'k:release_task {$tsk=(struct task_struct *)arg0; if ($tsk->pid == tid){@stack[kstack()]=count()}}'
> @stack[
>     release_task+1
>     kthread_exit+41
>     kthread+200
>     ret_from_fork+31
> ]: 1
> @stack[
>     release_task+1
>     do_group_exit+53
>     __x64_sys_exit_group+24
>     do_syscall_64+56
>     entry_SYSCALL_64_after_hwframe+100
> ]: 2
> @stack[
>     release_task+1
>     do_group_exit+53
>     get_signal+2313
>     arch_do_signal_or_restart+52
>     exit_to_user_mode_prepare+329
>     syscall_exit_to_user_mode+30
>     do_syscall_64+72
>     entry_SYSCALL_64_after_hwframe+100
> ]: 20
> @stack[
>     release_task+1
>     __x64_sys_exit+27
>     do_syscall_64+56
>     entry_SYSCALL_64_after_hwframe+100
> ]: 26
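These stacks match the autoreap path in exit_notify(). Heavily abridged
from kernel/exit.c (ptrace handling and tasklist_lock locking elided,
so treat this as a sketch rather than the exact code), an exiting
non-leader thread releases itself while it is still inside do_exit():

    static void exit_notify_sketch(struct task_struct *tsk)
    {
            struct task_struct *p, *n;
            bool autoreap;
            LIST_HEAD(dead);

            if (thread_group_leader(tsk))
                    /* Group leaders are usually reaped by the parent. */
                    autoreap = thread_group_empty(tsk) &&
                               do_notify_parent(tsk, tsk->exit_signal);
            else
                    autoreap = true;        /* non-leader: self-reap */

            if (autoreap) {
                    tsk->exit_state = EXIT_DEAD;
                    list_add(&tsk->ptrace_entry, &dead);
            }

            list_for_each_entry_safe(p, n, &dead, ptrace_entry) {
                    list_del_init(&p->ptrace_entry);
                    /*
                     * __unhash_process() happens in here, while the
                     * task (possibly tsk itself) still runs do_exit().
                     */
                    release_task(p);
            }
    }

That would explain why the do_group_exit() and sys_exit() stacks
dominate the counts above: each exiting thread reaps itself.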
> >    Well, for example, wait_task_zombie() calls release_task()
> >    in some special cases.
> >
> > 2. Is it enough to block the transition until delayed_put_task_struct()?
> >
> >    I do not fully understand the maze of code. It might still be
> >    too early.
> >
> >    It seems that put_task_struct() can delay the release even more.
>
> After delayed_put_task_struct(), there is still some code that needs
> to be executed. It appears that this just reduces the likelihood of
> the issue occurring, but does not completely prevent it.

The last function to be executed is do_task_dead(), and this function
will not return to do_exit() again. What if we define this function as
"__noinline __nopatchable" and perform the final check within it? The
"__noinline" attribute ensures it won't be inlined into do_exit(),
while "__nopatchable" guarantees that it can't be livepatched.
Something like the following?

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6569,6 +6569,7 @@ void __noreturn do_task_dead(void)
 	/* Tell freezer to ignore us: */
 	current->flags |= PF_NOFREEZE;
 
+	klp_put_releasing_task();
 	__schedule(SM_NONE);
 	BUG();

--
Regards
Yafang