On Fri, Jan 24, 2025 at 11:11 AM Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
>
> On Thu, Jan 23, 2025 at 11:55 PM Petr Mladek <pmladek@xxxxxxxx> wrote:
> >
> > On Wed 2025-01-22 16:56:31, Petr Mladek wrote:
> > > Anyway, I am working on a POC which would allow tracking of
> > > to-be-released processes. It would finish the transition only
> > > when all the to-be-released processes already use the new code.
> > > It won't allow removing the disabled livepatch prematurely.
> >
> > Here is the POC. I am not sure whether it is the right path; see
> > the warning and the open questions in the commit message.
> >
> > I am going to wait for some feedback before investing more time
> > into this.
> >
> > The patch is against current Linus' master, aka v6.13-rc7.
> >
> > From ac7287d99aaeca7a4536e8ade61b9bd0c8ec7fdc Mon Sep 17 00:00:00 2001
> > From: Petr Mladek <pmladek@xxxxxxxx>
> > Date: Thu, 23 Jan 2025 09:04:09 +0100
> > Subject: [PATCH] livepatching: Block transition until
> >  delayed_put_task_struct()
> >
> > WARNING: This patch is just a POC. It will blow up the system because
> >          RCU callbacks are handled in softirq context, which by
> >          default runs on exit from IRQ handlers, and it is not
> >          allowed to take sleeping locks there; see the backtrace
> >          at the end of the commit message.
> >
> >          We would need to synchronize the counting of the exiting
> >          processes with the livepatch transition another way.
> >
> >          Hmm, I guess that a spin_lock is legal in softirq context.
> >          It might be the easiest approach.
> >
> >          In the worst case, we would need to use a lockless
> >          algorithm, which might make things even more complicated.
> >
> > Here is the description of the problem and the solution:
> >
> > The livepatching core code uses a for_each_process_thread() loop for
> > setting and checking the state of processes on the system. It works
> > well as long as the livepatch touches only code which is used only
> > by processes in the task list.
> >
> > The problem starts when the livepatch replaces code which might be
> > used by a process which is no longer in the task list. It is mostly
> > about processes which are being removed. They disappear from the
> > list here:
> >
> >   + release_task()
> >     + __exit_signal()
> >       + __unhash_process()
> >
> > There are basically two groups of problems:
> >
> > 1. The livepatching core no longer updates TIF_PATCH_PENDING
> >    and p->patch_state. In this case, the ftrace handler
> >    klp_check_stack_func() might make a wrong decision and
> >    use an incompatible variant of the code.
>
> I believe I was able to reproduce the issue while attempting to
> trigger the panic. The warning occurred at the following location:
>
>     klp_ftrace_handler
>         if (unlikely(func->transition)) {
>             WARN_ON_ONCE(patch_state == KLP_UNDEFINED);
>         }
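For reference, here is a simplified sketch of the decision logic around
that WARN_ON_ONCE() in klp_ftrace_handler() (kernel/livepatch/patch.c).
It is my own abridged illustration, not the exact code: the real handler
receives (ip, parent_ip, fops, fregs), looks up ops and func via ftrace,
and runs under RCU. The quoted warning log continues below.

    /* Sketch only: pick which variant of a patched function to run. */
    static void notrace klp_ftrace_handler_sketch(struct klp_ops *ops,
                                                  struct klp_func *func,
                                                  struct ftrace_regs *fregs)
    {
            /* func is the newest variant on ops->func_stack. */
            if (unlikely(func->transition)) {
                    int patch_state = current->patch_state;

                    /*
                     * A task that already went through __unhash_process()
                     * is never visited by the transition code again, so
                     * its patch_state may still be KLP_UNDEFINED here.
                     */
                    WARN_ON_ONCE(patch_state == KLP_UNDEFINED);

                    if (patch_state == KLP_UNPATCHED) {
                            /* Fall back to the previous variant. */
                            func = list_next_entry(func, stack_node);
                            if (&func->stack_node == &ops->func_stack)
                                    return; /* run the original function */
                    }
            }

            /* Redirect execution to the chosen variant. */
            ftrace_regs_set_instruction_pointer(fregs,
                                        (unsigned long)func->new_func);
    }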
> The warning message is as follows:
>
> [58063.291589] livepatch: enabling patch 'livepatch_61_release12'
> [58063.297580] livepatch: 'livepatch_61_release12': starting patching transition
> [58063.323340] ------------[ cut here ]------------
> [58063.323343] WARNING: CPU: 58 PID: 3851051 at kernel/livepatch/patch.c:98 klp_ftrace_handler+0x136/0x150
> [58063.323349] Modules linked in: livepatch_61_release12(OK) livepatch_61_release6(OK) ebtable_filter ebtables af_packet_diag netlink_diag xt_DSCP xt_owner iptable_mangle raw_diag unix_diag udp_diag iptable_raw xt_CT tcp_diag inet_diag cls_bpf sch_ingress bpf_preload binfmt_misc iptable_filter bpfilter xt_conntrack nf_nat nf_conntrack_netlink nfnetlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 overlay af_packet bonding tls intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common isst_if_common skx_edac nfit x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass rapl vfat intel_cstate fat iTCO_wdt xfs intel_uncore pcspkr ses enclosure input_leds i2c_i801 i2c_smbus mei_me lpc_ich ioatdma mei mfd_core intel_pch_thermal dca acpi_ipmi wmi ipmi_si ipmi_devintf ipmi_msghandler acpi_cpufreq acpi_pad acpi_power_meter ip_tables ext4 mbcache jbd2 sd_mod sg mpt3sas raid_class scsi_transport_sas megaraid_sas crct10dif_pclmul crc32_pclmul crc32c_intel
> [58063.323402]  polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 aesni_intel nvme crypto_simd cryptd nvme_core t10_pi i40e ptp pps_core ahci libahci libata deflate zlib_deflate
> [58063.323413] Unloaded tainted modules: livepatch_61_release6(OK):3369 livepatch_61_release12(OK):3370 [last unloaded: livepatch_61_release12(OK)]
> [58063.323418] CPU: 58 PID: 3851051 Comm: docker Kdump: loaded Tainted: G S O K 6.1.52-3
> [58063.323421] Hardware name: Inspur SA5212M5/SA5212M5, BIOS 4.1.20 05/05/2021
> [58063.323423] RIP: 0010:klp_ftrace_handler+0x136/0x150
> [58063.323425] Code: eb b3 0f 1f 44 00 00 eb b5 8b 89 f4 23 00 00 83 f9 ff 74 16 85 c9 75 89 48 8b 00 48 8d 50 90 48 39 c6 0f 85 79 ff ff ff eb 8b <0f> 0b e9 70 ff ff ff e8 ae 24 9c 00 66 66 2e 0f 1f 84 00 00 00 00
> [58063.323428] RSP: 0018:ffffa87b2367fbb8 EFLAGS: 00010046
> [58063.323429] RAX: ffff8b0ee59229a0 RBX: ffff8b26fa47b000 RCX: 00000000ffffffff
> [58063.323432] RDX: ffff8b0ee5922930 RSI: ffff8b2d41e53f10 RDI: ffffa87b2367fbd8
> [58063.323433] RBP: ffffa87b2367fbc8 R08: 0000000000000000 R09: fffffffffffffff7
> [58063.323434] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8b2520eaf240
> [58063.323435] R13: ffff8b26fa47b000 R14: 000000000002f240 R15: 0000000000000000
> [58063.323436] FS:  0000000000000000(0000) GS:ffff8b2520e80000(0000) knlGS:0000000000000000
> [58063.323438] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [58063.323440] CR2: 00007f19d7eaf000 CR3: 000000187a676004 CR4: 00000000007706e0
> [58063.323441] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [58063.323442] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [58063.323444] PKRU: 55555554
> [58063.323445] Call Trace:
> [58063.323445]  <TASK>
> [58063.323449]  ? show_regs.cold+0x1a/0x1f
> [58063.323454]  ? klp_ftrace_handler+0x136/0x150
> [58063.323455]  ? __warn+0x84/0xd0
> [58063.323457]  ? klp_ftrace_handler+0x136/0x150
> [58063.323459]  ? report_bug+0x105/0x180
> [58063.323463]  ? handle_bug+0x40/0x70
> [58063.323467]  ? exc_invalid_op+0x19/0x70
> [58063.323469]  ? asm_exc_invalid_op+0x1b/0x20
> [58063.323474]  ? klp_ftrace_handler+0x136/0x150
> [58063.323476]  ? kmem_cache_free+0x155/0x470
> [58063.323479]  0xffffffffc0876099
> [58063.323495]  ? update_rq_clock+0x5/0x250
> [58063.323498]  update_rq_clock+0x5/0x250
> [58063.323500]  __schedule+0xed/0x8f0
> [58063.323504]  ? update_rq_clock+0x5/0x250
> [58063.323506]  ? __schedule+0xed/0x8f0
> [58063.323508]  ? trace_hardirqs_off+0x36/0xf0
> [58063.323512]  do_task_dead+0x44/0x50
> [58063.323515]  do_exit+0x7cd/0xb90 [livepatch_61_release6]
> [58063.323525]  ? xfs_inode_mark_reclaimable+0x320/0x320 [livepatch_61_release6]
> [58063.323533]  do_group_exit+0x35/0x90
> [58063.323536]  get_signal+0x909/0x950
> [58063.323539]  ? get_futex_key+0xa4/0x4f0
> [58063.323543]  arch_do_signal_or_restart+0x34/0x2a0
> [58063.323547]  exit_to_user_mode_prepare+0x149/0x1b0
> [58063.323551]  syscall_exit_to_user_mode+0x1e/0x50
> [58063.323555]  do_syscall_64+0x48/0x90
> [58063.323557]  entry_SYSCALL_64_after_hwframe+0x64/0xce
> [58063.323560] RIP: 0033:0x5601df8122f3
> [58063.323563] Code: Unable to access opcode bytes at 0x5601df8122c9.
> [58063.323564] RSP: 002b:00007f9ce6ffccc0 EFLAGS: 00000286 ORIG_RAX: 00000000000000ca
> [58063.323566] RAX: fffffffffffffe00 RBX: 000000c42053e000 RCX: 00005601df8122f3
> [58063.323567] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000c42053e148
> [58063.323568] RBP: 00007f9ce6ffcd08 R08: 0000000000000000 R09: 0000000000000000
> [58063.323569] R10: 0000000000000000 R11: 0000000000000286 R12: 0000000000000000
> [58063.323570] R13: 0000000000801000 R14: 0000000000000000 R15: 00007f9ce6ffd700
> [58063.323573]  </TASK>
> [58063.323574] ---[ end trace 0000000000000000 ]---
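If I read the livepatch core correctly, patch_state can still be
KLP_UNDEFINED at this point because the transition bookkeeping only
walks the task list, and the exiting task has already been unhashed.
A rough illustration (my own sketch; the real logic lives in
klp_update_patch_state() and klp_try_complete_transition(), and the
function name below is made up):

    /*
     * Sketch: the completion check visits only tasks still hashed in
     * the task list. A task that already went through release_task()
     *   -> __exit_signal() -> __unhash_process()
     * is skipped, even though it may still be running patched code.
     */
    static bool klp_all_visible_tasks_patched(void)
    {
            struct task_struct *g, *task;
            bool complete = true;

            read_lock(&tasklist_lock);
            for_each_process_thread(g, task) {
                    if (task->patch_state != klp_target_state) {
                            complete = false;
                            goto unlock;
                    }
            }
    unlock:
            read_unlock(&tasklist_lock);

            /* May be true even though an unhashed task lags behind. */
            return complete;
    }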
> > This might become a real problem only when the code modifies
> > the semantics.
> >
> > 2. The livepatching core no longer checks the stack and could
> >    finish the livepatch transition even when these to-be-removed
> >    processes have not been transitioned yet.
> >
> >    This might even cause an Oops when the to-be-removed processes
> >    are running code from a disabled livepatch which might be
> >    removed in the meantime.
> >
> > This patch tries to address the 2nd problem, which most likely
> > caused the following crash:
> >
> > [...]
> >
> > This patch tries to avoid the crash by tracking the number of
> > to-be-released processes. They block the current transition
> > until delayed_put_task_struct() is called.
> >
> > It is just a POC. There are many open questions:
> >
> > 1. Does it help at all?
> >
> >    It looks to me that release_task() is always called from another
> >    task. For example, exit_notify() seems to call it for dead
> >    children. It is not clear to me whether the released task is
> >    still running do_exit() at this point.
>
> If the task is a thread (but not the thread group leader), it should
> call release_task() on itself, correct? Below is the trace from our
> production server:
>
> $ bpftrace -e 'k:release_task {$tsk=(struct task_struct *)arg0; if ($tsk->pid == tid){@stack[kstack()]=count()}}'
> @stack[
>     release_task+1
>     kthread_exit+41
>     kthread+200
>     ret_from_fork+31
> ]: 1
> @stack[
>     release_task+1
>     do_group_exit+53
>     __x64_sys_exit_group+24
>     do_syscall_64+56
>     entry_SYSCALL_64_after_hwframe+100
> ]: 2
> @stack[
>     release_task+1
>     do_group_exit+53
>     get_signal+2313
>     arch_do_signal_or_restart+52
>     exit_to_user_mode_prepare+329
>     syscall_exit_to_user_mode+30
>     do_syscall_64+72
>     entry_SYSCALL_64_after_hwframe+100
> ]: 20
> @stack[
>     release_task+1
>     __x64_sys_exit+27
>     do_syscall_64+56
>     entry_SYSCALL_64_after_hwframe+100
> ]: 26
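These stacks match the autoreap path in exit_notify(). Heavily abridged
from kernel/exit.c (ptrace handling and tasklist_lock locking elided,
so treat this as a sketch rather than the exact code), an exiting
non-leader thread releases itself while it is still inside do_exit():

    static void exit_notify_sketch(struct task_struct *tsk)
    {
            struct task_struct *p, *n;
            bool autoreap;
            LIST_HEAD(dead);

            if (thread_group_leader(tsk))
                    /* Group leaders are usually reaped by the parent. */
                    autoreap = thread_group_empty(tsk) &&
                               do_notify_parent(tsk, tsk->exit_signal);
            else
                    autoreap = true;        /* non-leader: self-reap */

            if (autoreap) {
                    tsk->exit_state = EXIT_DEAD;
                    list_add(&tsk->ptrace_entry, &dead);
            }

            list_for_each_entry_safe(p, n, &dead, ptrace_entry) {
                    list_del_init(&p->ptrace_entry);
                    /*
                     * __unhash_process() happens in here, while the
                     * task (possibly tsk itself) still runs do_exit().
                     */
                    release_task(p);
            }
    }

That would explain why the do_group_exit() and sys_exit() stacks
dominate the counts above: each exiting thread reaps itself.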
> >    Well, for example, wait_task_zombie() calls release_task()
> >    in some special cases.
> >
> > 2. Is it enough to block the transition until delayed_put_task_struct()?
> >
> >    I do not fully understand the maze of code. It might still be
> >    too early.
> >
> >    It seems that put_task_struct() can delay the release even more.
>
> After delayed_put_task_struct(), there is still some code that needs
> to be executed. It appears that this just reduces the likelihood of
> the issue occurring, but does not completely prevent it.

The last function to be executed is do_task_dead(), and this function
will not return to do_exit() again. What if we define this function as
"__noinline __nopatchable" and perform the final check within it? The
"__noinline" attribute ensures it won't be inlined into do_exit(),
while "__nopatchable" guarantees that it can't be livepatched.
Something like the following?

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6569,6 +6569,7 @@ void __noreturn do_task_dead(void)
 	/* Tell freezer to ignore us: */
 	current->flags |= PF_NOFREEZE;
 
+	klp_put_releasing_task();
 	__schedule(SM_NONE);
 	BUG();

--
Regards
Yafang