On Wed 2025-01-22 22:01:31, Yafang Shao wrote:
> On Wed, Jan 22, 2025 at 9:30 PM Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
> >
> > On Wed, Jan 22, 2025 at 7:45 PM Petr Mladek <pmladek@xxxxxxxx> wrote:
> > >
> > > On Wed 2025-01-22 14:36:55, Yafang Shao wrote:
> > > > On Tue, Jan 21, 2025 at 5:38 PM Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
> > > > >
> > > > > Hello,
> > > > >
> > > > > We encountered a panic while upgrading our livepatch, specifically
> > > > > replacing an old livepatch with a new version on our production
> > > > > servers.
> > >
> > > My theory is that the transition has finished and some other process
> > > started removing the older livepatch module. I guess that the memory
> > > with the livepatch_61_release6 code has been freed on another CPU.
> > >
> > > It would cause a crash of a process still running the freed do_exit()
> > > function. The process would not block the transition after it was
> > > removed from the task list in the middle of do_exit().
> > >
> > > Maybe, you could confirm this in the existing crash dump.
> >
> > That's correct, I can confirm this. Below are the details:
> >
> > crash> bt
> > PID: 783972  TASK: ffff94cd316f0000  CPU: 70  COMMAND: "java"
> >  #0 [ffffba6f273db9a8] machine_kexec at ffffffff990632ad
> >  #1 [ffffba6f273dba08] __crash_kexec at ffffffff9915c8af
> >  #2 [ffffba6f273dbad0] crash_kexec at ffffffff9915db0c
> >  #3 [ffffba6f273dbae0] oops_end at ffffffff99024bc9
> >  #4 [ffffba6f273dbaf0] _MODULE_START_livepatch_61_release6 at ffffffffc0ded7fa [livepatch_61_release6]
> >  #5 [ffffba6f273dbb80] _MODULE_START_livepatch_61_release6 at ffffffffc0ded7fa [livepatch_61_release6]
> >  #6 [ffffba6f273dbbf8] _MODULE_START_livepatch_61_release6 at ffffffffc0ded7fa [livepatch_61_release6]
> >  #7 [ffffba6f273dbc80] asm_exc_page_fault at ffffffff99c00bb7
> >     [exception RIP: _MODULE_START_livepatch_61_release6+14330]
> >     RIP: ffffffffc0ded7fa  RSP: ffffba6f273dbd30  RFLAGS: 00010282
> >
> > crash> task_struct.tgid ffff94cd316f0000
> >   tgid = 783848,
> >
> > crash> task_struct.tasks -o init_task
> > struct task_struct {
> >   [ffffffff9ac1b310] struct list_head tasks;
> > }
> >
> > crash> list task_struct.tasks -H ffffffff9ac1b310 -s task_struct.tgid | grep 783848
> >   tgid = 783848,
> >
> > The thread group leader remains on the task list, but the thread has
> > already been removed from the thread_head list.
> >
> > crash> task 783848
> > PID: 783848  TASK: ffff94cd603eb000  CPU: 18  COMMAND: "java"
> > struct task_struct {
> >   thread_info = {
> >     flags = 16388,
> >
> > crash> task_struct.signal ffff94cd603eb000
> >   signal = 0xffff94cc89d11b00,
> >
> > crash> signal_struct.thread_head -o 0xffff94cc89d11b00
> > struct signal_struct {
> >   [ffff94cc89d11b10] struct list_head thread_head;
> > }
> >
> > crash> list task_struct.thread_node -H ffff94cc89d11b10 -s task_struct.pid
> > ffff94cd603eb000
> >   pid = 783848,
> > ffff94ccd8343000
> >   pid = 783879,
> >
> > crash> signal_struct.nr_threads,thread_head 0xffff94cc89d11b00
> >   nr_threads = 2,
> >   thread_head = {
> >     next = 0xffff94cd603eba70,
> >     prev = 0xffff94ccd8343a70
> >   },
> >
> > crash> ps -g 783848
> > PID: 783848  TASK: ffff94cd603eb000  CPU: 18  COMMAND: "java"
> > PID: 783879  TASK: ffff94ccd8343000  CPU: 81  COMMAND: "java"
> > PID: 783972  TASK: ffff94cd316f0000  CPU: 70  COMMAND: "java"
> > PID: 784023  TASK: ffff94d644b48000  CPU: 24  COMMAND: "java"
> > PID: 784025  TASK: ffff94dd30250000  CPU: 65  COMMAND: "java"
> > PID: 785242  TASK: ffff94ccb5963000  CPU: 48  COMMAND: "java"
> > PID: 785412  TASK: ffff94cd3eaf8000  CPU: 92  COMMAND: "java"
> > PID: 785415  TASK: ffff94cd6606b000  CPU: 23  COMMAND: "java"
> > PID: 785957  TASK: ffff94dfea4e3000  CPU: 16  COMMAND: "java"
> > PID: 787125  TASK: ffff94e70547b000  CPU: 27  COMMAND: "java"
> > PID: 787445  TASK: ffff94e49a2bb000  CPU: 28  COMMAND: "java"
> > PID: 787502  TASK: ffff94e41e0f3000  CPU: 36  COMMAND: "java"
> >
> > It seems like fixing this will be a challenging task.

Could you please check whether another CPU or process was running "rmmod"
and removing the replaced livepatch_61_release6 at the time of the crash?

> Hello Petr,
>
> I believe this case highlights the need for a hybrid livepatch mode,
> where we allow the coexistence of atomic-replace and non-atomic-replace
> patches. If a livepatch is set to non-replaceable, it should neither be
> replaced by other livepatches nor replace any other patches itself.
>
> We've deployed this livepatch, including the change to do_exit(), to
> nearly all of our servers, hundreds of thousands in total. It's a real
> tragedy that we can't unload it. Moving forward, we'll have no choice
> but to create non-atomic-replace livepatches to avoid this issue...

If my theory is correct, then a workaround would be to keep the replaced
livepatch module loaded until all pending do_exit() calls have finished,
so that the code stays in memory as long as it might still be executed.
It might be enough to update the scripting and call rmmod only after
some delay.

I doubt that non-atomic-replace patches would make life easier. They
would just create an even more complicated scenario. But I might be
wrong.

Anyway, I am working on a POC which would allow tracking of
to-be-released processes. It would finish the transition only when all
the to-be-released processes are already using the new code, so it
would not be possible to remove the disabled livepatch prematurely.

Best Regards,
Petr
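
PS: A minimal sketch of what the delayed rmmod could look like in the
upgrade scripting. This only narrows the race window rather than closing
it; the new module name and the delay value are just placeholders, and
the "transition" sysfs attribute is the standard livepatch one:

# Load the new cumulative (atomic replace) livepatch.
insmod livepatch_61_release7.ko

# Wait until the transition to the new livepatch has finished.
while [ "$(cat /sys/kernel/livepatch/livepatch_61_release7/transition)" != "0" ]
do
	sleep 1
done

# Give tasks that were already removed from the task list in the middle
# of do_exit() some extra time to leave the old code. The value is
# arbitrary; it only makes the crash unlikely, it does not prevent it.
sleep 60

# Only now free the code of the replaced livepatch.
rmmod livepatch_61_release6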