After talking with Peter, this seems like it might be a potential
approach to fix the issue for kernels both with PREEMPT enabled and
disabled.

If this looks like a reasonable approach to people, we can run
experiments with this patch on a few thousand systems, and compare it
with the kernel live patch transition latencies (and number of
failures) on a kernel without that patch.

Does this look like an approach that could work?

---8<---
sched,livepatch: call stop_one_cpu in klp_check_and_switch_task

If a running task fails to transition to the new kernel live patch
after the first attempt, use the stopper thread to preempt it during
subsequent attempts at switching to the new kernel live patch.

<INSERT EXPERIMENTAL RESULTS HERE>

Signed-off-by: Rik van Riel <riel@xxxxxxxxxxx>

diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
index 5d03a2ad1066..26e9e5f09822 100644
--- a/kernel/livepatch/transition.c
+++ b/kernel/livepatch/transition.c
@@ -9,6 +9,7 @@

 #include <linux/cpu.h>
 #include <linux/stacktrace.h>
+#include <linux/stop_machine.h>
 #include "core.h"
 #include "patch.h"
 #include "transition.h"
@@ -281,6 +282,11 @@ static int klp_check_and_switch_task(struct task_struct *task, void *arg)
 	return 0;
 }

+static int kpatch_dummy_fn(void *dummy)
+{
+	return 0;
+}
+
 /*
  * Try to safely switch a task to the target patch state. If it's currently
  * running, or it's sleeping on a to-be-patched or to-be-unpatched function, or
@@ -315,6 +321,9 @@ static bool klp_try_switch_task(struct task_struct *task)
 	case -EBUSY:	/* klp_check_and_switch_task() */
 		pr_debug("%s: %s:%d is running\n",
 			 __func__, task->comm, task->pid);
+		/* Preempt the task from the second KLP switch attempt. */
+		if (klp_signals_cnt)
+			stop_one_cpu(task_cpu(task), kpatch_dummy_fn, NULL);
 		break;
 	case -EINVAL:	/* klp_check_and_switch_task() */
 		pr_debug("%s: %s:%d has an unreliable stack\n",