On Thu 2023-02-09 11:17:48, Josh Poimboeuf wrote: > There have been reports [1][2] of live patches failing to complete > within a reasonable amount of time due to CPU-bound kthreads. > > Fix it by patching tasks in cond_resched(). > > There are four different flavors of cond_resched(), depending on the > kernel configuration. Hook into all of them. > > A more elegant solution might be to use a preempt notifier. However, > non-ORC unwinders can't unwind a preempted task reliably. > > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -2085,20 +2090,25 @@ static __always_inline int _cond_resched(void) > return dynamic_cond_resched(); > } > > -#else > +#else /* !CONFIG_PREEMPTION */ > > static inline int _cond_resched(void) > { > + klp_sched_try_switch(); > return __cond_resched(); My only concern is if it might cause any performance problems. On one hand, cond_resched() is used in code paths that are slow on its own. Also it will do nothing most of the time. On the other hand, cond_resched() is typically used in cycles. One cycle might be fast. The code might be slow because there are too many cycles. Repeating the same failing test might prolong the time significantly. An idea is to try the switch only when it was not done during a real schedule. Something like: static inline int _cond_resched(void) { int scheduled; scheduled = __cond_resched(); if (scheduled) klp_sched_try_switch(); return scheduled(); } But it would make it less reliable/predictable. Also it won't work in configurations when cond_resched() is always a nop. I am probably too careful. We might keep it simple until any real life problems are reported. > --- a/kernel/livepatch/transition.c > +++ b/kernel/livepatch/transition.c > @@ -76,6 +96,8 @@ static void klp_complete_transition(void) > klp_transition_patch->mod->name, > klp_target_state == KLP_PATCHED ? "patching" : "unpatching"); > > + klp_cond_resched_disable(); > + Nit: Strictly speaking, this is not needed when klp_complete_transition() is called from klp_cancel_transition(). In this case, klp_cond_resched_enable() was not called. So it might be moved into klp_try_complete_transition(). More important thing, thinking loudly: We need to make sure that no task is in the middle klp_cond_resched_disable() when we modify anything that is used there. We seem to be on the safe side in klp_complete_transition(). We are here only when all tasks have TIF_PATCH_PENDING cleared. In this case, __klp_sched_try_switch() just returns. Also it calls klp_synchronize_transition() so that all tasks finish the critical part in __klp_sched_try_switch() before any new transition starts. But it is not the case in klp_reverse_transition(). It modifies klp_target_state() when __klp_sched_try_switch might be in the middle of klp_check_stack() and it might give wrong result. klp_reverse_transition() already solves similar race with klp_update_patch_state() by clearing all TIF_PATCH_PENDING flags and calling klp_synchronize_transition(). We just need to do it earlier. Something like: --- a/kernel/livepatch/transition.c +++ b/kernel/livepatch/transition.c @@ -642,14 +642,11 @@ void klp_reverse_transition(void) klp_target_state == KLP_PATCHED ? "patching to unpatching" : "unpatching to patching"); - klp_transition_patch->enabled = !klp_transition_patch->enabled; - - klp_target_state = !klp_target_state; /* * Clear all TIF_PATCH_PENDING flags to prevent races caused by - * klp_update_patch_state() running in parallel with - * klp_start_transition(). + * klp_update_patch_state() and __klp_sched_try_switch() + * running in parallel. */ read_lock(&tasklist_lock); for_each_process_thread(g, task) @@ -659,9 +656,16 @@ void klp_reverse_transition(void) for_each_possible_cpu(cpu) clear_tsk_thread_flag(idle_task(cpu), TIF_PATCH_PENDING); - /* Let any remaining calls to klp_update_patch_state() complete */ + /* + * Make sure that both klp_update_patch_state() and + * __klp_sched_try_switch() see the TIF_PATCH_PENDING cleared. + */ klp_synchronize_transition(); + klp_transition_patch->enabled = !klp_transition_patch->enabled; + + klp_target_state = !klp_target_state; + klp_start_transition(); } > if (klp_transition_patch->replace && klp_target_state == KLP_PATCHED) { > klp_unpatch_replaced_patches(klp_transition_patch); > klp_discard_nops(klp_transition_patch); Otherwise, the patch looks fine. I looked at it primary from the livepatching code side. Best Regards, Petr