Re: [PATCH 2/3] livepatch,sched: Add livepatch task switching to cond_resched()

Josh Poimboeuf <jpoimboe@xxxxxxxxxx> · Wed, 15 Feb 2023 18:26:30 -0800

On Wed, Feb 15, 2023 at 02:30:36PM +0100, Petr Mladek wrote:
> >  static inline int _cond_resched(void)
> >  {
> > +	klp_sched_try_switch();
> >  	return __cond_resched();
> 
> My only concern is if it might cause any performance problems.
> 
> On one hand, cond_resched() is used in code paths that are slow
> on its own. Also it will do nothing most of the time.
> 
> On the other hand, cond_resched() is typically used in cycles.
> One cycle might be fast. The code might be slow because there
> are too many cycles. Repeating the same failing test might
> prolong the time significantly.

Yes, but it should hopefully be very rare to patch a function in the
call stack of a kthread loop.  In general it's a good idea for the patch
author to avoid that.

> An idea is to try the switch only when it was not done during
> a real schedule. Something like:
> 
> static inline int _cond_resched(void)
> {
> 	int scheduled;
> 
> 	scheduled = __cond_resched();
> 	if (scheduled)
> 		klp_sched_try_switch();
> 
> 	return scheduled();
> }
> 
> But it would make it less reliable/predictable. Also it won't work
> in configurations when cond_resched() is always a nop.
> 
> I am probably too careful. We might keep it simple until any real
> life problems are reported.

If we can get away with it, I much prefer the simple unconditional
klp_sched_try_switch() because of the predictability and quickness with
which the kthread gets patched.

> > --- a/kernel/livepatch/transition.c
> > +++ b/kernel/livepatch/transition.c
> > @@ -76,6 +96,8 @@ static void klp_complete_transition(void)
> >  		 klp_transition_patch->mod->name,
> >  		 klp_target_state == KLP_PATCHED ? "patching" : "unpatching");
> >  
> > +	klp_cond_resched_disable();
> > +
> 
> Nit: Strictly speaking, this is not needed when klp_complete_transition()
>      is called from klp_cancel_transition(). In this case,
>      klp_cond_resched_enable() was not called. So it might be moved into
>      klp_try_complete_transition().

Argh, I always forget about that pesky klp_cancel_transition().

> More important thing, thinking loudly:
> 
> We need to make sure that no task is in the middle
> klp_cond_resched_disable() when we modify anything that is used there.
> 
> We seem to be on the safe side in klp_complete_transition(). We are
> here only when all tasks have TIF_PATCH_PENDING cleared. In this case,
> __klp_sched_try_switch() just returns. Also it calls
> klp_synchronize_transition() so that all tasks finish the critical part
> in __klp_sched_try_switch() before any new transition starts.
> 
> But it is not the case in klp_reverse_transition(). It modifies
> klp_target_state() when __klp_sched_try_switch might be in the middle
> of klp_check_stack() and it might give wrong result.
> 
> klp_reverse_transition() already solves similar race with
> klp_update_patch_state() by clearing all TIF_PATCH_PENDING flags
> and calling klp_synchronize_transition(). We just need to do
> it earlier. Something like:

Yes!  Thanks, I can always count on you to find the race conditions ;-)

This highlights the similarities between klp_target_state(current) and
__klp_sched_try_switch(), they both access TIF_PATCH_PENDING
out-of-band.

Also, I'll update the comment in klp_copy_process(). It should be safe
for with __klp_sched_try_switch() for the same reason as
klp_update_patch_state(current): they all only work on 'current'.

-- 
Josh