Re: [PATCH v1 2/7] sched/isolation: Adjust affinity of timers according to change of housekeeping cpumask

Thomas Gleixner <tglx@xxxxxxxxxxxxx> · Sat, 18 May 2024 00:39:26 +0200

On Thu, May 16 2024 at 22:04, Costa Shulyupin wrote:
> Adjust affinity timers and watchdog_cpumask according to
> change of housekeeping.cpumasks[HK_TYPE_TIMER] during runtime.

Timers and watchdog are two different things. It's clearly documented
that a patch should change one thing and not combine random stuff.

> watchdog_cpumask is initialized during boot in lockup_detector_init()
> from housekeeping_cpumask(HK_TYPE_TIMER).
>
> lockup_detector_reconfigure() utilizes updated watchdog_cpumask
> via __lockup_detector_reconfigure().

You describe WHAT the patch is doing, but there is no WHY and zero
rationale why this is correct.

> timers_resettle_from_cpu() is blindly prototyped from timers_dead_cpu().
> local_irq_disable is used because cpuhp_thread_fun uses it before
> cpuhp_invoke_callback() call.

I have no idea what this word salad is trying to tell me.

The local_irq_disable() in cpuhp_thread_fun() has absolutely nothing to
do with timers_dead_cpu().

> Core test snippets without infrastructure:
>
> 1. Create timer on specific cpu with:
>
> 	timer_setup(&test_timer, test_timer_cb, TIMER_PINNED);
>         test_timer.expires = KTIME_MAX;
>         add_timer_on(&test_timer, test_cpu);
>
> 2. Call housekeeping_update()
>
> 3. Assure that there is no timers on specified cpu at the end
> of timers_resettle_from_cpu() with:
>
> static int count_timers(int cpu)
> {
> 	struct timer_base *base;
> 	int b, v, count = 0;
>
> 	for (b = 0; b < NR_BASES; b++) {
> 		base = per_cpu_ptr(&timer_bases[b], cpu);
> 		raw_spin_lock_irq(&base->lock);
>
> 		for (v = 0; v < WHEEL_SIZE; v++) {
> 			struct hlist_node *c;
>
> 			hlist_for_each(c, base->vectors + v)
> 				count++;
> 		}
> 		raw_spin_unlock_irq(&base->lock);
> 	}
>
> 	return count;
> }

And that snippet in the change log is magically providing a unit test or what?

> diff --git a/init/Kconfig b/init/Kconfig
> index 72404c1f21577..fac49c6bb965a 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -682,6 +682,7 @@ config CPU_ISOLATION
>  	bool "CPU isolation"
>  	depends on SMP || COMPILE_TEST
>  	default y
> +	select HOTPLUG_CPU

Why?

> +#ifdef CONFIG_LOCKUP_DETECTOR

That ifdef is required because?

> +#include <linux/nmi.h>
> +#endif
> +
>  enum hk_flags {
>  	HK_FLAG_TIMER		= BIT(HK_TYPE_TIMER),
>  	HK_FLAG_RCU		= BIT(HK_TYPE_RCU),
> @@ -116,6 +120,19 @@ static void __init housekeeping_setup_type(enum hk_type type,
>  		     housekeeping_staging);
>  }
>  
> +static void resettle_all_timers(cpumask_var_t enable_mask, cpumask_var_t disable_mask)
> +{
> +	unsigned int cpu;
> +
> +	for_each_cpu(cpu, enable_mask)	{

Pointless bracket

> +		timers_prepare_cpu(cpu);

Seriously? You reset the timer base of an online CPU?

What's the point of this excercise? The timer base is initialized and in
consistent state. The timer base of an isolated CPU can have active
pinned timers on it.

So going there and resetting state without locking is definitely a
brilliant idea.

> +	for_each_cpu(cpu, disable_mask) {
> +		timers_resettle_from_cpu(cpu);
> +	}
> +}
> +
>  /*
>   * housekeeping_update - change housekeeping.cpumasks[type] and propagate the
>   * change.
> @@ -144,6 +161,16 @@ static int housekeeping_update(enum hk_type type, cpumask_var_t update)
>  	if (!static_branch_unlikely(&housekeeping_overridden))
>  		static_key_enable_cpuslocked(&housekeeping_overridden.key);
>  
> +	switch (type) {
> +	case HK_TYPE_TIMER:
> +		resettle_all_timers(&masks->enable, &masks->disable);
> +#ifdef CONFIG_LOCKUP_DETECTOR
> +		cpumask_copy(&watchdog_cpumask, housekeeping_cpumask(HK_TYPE_TIMER));
> +		lockup_detector_reconfigure();
> +#endif

What's wrong with adding a function

void lockup_detector_update_mask(const struct cpumask *msk);

and having an empty stub for it when CONFIG_LOCKUP_DETECTOR=n?

That spares all the ugly ifdeffery and the nmi.h include in one go.

> +		break;
> +	default:
> +	}
>  	kfree(masks);
>  
>  	return 0;
> diff --git a/kernel/time/timer.c b/kernel/time/timer.c
> index 48288dd4a102f..2d15c0e7b0550 100644
> --- a/kernel/time/timer.c
> +++ b/kernel/time/timer.c
> @@ -51,6 +51,7 @@
>  #include <asm/div64.h>
>  #include <asm/timex.h>
>  #include <asm/io.h>
> +#include <linux/sched/isolation.h>

What needs this include?

>  #include "tick-internal.h"
>  #include "timer_migration.h"
> @@ -2657,6 +2658,47 @@ int timers_prepare_cpu(unsigned int cpu)
>  	return 0;
>  }
>  
> +/**
> + * timers_resettle_from_cpu - resettles timers from
> + * specified cpu to housekeeping cpus.
> + */
> +void timers_resettle_from_cpu(unsigned int cpu)
> +{
> +	struct timer_base *old_base;
> +	struct timer_base *new_base;
> +	int b, i;
> +
> +	local_irq_disable();

What for?

> +	for (b = 0; b < NR_BASES; b++) {

You cannot blindly move pinned timers away from a CPU. That's the last
resort which is used in the CPU hotplug case, but the CPU is not going
out in the dynamic change case and the pinned timer might be there for a
reason.

> +		old_base = per_cpu_ptr(&timer_bases[b], cpu);
> +		new_base = per_cpu_ptr(&timer_bases[b],
> +				cpumask_any_and(cpu_active_mask,
> +					housekeeping_cpumask(HK_TYPE_TIMER)));

The cpumask needs to be reevaluted for every base because you blindly
copied the code from timers_dead_cpu(), right?

> +		/*
> +		 * The caller is globally serialized and nobody else
> +		 * takes two locks at once, deadlock is not possible.
> +		 */
> +		raw_spin_lock_irq(&new_base->lock);

Here you disable interrupts again and further down you enable them again
when dropping the lock. So on loop exit this does an imbalanced
local_irq_enable(), no? What's the point of this local_irq_dis/enable()
pair around the loop?

> +		raw_spin_lock_nested(&old_base->lock, SINGLE_DEPTH_NESTING);
> +
> +		/*
> +		 * The current CPUs base clock might be stale. Update it

What guarantees that this is the current CPUs timer base? Nothing...

> +		 * before moving the timers over.
> +		 */
> +		forward_timer_base(new_base);
> +
> +		WARN_ON_ONCE(old_base->running_timer);
> +		old_base->running_timer = NULL;
> +
> +		for (i = 0; i < WHEEL_SIZE; i++)
> +			migrate_timer_list(new_base, old_base->vectors + i);
> +
> +		raw_spin_unlock(&old_base->lock);
> +		raw_spin_unlock_irq(&new_base->lock);
> +	}
> +	local_irq_enable();
> +}

It's absolutely not rocket science to reuse timers_dead_cpu() for this.

The only requirement timers_dead_cpu() has is that the CPU to which the
timers are migrated is not going away. That's already given due to the
hotplug context.

The reason why it uses get_cpu_ptr(), which disables preemption and
therefore migration, is that it want's to avoid moving the timers to a
remote CPU as that has implications vs. NOHZ because it might end up
kicking the remote CPU out of idle.

timers_dead_cpu() could be simply modified to:

void timers_dead_cpu(unsigned int cpu)
{
    migrate_disable();
    timers_migrate_from_cpu(cpu, BASE_LOCAL);
    migrate_enable();
}

and have

#if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_ISOLATION)
static void timers_migrate_from_cpu(unsigned int cpu, unsigned int base)
{
        lockdep_assert(migration_disabled());

        for (; base < NR_BASES; base++) {
		old_base = per_cpu_ptr(&timer_bases[b], cpu);
		new_base = this_cpu_ptr(&timer_bases[b]);
        	....
	}
}
#endif

Now that isolation muck just has to do:

    1) Ensure that the CPU it runs on is a housekeeping CPU

    2) Invoke timers_migrate_to_hkcpu(cpu) which is in timer.c

       #ifdef CONFIG_ISOLATION
       void timers_migrate_to_hkcpu(unsigned int cpu)
       {
       		timers_migrate_from_cpu(cpu, BASE_GLOBAL);
       }
       #endif

No?

Thanks,

        tglx