Re: [RFC PATCH 8/8] irq: boost/unboost in irq/nmi entry/exit and softirq

Thomas Gleixner <tglx@xxxxxxxxxxxxx> · Fri, 15 Dec 2023 18:26:51 +0100

On Wed, Dec 13 2023 at 21:47, Vineeth Pillai (Google) wrote:
> The host proactively boosts the VCPU threads during irq/nmi injection.
> However, the host is unaware of posted interrupts, and therefore, the
> guest should request a boost if it has not already been boosted.
>
> Similarly, guest should request an unboost on irq/nmi/softirq exit if
> the vcpu doesn't need the boost any more.

That's giving a hint but no context for someone who is not familiar with
the problem which is tried to be solved here.

> @@ -327,6 +327,13 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
>  		.exit_rcu = false,
>  	};
>  
> +#ifdef CONFIG_PARAVIRT_SCHED
> +	instrumentation_begin();

Slapping instrumentation_begin() at it silences the objtool checker, but
that does not make it correct in any way.

You _cannot_ call random code _before_ the kernel has established
context. It's clearly documented:

  https://www.kernel.org/doc/html/latest/core-api/entry.html

No?

> +	if (pv_sched_enabled())
> +		pv_sched_boost_vcpu_lazy();
> +	instrumentation_end();
> +#endif
> +
>  	if (user_mode(regs)) {
>  		irqentry_enter_from_user_mode(regs);
>  		return ret;
> @@ -452,6 +459,18 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
>  		if (state.exit_rcu)
>  			ct_irq_exit();
>  	}
> +
> +#ifdef CONFIG_PARAVIRT_SCHED
> +	instrumentation_begin();

Broken too

> +	/*
> +	 * On irq exit, request a deboost from hypervisor if no softirq pending
> +	 * and current task is not RT and !need_resched.
> +	 */
> +	if (pv_sched_enabled() && !local_softirq_pending() &&
> +			!need_resched() && !task_is_realtime(current))
> +		pv_sched_unboost_vcpu();
> +	instrumentation_end();
> +#endif
>  }
>  
>  irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs)
> @@ -469,6 +488,11 @@ irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs)
>  	kmsan_unpoison_entry_regs(regs);
>  	trace_hardirqs_off_finish();
>  	ftrace_nmi_enter();
> +
> +#ifdef CONFIG_PARAVIRT_SCHED
> +	if (pv_sched_enabled())
> +		pv_sched_boost_vcpu_lazy();
> +#endif
>  	instrumentation_end();
>  
>  	return irq_state;
> @@ -482,6 +506,12 @@ void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t irq_state)
>  		trace_hardirqs_on_prepare();
>  		lockdep_hardirqs_on_prepare();
>  	}
> +
> +#ifdef CONFIG_PARAVIRT_SCHED
> +	if (pv_sched_enabled() && !in_hardirq() && !local_softirq_pending() &&
> +			!need_resched() && !task_is_realtime(current))
> +		pv_sched_unboost_vcpu();
> +#endif

Symmetry is overrated. Just pick a spot and slap your hackery in.

Aside of that this whole #ifdeffery is tasteless at best.

>  	instrumentation_end();

> +#ifdef CONFIG_PARAVIRT_SCHED
> +	if (pv_sched_enabled())
> +		pv_sched_boost_vcpu_lazy();
> +#endif

But what's worse is that this is just another approach of sprinkling
random hints all over the place and see what sticks.

Abusing KVM as the man in the middle to communicate between guest and
host scheduler is creative, but ill defined.

>From the host scheduler POV the vCPU is just a user space thread and
making the guest special is just the wrong approach.

The kernel has no proper general interface to convey information from a
user space task to the scheduler.

So instead of adding some ad-hoc KVM hackery the right thing is to solve
this problem from ground up in a generic way and KVM can just utilize
that instead of having the special snow-flake hackery which is just a
maintainability nightmare.

Thanks,

        tglx