Re: [PATCH V3 04/10] x86/pks: Preserve the PKRS MSR on context switch

Thomas Gleixner <tglx@xxxxxxxxxxxxx> · Thu, 17 Dec 2020 23:43:16 +0100

On Thu, Dec 17 2020 at 15:50, Thomas Gleixner wrote:
> On Fri, Nov 06 2020 at 15:29, ira weiny wrote:
>
>> +void write_pkrs(u32 new_pkrs)
>> +{
>> +	u32 *pkrs;
>> +
>> +	if (!static_cpu_has(X86_FEATURE_PKS))
>> +		return;
>> +
>> +	pkrs = get_cpu_ptr(&pkrs_cache);
>
> So this is called from various places including schedule and also from
> the low level entry/exit code. Why do we need to have an extra
> preempt_disable/enable() there via get/put_cpu_ptr()?
>
> Just because performance in those code paths does not matter?
>
>> +	if (*pkrs != new_pkrs) {
>> +		*pkrs = new_pkrs;
>> +		wrmsrl(MSR_IA32_PKRS, new_pkrs);
>> +	}
>> +	put_cpu_ptr(pkrs);

Which made me look at the other branch of your git repo just because I
wanted to know about the 'other' storage requirements and I found this
gem:

> update_global_pkrs()
> ...
>	/*
>	 * If we are preventing access from the old value.  Force the
>	 * update on all running threads.
>	 */
>	if (((old_val == 0) && protection) ||
>	    ((old_val & PKR_WD_BIT) && (protection & PKEY_DISABLE_ACCESS))) {
>		int cpu;
>
>		for_each_online_cpu(cpu) {
>			u32 *ptr = per_cpu_ptr(&pkrs_cache, cpu);
>
>			*ptr = update_pkey_val(*ptr, pkey, protection);
>			wrmsrl_on_cpu(cpu, MSR_IA32_PKRS, *ptr);
>			put_cpu_ptr(ptr);

1) per_cpu_ptr() -> put_cpu_ptr() is broken as per_cpu_ptr() is not
   disabling preemption while put_cpu_ptr() enables it which wreckages
   the preemption count. 

   How was that ever tested at all with any debug option enabled?

   Answer: Not at all

2) How is that sequence:

	ptr = per_cpu_ptr(&pkrs_cache, cpu);
	*ptr = update_pkey_val(*ptr, pkey, protection);
	wrmsrl_on_cpu(cpu, MSR_IA32_PKRS, *ptr);

   supposed to be correct vs. a concurrent modification of the
   pkrs_cache of the remote CPU?

   Answer: Not at all

Also doing a wrmsrl_on_cpu() on _each_ online CPU is insane at best.

  A smp function call on a remote CPU takes ~3-5us when the remote CPU
  is not idle and can immediately respond. If the remote CPU is deep in
  idle it can take up to 100us depending on C-State it is in.

  Even if the remote CPU is not not idle and just has interrupts
  disabled for a few dozen of microseconds this adds up.

  So on a 256 CPU system depending on the state of the remote CPUs this
  stalls the CPU doing the update for anything between 1 and 25ms worst
  case.

  Of course that also violates _all_ CPU isolation mechanisms.

  What for?

  Just for the theoretical chance that _all_ remote CPUs have
  seen that global permission and have it still active?

  You're not serious about that, right?

The only use case for this in your tree is: kmap() and the possible
usage of that mapping outside of the thread context which sets it up.

The only hint for doing this at all is:

    Some users, such as kmap(), sometimes requires PKS to be global.

'sometime requires' is really _not_ a technical explanation.

Where is the explanation why kmap() usage 'sometimes' requires this
global trainwreck in the first place and where is the analysis why this
can't be solved differently?

Detailed use case analysis please.

Thanks,

        tglx