On Thu, May 24, 2018 at 12:06:59PM +0200, Christoffer Dall wrote:
> On Thu, May 24, 2018 at 10:50:56AM +0100, Dave Martin wrote:
> > On Thu, May 24, 2018 at 10:33:50AM +0200, Christoffer Dall wrote:

[...]

> > > ...with a risk of being a bit over-pedantic and annoying, may I suggest the following complete commit text:
> > >
> > > ------8<------
> > > Currently the FPSIMD handling code uses the condition task->mm == NULL as a hint that the task has no FPSIMD register context.
> > >
> > > The ->mm check is only there to filter out tasks that cannot possibly have FPSIMD context loaded, for optimisation purposes. However, TIF_FOREIGN_FPSTATE must always be checked anyway before saving FPSIMD context back to memory. For this reason, the ->mm checks are not useful, providing that TIF_FOREIGN_FPSTATE is maintained properly for kernel threads.
> > >
> > > FPSIMD context is never preserved for kernel threads across a context switch and therefore TIF_FOREIGN_FPSTATE should always be true for
> >
> > (This refactoring opens up the interesting possibility of making kernel-mode NEON in task context preemptible for kernel threads, so that we actually do preserve state... but that's a discussion for another day. There may be code around that relies on kernel_neon_begin() disabling preemption for real.)
> >
> > > kernel threads. This is indeed the case, as the wrong_task and
> >
> > This suggests that TIF_FOREIGN_FPSTATE is always true for kernel threads today. This is not quite true, because use_mm() can make mm non-NULL.

I was suggesting that it's always true after this patch.

I tend to read the present tense as describing the situation before the patch, but this convention isn't followed universally.

This was part of the problem with my "true by construction" weasel words: the described property wasn't true by construction prior to the patch, and there wasn't sufficient explanation to convince people it's true afterwards. If people are being rigorous, it takes a _lot_ of explanation...

> > > wrong_cpu tests in fpsimd_thread_switch() will always yield false for kernel threads.
> >
> > ("false" -> "true". My bad.)
> >
> > > Further, the context switch logic is already deliberately optimised to defer reloads of the FPSIMD context until ret_to_user (or sigreturn as a special case), which kernel threads by definition never reach, and therefore this change introduces no additional work in the critical path.
> > >
> > > This patch removes the redundant checks and special-case code.
> > > ------8<------
> >
> > Looking at my existing text, I have reworded it like this. Does this work any better for you?
> >
> > --8<--
> >
> > Currently the FPSIMD handling code uses the condition task->mm == NULL as a hint that the task has no FPSIMD register context.
> >
> > The ->mm check is only there to filter out tasks that cannot possibly have FPSIMD context loaded, for optimisation purposes. Also, TIF_FOREIGN_FPSTATE must always be checked anyway before saving FPSIMD context back to memory. For these reasons, the ->mm checks are not useful, providing that TIF_FOREIGN_FPSTATE is maintained in a consistent way for kernel threads.
>
> Consistent with what? Without more context or explanation,

Consistent with the handling of user threads (though I admit it's not explicit in the text).

> I'm not sure what the reader is to make of that.
> Do you not mean the TIF_FOREIGN_FPSTATE is always true for kernel threads?

Again, this is probably a red herring. TIF_FOREIGN_FPSTATE is always true for kernel threads prior to the patch, except (randomly) for the init task. This change is not really about TIF_FOREIGN_FPSTATE at all, rather that there is nothing to justify handling kernel threads differently, or even distinguishing kernel threads from user threads at all in this code.

Part of the confusion (and I had confused myself) comes from the fact that TIF_FOREIGN_FPSTATE is really a per-cpu property and doesn't make sense as a per-task property -- i.e., the flag is meaningless for scheduled-out tasks and we must explicitly "repair" it when scheduling a task in anyway. I think it's a thread flag primarily so that it's convenient to check alongside other thread flags in the ret_to_user work loop. This is somewhat less of a justification now that the loop has been ported to C.

> > The context switch logic is already deliberately optimised to defer reloads of the regs until ret_to_user (or sigreturn as a special case), and save them only if they have been previously loaded.

Does it help to insert the following here?

"These paths are the only places where the wrong_task and wrong_cpu conditions can be made false, by calling fpsimd_bind_task_to_cpu()."

> > Kernel threads by definition never reach these paths. As a result,
>
> I'm struggling with the "As a result," here. Is this because reloads of regs in ret_to_user (or sigreturn) are the only places that can make wrong_cpu or wrong_task be false?

See the proposed clarification above. Is that sufficient?

> (I'm actually wanting to understand this, not just bikeshedding the commit message, as new corner cases keep coming up on this logic.)

That's a good thing, and I would really like to explain it in a concise manner. See [*] below for the "concise" explanation -- it may demonstrate why I've been evasive...

> > the wrong_task and wrong_cpu tests in fpsimd_thread_switch() will always yield true for kernel threads.
> >
> > This patch removes the redundant checks and special-case code, ensuring that TIF_FOREIGN_FPSTATE is set whenever a kernel thread is scheduled in, and ensures that this flag is set for the init
> > task. The fpsimd_flush_task_state() call already present in copy_thread() ensures the same for any new task.
>
> nit: funny formatting

Dang, I was repeatedly pasting between Mutt and git commit terminals, which doesn't always work as I'd like...

> nit: ensuring that TIF_FOREIGN_FPSTATE *remains* set whenever a kernel thread is scheduled in?

Er, yes.

> > With TIF_FOREIGN_FPSTATE always set for kernel threads, this patch ensures that no extra context save work is added for kernel threads, and eliminates the redundant context saving that may currently occur for kernel threads that have acquired an mm via use_mm().
> >
> > -->8--
>
> If you can slightly connect the dots with the "As a result" above, I'm fine with your version of the text.

As an aside, the big wall of text before the definition of struct fpsimd_last_state_struct is looking out of date and could use an update to cover at least some of what is explained in [*] better. I'm currently considering that out of scope for this series, but I will keep it in mind to refresh it in the not too distant future.

Cheers
---Dave

--8<--

[*] The bigger picture:

* Consider a relation (C,T) between cpus C and tasks T, such that (C,T) means "T's FPSIMD regs are loaded on cpu C".
  At a given point of execution of some cpu C, there is at most one task T for which (C,T) holds. At a given point of execution of some task T, there is at most one cpu C for which (C,T) holds.

* (C,T) becomes true whenever T's registers are loaded into cpu C.

* At sched-out, we must ensure that the registers of current are loaded before writing them to current's thread_struct. Thus, we must save the registers if and only if (smp_processor_id(), current) holds at this time.

* Before entering userspace, we must ensure that current's regs are loaded, and we must only load the regs if they are not loaded already (since if so, they might have been dirtied by current in userspace since last loaded). Thus, when entering userspace, we must load the regs from memory if and only if (smp_processor_id(), current) does not hold.

* Checking this relation involves per-CPU access and inspection of current->thread, and was presumably considered too cumbersome for implementation in entry.S, particularly in the ret_to_user work pending loop (which is where the FPSIMD regs are finally loaded before entering userspace, if they weren't loaded already). To mitigate this, the status of the check is cached in a thread flag, TIF_FOREIGN_FPSTATE: with softirqs disabled, (smp_processor_id(), current) holds if and only if TIF_FOREIGN_FPSTATE is false. TIF_FOREIGN_FPSTATE is corrected on sched-in by the code in fpsimd_thread_switch().

  [2] Anything that changes the state of the relation for current requires its TIF_FOREIGN_FPSTATE to be changed to match.

* (smp_processor_id(), current) is established in fpsimd_bind_task_to_cpu(). This is the only way the relation can be made to hold between a task and a CPU.

* (C,T) is broken whenever:

  [1] T is created;

  * T's regs are loaded onto a different cpu C2, so (C2,T) becomes true and (C,T) necessarily becomes false;

  * another task's regs are loaded into C, so (C,T2) becomes true and (C,T) necessarily becomes false;

  * the kernel clobbers the regs on C for its own purposes, so (C,T) becomes false but there is no T2 for which (C,T2) becomes true as a result. Examples are kernel-mode NEON and loading the regs for a KVM vcpu;

  * T's register context changes via a thread_struct update instead of running instructions in userspace, requiring the contents of the hardware regs to be thrown away. Examples are exec() (which requires the registers to be zeroed), sigreturn (which populates the regs from the user signal frame) and modification of the registers via PTRACE_SETREGSET.

    As a (probably unnecessary) optimisation, sigreturn immediately loads the registers and reestablishes (smp_processor_id(), current) in anticipation of the return to userspace which is likely to occur soon. This allows the relation-breaking logic to be omitted in fpsimd_update_current_state(), which does the work.

* In general, these relation breakings involve an unknown: knowing either C or T but *not* both, we want to break (C,T). If the relation were recorded in task_struct only, we would need to scan all tasks in the "T unknown" case. If the relation were recorded in a percpu variable only, we would need to scan all CPUs in the "C unknown" case. As well as having gnarly synchronisation requirements, these would get expensive in many-tasks or many-cpus situations.

  This is why the relation is recorded in both places, and is only deemed to hold if the two records match up. This is what fpsimd_thread_switch() is checking for the task being scheduled in.
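  For illustration, the check amounts to something like the following sketch (not a verbatim copy of the kernel sources; it just reuses the names that appear elsewhere in this mail -- fpsimd_last_state, thread.fpsimd_cpu, thread.uw.fpsimd_state):

	/* (smp_processor_id(), next) holds only if both records agree: */
	wrong_task = __this_cpu_read(fpsimd_last_state.st) !=
			&next->thread.uw.fpsimd_state;
	wrong_cpu = next->thread.fpsimd_cpu != smp_processor_id();

	if (wrong_task || wrong_cpu)
		set_tsk_thread_flag(next, TIF_FOREIGN_FPSTATE);
	else
		clear_tsk_thread_flag(next, TIF_FOREIGN_FPSTATE);

  If either record disagrees, (C,next) is deemed not to hold and TIF_FOREIGN_FPSTATE is set for the incoming task.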
  The invalidation (breaking) operations are now factored as follows (a rough sketch of both helpers is included at the end of this note):

  fpsimd_flush_task_state(): falsify (C,current) for every cpu C. This is done by zapping current->thread.fpsimd_cpu with NR_CPUS (chosen because it cannot match smp_processor_id()).

  fpsimd_flush_cpu_state(): falsify (smp_processor_id(),T) for every task T. This is done by zapping this_cpu(fpsimd_last_state.st) with NULL (chosen because it cannot match &T->thread.uw.fpsimd_state for any task).

  By [2] above, it is necessary to ensure that TIF_FOREIGN_FPSTATE is set after calling either of the above functions. Of the two, fpsimd_flush_cpu_state() now does this implicitly, but fpsimd_flush_task_state() does not; the caller must do it instead. I have a vague memory of some refactoring obstacle that dissuaded me from pulling the set_thread_flag in, but I can't remember it now. I may review this later.

* Because the (C,T) relation may need to be manipulated by kernel_neon_{begin,end}() in softirq context, examining or manipulating it for current or the running CPU must be done under local_bh_disable(). The same goes for TIF_FOREIGN_FPSTATE, which is supposed to represent the same condition but may spontaneously become stale if softirqs are not masked. (The rule is not quite as strict as this, but in order to make the code easier to reason about, I skip the local_bh_disable() only where absolutely necessary -- restore_sve_fpsimd_context() is the only example today.)

Now, imagine that T is a kernel thread, and consider what needs to be done differently. The observation of this patch is that nothing needs to be done differently at all.

There is a single anomaly relating to [1] above, in the form of a task that can run without ever being scheduled in: the init task. Beyond that, kernel_neon_begin() before the first reschedule would spuriously save the FPSIMD regs into the init_task's thread struct, even though it is pointless to do so.

This patch fixes those anomalies by updating INIT_THREAD and INIT_THREAD_INFO to set up the init task so that it looks the same as some other kernel thread that has been scheduled in.

There is a strong design motivation to avoid unnecessary loads and saves of the state, so if removing the special-casing of kernel threads were to add cost it would imply that the code were _already_ suboptimal for user tasks. This patch does not attempt to address that at all, but by assuming that the code is already well-optimised, "unnecessary" save/restore work will not be added. If this were not the case, it could in any case be fixed independently.

The observation of this _series_ is that we don't need to do very much in order to be able to generalise the logic to accept KVM vcpus in place of T.
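Finally, here is the promised minimal sketch of the two invalidation helpers described above. Again, this is not a verbatim copy of the kernel implementation; it assumes the per-cpu fpsimd_last_state record and the thread.fpsimd_cpu field discussed in this mail:

	/*
	 * Falsify (C,t) for every cpu C: t->thread.fpsimd_cpu can never
	 * equal smp_processor_id() again until the regs are reloaded.
	 */
	void fpsimd_flush_task_state(struct task_struct *t)
	{
		t->thread.fpsimd_cpu = NR_CPUS;
	}

	/*
	 * Falsify (smp_processor_id(), T) for every task T: the per-cpu
	 * record can never match &T->thread.uw.fpsimd_state for any task.
	 * Setting TIF_FOREIGN_FPSTATE keeps the cached flag in step with
	 * the now-broken relation.
	 */
	void fpsimd_flush_cpu_state(void)
	{
		__this_cpu_write(fpsimd_last_state.st, NULL);
		set_thread_flag(TIF_FOREIGN_FPSTATE);
	}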