On Thu, May 24, 2018 at 12:06:59PM +0200, Christoffer Dall wrote:
> On Thu, May 24, 2018 at 10:50:56AM +0100, Dave Martin wrote:
> > On Thu, May 24, 2018 at 10:33:50AM +0200, Christoffer Dall wrote:

[...]

> > > ...with a risk of being a bit over-pedantic and annoying, may I suggest the following complete commit text:
> > >
> > > ------8<------
> > > Currently the FPSIMD handling code uses the condition task->mm == NULL as a hint that the task has no FPSIMD register context.
> > >
> > > The ->mm check is only there to filter out tasks that cannot possibly have FPSIMD context loaded, for optimisation purposes. However, TIF_FOREIGN_FPSTATE must always be checked anyway before saving FPSIMD context back to memory. For this reason, the ->mm checks are not useful, providing that TIF_FOREIGN_FPSTATE is maintained properly for kernel threads.
> > >
> > > FPSIMD context is never preserved for kernel threads across a context switch and therefore TIF_FOREIGN_FPSTATE should always be true for
> >
> > (This refactoring opens up the interesting possibility of making kernel-mode NEON in task context preemptible for kernel threads, so that we actually do preserve state... but that's a discussion for another day. There may be code around that relies on kernel_neon_begin() disabling preemption for real.)
> >
> > > kernel threads. This is indeed the case, as the wrong_task and
> >
> > This suggests that TIF_FOREIGN_FPSTATE is always true for kernel threads today. This is not quite true, because use_mm() can make mm non-NULL.

I was suggesting that it's always true after this patch.

I tend to read the present tense as describing the situation before the patch, but this convention isn't followed universally.

This was part of the problem with my "true by construction" weasel words: the described property wasn't true by construction prior to the patch, and there wasn't sufficient explanation to convince people it's true afterwards. If people are being rigorous, it takes a _lot_ of explanation...

> > > wrong_cpu tests in fpsimd_thread_switch() will always yield false for kernel threads.
> >
> > ("false" -> "true". My bad.)
> >
> > > Further, the context switch logic is already deliberately optimised to defer reloads of the FPSIMD context until ret_to_user (or sigreturn as a special case), which kernel threads by definition never reach, and therefore this change introduces no additional work in the critical path.
> > >
> > > This patch removes the redundant checks and special-case code.
> > > ------8<------
> >
> > Looking at my existing text, I have reworded it like this. Does this work any better for you?
> >
> > --8<--
> >
> > Currently the FPSIMD handling code uses the condition task->mm == NULL as a hint that the task has no FPSIMD register context.
> >
> > The ->mm check is only there to filter out tasks that cannot possibly have FPSIMD context loaded, for optimisation purposes. Also, TIF_FOREIGN_FPSTATE must always be checked anyway before saving FPSIMD context back to memory. For these reasons, the ->mm checks are not useful, providing that TIF_FOREIGN_FPSTATE is maintained in a consistent way for kernel threads.
>
> Consistent with what? Without more context or explanation,

Consistent with the handling of user threads (though I admit it's not explicit in the text).

> I'm not sure what the reader is to make of that.
> Do you not mean the TIF_FOREIGN_FPSTATE is always true for kernel threads?

Again, this is probably a red herring. TIF_FOREIGN_FPSTATE is always true for kernel threads prior to the patch, except (randomly) for the init task. This change is not really about TIF_FOREIGN_FPSTATE at all, rather that there is nothing to justify handling kernel threads differently, or even distinguishing kernel threads from user threads at all in this code.

Part of the confusion (and I had confused myself) comes from the fact that TIF_FOREIGN_FPSTATE is really a per-cpu property and doesn't make sense as a per-task property -- i.e., the flag is meaningless for scheduled-out tasks and we must explicitly "repair" it when scheduling a task in anyway. I think it's a thread flag primarily so that it's convenient to check alongside other thread flags in the ret_to_user work loop. This is somewhat less of a justification now that the loop has been ported to C.

> > The context switch logic is already deliberately optimised to defer reloads of the regs until ret_to_user (or sigreturn as a special case), and save them only if they have been previously loaded.

Does it help to insert the following here?

"These paths are the only places where the wrong_task and wrong_cpu conditions can be made false, by calling fpsimd_bind_task_to_cpu()."

> > Kernel threads by definition never reach these paths. As a result,
>
> I'm struggling with the "As a result," here. Is this because reloads of regs in ret_to_user (or sigreturn) are the only places that can make wrong_cpu or wrong_task be false?

See the proposed clarification above. Is that sufficient?

> (I'm actually wanting to understand this, not just bikeshedding the commit message, as new corner cases keep coming up on this logic.)

That's a good thing, and I would really like to explain it in a concise manner. See [*] below for the "concise" explanation -- it may demonstrate why I've been evasive...

> > the wrong_task and wrong_cpu tests in fpsimd_thread_switch() will always yield true for kernel threads.
> >
> > This patch removes the redundant checks and special-case code, ensuring that TIF_FOREIGN_FPSTATE is set whenever a kernel thread is scheduled in, and ensures that this flag is set for the init
> > task. The fpsimd_flush_task_state() call already present in copy_thread() ensures the same for any new task.
>
> nit: funny formatting

Dang, I was repeatedly pasting between Mutt and git commit terminals, which doesn't always work as I'd like...

> nit: ensuring that TIF_FOREIGN_FPSTATE *remains* set whenever a kernel thread is scheduled in?

Er, yes.

> > With TIF_FOREIGN_FPSTATE always set for kernel threads, this patch ensures that no extra context save work is added for kernel threads, and eliminates the redundant context saving that may currently occur for kernel threads that have acquired an mm via use_mm().
> >
> > -->8--
>
> If you can slightly connect the dots with the "As a result" above, I'm fine with your version of the text.

As an aside, the big wall of text before the definition of struct fpsimd_last_state_struct is looking out of date and could use an update to cover at least some of what is explained in [*] better. I'm currently considering that out of scope for this series, but I will keep it in mind to refresh it in the not too distant future.

Cheers
---Dave

--8<--

[*] The bigger picture:

* Consider a relation (C,T) between cpus C and tasks T, such that (C,T) means "T's FPSIMD regs are loaded on cpu C".
  At a given point of execution of some cpu C, there is at most one task T for which (C,T) holds. At a given point of execution of some task T, there is at most one cpu C for which (C,T) holds.

* (C,T) becomes true whenever T's registers are loaded into cpu C.

* At sched-out, we must ensure that the registers of current are loaded before writing them to current's thread_struct. Thus, we must save the registers if and only if (smp_processor_id(), current) holds at this time.

* Before entering userspace, we must ensure that current's regs are loaded, and we must only load the regs if they are not loaded already (since if so, they might have been dirtied by current in userspace since last loaded). Thus, when entering userspace, we must load the regs from memory if and only if (smp_processor_id(), current) does not hold.

* Checking this relation involves per-CPU access and inspection of current->thread, and was presumably considered too cumbersome for implementation in entry.S, particularly in the ret_to_user work pending loop (which is where the FPSIMD regs are finally loaded before entering userspace, if they weren't loaded already). To mitigate this, the status of the check is cached in a thread flag, TIF_FOREIGN_FPSTATE: with softirqs disabled, (smp_processor_id(), current) holds if and only if TIF_FOREIGN_FPSTATE is false. TIF_FOREIGN_FPSTATE is corrected on sched-in by the code in fpsimd_thread_switch().

  [2] Anything that changes the state of the relation for current requires its TIF_FOREIGN_FPSTATE to be changed to match.

* (smp_processor_id(), current) is established in fpsimd_bind_task_to_cpu(). This is the only way the relation can be made to hold between a task and a CPU.

* (C,T) is broken whenever:

  [1] T is created;

  * T's regs are loaded onto a different cpu C2, so (C2,T) becomes true and (C,T) necessarily becomes false;

  * another task's regs are loaded into C, so (C,T2) becomes true and (C,T) necessarily becomes false;

  * the kernel clobbers the regs on C for its own purposes, so (C,T) becomes false but there is no T2 for which (C,T2) becomes true as a result. Examples are kernel-mode NEON and loading the regs for a KVM vcpu;

  * T's register context changes via a thread_struct update instead of running instructions in userspace, requiring the contents of the hardware regs to be thrown away. Examples are exec() (which requires the registers to be zeroed), sigreturn (which populates the regs from the user signal frame) and modification of the registers via PTRACE_SETREGSET.

    As a (probably unnecessary) optimisation, sigreturn immediately loads the registers and reestablishes (smp_processor_id(), current) in anticipation of the return to userspace which is likely to occur soon. This allows the relation-breaking logic to be omitted in fpsimd_update_current_state(), which does the work.

* In general, these relation breakings involve an unknown: knowing either C or T but *not* both, we want to break (C,T). If the relation were recorded in task_struct only, we would need to scan all tasks in the "T unknown" case. If the relation were recorded in a percpu variable only, we would need to scan all CPUs in the "C unknown" case. As well as having gnarly synchronisation requirements, these would get expensive in many-tasks or many-cpus situations.

  This is why the relation is recorded in both places, and is only deemed to hold if the two records match up. This is what fpsimd_thread_switch() is checking for the task being scheduled in.
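  For illustration, the check amounts to something like the following sketch (not a verbatim copy of the kernel sources; it just reuses the names that appear elsewhere in this mail -- fpsimd_last_state, thread.fpsimd_cpu, thread.uw.fpsimd_state):

	/* (smp_processor_id(), next) holds only if both records agree: */
	wrong_task = __this_cpu_read(fpsimd_last_state.st) !=
			&next->thread.uw.fpsimd_state;
	wrong_cpu = next->thread.fpsimd_cpu != smp_processor_id();

	if (wrong_task || wrong_cpu)
		set_tsk_thread_flag(next, TIF_FOREIGN_FPSTATE);
	else
		clear_tsk_thread_flag(next, TIF_FOREIGN_FPSTATE);

  If either record disagrees, (C,next) is deemed not to hold and TIF_FOREIGN_FPSTATE is set for the incoming task.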
  The invalidation (breaking) operations are now factored as follows (a rough sketch of both helpers is included at the end of this note):

  fpsimd_flush_task_state(): falsify (C,current) for every cpu C. This is done by zapping current->thread.fpsimd_cpu with NR_CPUS (chosen because it cannot match smp_processor_id()).

  fpsimd_flush_cpu_state(): falsify (smp_processor_id(),T) for every task T. This is done by zapping this_cpu(fpsimd_last_state.st) with NULL (chosen because it cannot match &T->thread.uw.fpsimd_state for any task).

  By [2] above, it is necessary to ensure that TIF_FOREIGN_FPSTATE is set after calling either of the above functions. Of the two, fpsimd_flush_cpu_state() now does this implicitly, but fpsimd_flush_task_state() does not; the caller must do it instead. I have a vague memory of some refactoring obstacle that dissuaded me from pulling the set_thread_flag in, but I can't remember it now. I may review this later.

* Because the (C,T) relation may need to be manipulated by kernel_neon_{begin,end}() in softirq context, examining or manipulating it for current or the running CPU must be done under local_bh_disable(). The same goes for TIF_FOREIGN_FPSTATE, which is supposed to represent the same condition but may spontaneously become stale if softirqs are not masked. (The rule is not quite as strict as this, but in order to make the code easier to reason about, I skip the local_bh_disable() only where absolutely necessary -- restore_sve_fpsimd_context() is the only example today.)

Now, imagine that T is a kernel thread, and consider what needs to be done differently. The observation of this patch is that nothing needs to be done differently at all.

There is a single anomaly relating to [1] above, in the form of a task that can run without ever being scheduled in: the init task. Beyond that, kernel_neon_begin() before the first reschedule would spuriously save the FPSIMD regs into the init_task's thread struct, even though it is pointless to do so.

This patch fixes those anomalies by updating INIT_THREAD and INIT_THREAD_INFO to set up the init task so that it looks the same as some other kernel thread that has been scheduled in.

There is a strong design motivation to avoid unnecessary loads and saves of the state, so if removing the special-casing of kernel threads were to add cost it would imply that the code were _already_ suboptimal for user tasks. This patch does not attempt to address that at all, but by assuming that the code is already well-optimised, "unnecessary" save/restore work will not be added. If this were not the case, it could in any case be fixed independently.

The observation of this _series_ is that we don't need to do very much in order to be able to generalise the logic to accept KVM vcpus in place of T.
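Finally, here is the promised minimal sketch of the two invalidation helpers described above. Again, this is not a verbatim copy of the kernel implementation; it assumes the per-cpu fpsimd_last_state record and the thread.fpsimd_cpu field discussed in this mail:

	/*
	 * Falsify (C,t) for every cpu C: t->thread.fpsimd_cpu can never
	 * equal smp_processor_id() again until the regs are reloaded.
	 */
	void fpsimd_flush_task_state(struct task_struct *t)
	{
		t->thread.fpsimd_cpu = NR_CPUS;
	}

	/*
	 * Falsify (smp_processor_id(), T) for every task T: the per-cpu
	 * record can never match &T->thread.uw.fpsimd_state for any task.
	 * Setting TIF_FOREIGN_FPSTATE keeps the cached flag in step with
	 * the now-broken relation.
	 */
	void fpsimd_flush_cpu_state(void)
	{
		__this_cpu_write(fpsimd_last_state.st, NULL);
		set_thread_flag(TIF_FOREIGN_FPSTATE);
	}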