On Tue, Jul 19, 2022 at 06:35:37PM +0100, Catalin Marinas wrote: > On Mon, Jun 20, 2022 at 01:41:58PM +0100, Mark Brown wrote: > > The documented syscall ABI specifies that the SVE state not shared with > > FPSIMD is undefined after a syscall. Currently we implement this by > > always flushing this register state to zero, ensuring consistent > > behaviour but introducing some overhead in the case where we can return > > directly to userspace without otherwise needing to update the register > > state. Take advantage of the flexibility offered by the documented ABI > > and instead leave the SVE registers untouched in the case where can > > return directly to userspace. > Do you have some rough numbers to quantify the gain? I suspect the > vector length doesn't matter much. I got some benchmarking done with fp-pidbench (which like I think I say somewhere in the series is a nonsense benchmark but still) which showed an IIRC approximately 1% or so win from a similar change that was fairly consistent over a few implementations. Unfortunately I don't yet have access to systems that would allow me to benchmark directly and nobody's got back with numbers for this specific code, the numbers were with some earlier proof of concept work. There is some vector length dependency when we move over 128 bits, at 128 bits there's no state in the Z registers that isn't shared with the V registers so we can and do already skip handling them entirely which makes the overhead of zeroing noticably lower, beyond that the cost should be stable. The additional cost for the Z registers was *about* the same as that of just doing the P registers IIRC, that does track with doing an additional 32 floating point operations. 128 bit systems are in general a bit of a special case for optimisation due to the reduced extra state. > Where does the zeroing happen now? IIRC it's only done on a subsequent > trap to SVE and that's a lot more expensive (unless the code has changed > since last time I looked). At the minute we drop SVE permission for userspace on syscall entry, the actual zeroing for the task happens when it next takes a SVE trap prior to reenabling SVE for EL0. > So if it's the actual subsequent trap that adds the overhead, maybe > zeroing the regs while leaving TIF_SVE on won't be that bad. Yeah, it's definitely *much* less exciting than the win from avoiding the SVE trap. > > The sysctl is disabled by default since it is anticipated that the risk > > of disruption to userspace is low. As well as being within the > > documented ABI this new behaviour mirrors the standard function call ABI > > for SVE in the AAPCS which should mean that compiler generated code is > > unlikely to rely on the current behaviour, the main risk is from hand > > coded assembly which directly invokes syscalls. The new behaviour is > > also what is currently implemented by qemu user mode emulation. > IIRC both Will and Mark R commented in the past that they'd like the > current de-facto ABI to become the official one. I'll let them comment. That would be good. I've not heard anything from Will either directly or indirectly. Mark R has indicated privately directly to me that he originally pushed for the currently implemented behaviour and prefers it. Marc Zyngier has previously noted publicly the current behaviour being a consideration in the context of discusion of optimisation ideas like this one, I was a bit surprised that he commented on an earlier patch in the series but not this one. You have previously pushed back on an attempt to document the current ABI, that was the main motivation for writing this patch. I don't have a particular opinion either way but I do feel strongly that we should either take advantage of the currently documented ABI or document our actual ABI, right now we've got the worst of both worlds so we should make a decision and pick one of those options. > > - if (test_thread_flag(TIF_SVE)) { > > + if (sve_syscall_regs_clear && test_thread_flag(TIF_SVE)) { > > unsigned int sve_vq_minus_one; > > sve_vq_minus_one = sve_vq_from_vl(task_get_sve_vl(current)) - 1; > If we leave TIF_SVE on, does it mean that we incur an overhead on > context switching? E.g. something like hackbench with lots of syscalls > communicating between threads would unnecessarily context switch the SVE It is true that in the context switch case if we zero on syscall entry then we end up zeroing and then discarding the SVE state but in the context of the context switch and a future SVE trap that should be tolerable, and if the task doesn't use SVE between switches in syscalls then it'll not have SVE enabled be impacted either way. > state. Maybe there's something handling this but IIUC fpsimd_save() > seems to only check TIF_SVE. As of patch 6 in this series fpsimd_save() checks both TIF_SVE and in_syscall() when using FP_STATE_TASK so we should only save the FPSIMD state if we're in a syscall, in that patch we remove the clearing of TIF_SVE on syscall and update fpsimd_save() to also check in_syscall() when deciding what to save. Prior to that we're keeping the behaviour the same so fpsimd_sve() is only checking TIF_SVE, it's these last two patches that intend to introduce behaviour changes.
Attachment:
signature.asc
Description: PGP signature
_______________________________________________ kvmarm mailing list kvmarm@xxxxxxxxxxxxxxxxxxxxx https://lists.cs.columbia.edu/mailman/listinfo/kvmarm