On Wed, Jan 24, 2024 at 01:32:22PM +0000, Marc Zyngier wrote:
> > So, I'm fine if the answer is that VMM's using VFIO need to use
> > KVM_CAP_ARM_NISV_TO_USER and do instruction parsing for emulated IO in
> > userspace if they have a design where VFIO MMIO can infrequently
> > generate faults. That is all VMM design stuff and has nothing to do
> > with the kernel.
>
> Which will work a treat with things like CCA, I'm sure.

CCA shouldn't have emulation or trapping on the MMIO mappings.

> > > Or you can stop whining and try to get better performance out of what
> > > we have today.
> >
> > "better performance"!?!? You are telling me I have to destroy one of
> > our important fast paths for HPC workloads to accommodate some
> > theoretical ARM KVM problem?
>
> What I'm saying is that there are way to make it better without
> breaking your particular toy workload which, as important as it may be
> to *you*, doesn't cover everybody's use case.

Please, do we need the "toy" stuff? The industry is spending tens of
billions of dollars right now to run "my workload". It is currently not
widely run on ARM servers, but we are all hoping ARM can succeed here,
right?

I still don't know what you mean by "better". There are several issues
now:

 1) This series, where WC doesn't trigger on new cores. Maybe 8x STR
    will fix it, but performance-wise it is no better than 4x STP.

 2) Userspace does ST4 to MMIO memory, and the VMM can't explode
    because of this. Replacing the ST4 with 8x STR is NOT better; that
    would be a big performance downside, especially for the quirky
    hi-silicon hardware.

 3) The other series changing the S2 so that WC can work in the VM.

> Mark did post such an example that has the potential of having that
> improvement. I'd suggest that you give it a go.

Mark's patch doesn't help this; I already wrote and evaluated his patch
last week. Unfortunately this needs to be done with explicit inline
assembly, as either STP or STR blocks. I don't know if the 8x STR is
workable or not.
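For reference, a minimal sketch of what such an explicit inline-assembly
block could look like for a 64-byte store. The function name and shape are
illustrative, not the actual mlx5 or __iowrite64_copy() code; the non-arm64
fallback exists only so the sketch is self-contained:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: emit a 64-byte doorbell write as 4x STP so the
 * compiler cannot split it into smaller stores that defeat
 * write-combining.  Off arm64, a plain copy stands in. */
static void wc_store_64b(volatile void *dst, const uint64_t *src)
{
#ifdef __aarch64__
	asm volatile(
		"stp %x[v0], %x[v1], [%[p]]\n\t"
		"stp %x[v2], %x[v3], [%[p], #16]\n\t"
		"stp %x[v4], %x[v5], [%[p], #32]\n\t"
		"stp %x[v6], %x[v7], [%[p], #48]\n\t"
		:
		: [p] "r"(dst),
		  [v0] "r"(src[0]), [v1] "r"(src[1]),
		  [v2] "r"(src[2]), [v3] "r"(src[3]),
		  [v4] "r"(src[4]), [v5] "r"(src[5]),
		  [v6] "r"(src[6]), [v7] "r"(src[7])
		: "memory");
#else
	/* Portable stand-in for non-arm64 builds of this sketch. */
	memcpy((void *)dst, src, 64);
#endif
}
```

The 8x STR variant would replace each STP with two plain `str`
instructions; whether new cores still merge that into a single 64-byte WC
transaction is exactly the open question.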
I need to get someone to test the 8x STR variant, but even if it is
workable the userspace IO for this HW will continue to use ST4.

So, regardless of the kernel decision, if someone is going to put this
HW into a VM then their VMM needs to do *something* to ensure that the
VMM does not malfunction when the VM issues STP/ST4 to the VFIO MMIO.
There are good choices for the VMM here: ensure it never has to process
a S2 VFIO MMIO fault, always resume and never emulate VFIO MMIO, or
correctly handle an emulated S2 fault from a STP/ST4 instruction via
instruction parsing.

Therefore we can assume that working VMMs will exist. Indeed, I would go
further and say that mlx5 HW in a VM must have a working VMM.

So the question is only how pessimistic the arch code for
__iowrite64_copy() should be. My view is that it is only used in a small
number of drivers, and if a VMM creates vPCI devices for those drivers
then the VMM should be expected to bring proper vMMIO support too. I do
not like the notion that all drivers using __iowrite64_copy() should
have sub-optimal bare metal performance because a VMM *might* exist that
has a problem.

> But your attitude of "who cares if it breaks as long as it works for
> me" is not something I can adhere to.

In my world failing to reach performance is a "break" as well. So you
have a server that is "broken" because its performance is degraded,
versus an unknown VMM that is "broken" because it wants to emulate IO
(without implementing instruction parsing) for a device with a driver
that uses __iowrite64_copy().

My server really does exist. I'm not so sure about the other case.

Jason
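P.S.: for anyone weighing the "instruction parsing" option above, a
minimal sketch of the userspace side. The struct, constant value, and
names here are illustrative stand-ins, not the real UAPI; with
KVM_CAP_ARM_NISV_TO_USER enabled, the actual layout (KVM_EXIT_ARM_NISV
and kvm_run's arm_nisv fields) comes from <linux/kvm.h>:

```c
#include <stdint.h>

/* Illustrative stand-ins for the KVM UAPI; the real definitions and
 * the actual exit-reason value live in <linux/kvm.h>. */
#define EXIT_ARM_NISV 28 /* placeholder for KVM_EXIT_ARM_NISV */

struct arm_nisv_info {
	uint64_t esr_iss;   /* syndrome bits KVM could not fully decode */
	uint64_t fault_ipa; /* guest-physical address of the access */
};

/* A non-syndrome-decodable stage-2 fault (e.g. STP or ST4 hitting
 * emulated MMIO) returns to userspace instead of killing the VM.  The
 * VMM must then fetch the instruction at the guest PC, decode it, and
 * replay the access against its device model -- abstracted here as a
 * caller-supplied callback. */
static int handle_nisv_exit(uint32_t exit_reason,
			    const struct arm_nisv_info *nisv,
			    int (*emulate)(uint64_t ipa, uint64_t esr_iss))
{
	if (exit_reason != EXIT_ARM_NISV)
		return -1; /* not ours to handle */
	return emulate(nisv->fault_ipa, nisv->esr_iss);
}

/* Sample emulator for the sketch: record the IPA and report success. */
static uint64_t last_ipa;
static int sample_emulate(uint64_t ipa, uint64_t esr_iss)
{
	(void)esr_iss;
	last_ipa = ipa;
	return 0;
}
```

The other two choices (never faulting on VFIO MMIO, or always resuming
without emulation) avoid this path entirely, which is why they are
simpler for a VMM that only passes devices through.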