On Wed, Jan 24, 2024 at 01:32:22PM +0000, Marc Zyngier wrote:
> > So, I'm fine if the answer is that VMM's using VFIO need to use
> > KVM_CAP_ARM_NISV_TO_USER and do instruction parsing for emulated IO in
> > userspace if they have a design where VFIO MMIO can infrequently
> > generate faults. That is all VMM design stuff and has nothing to do
> > with the kernel.
>
> Which will work a treat with things like CCA, I'm sure.

CCA shouldn't have emulation or trapping on the MMIO mappings.

> > > Or you can stop whining and try to get better performance out of what
> > > we have today.
> >
> > "better performance"!?!? You are telling me I have to destroy one of
> > our important fast paths for HPC workloads to accommodate some
> > theoretical ARM KVM problem?
>
> What I'm saying is that there are way to make it better without
> breaking your particular toy workload which, as important as it may be
> to *you*, doesn't cover everybody's use case.

Please, do we need the "toy" stuff? The industry is spending tens of
billions of dollars right now to run "my workload". It is currently not
widely run on ARM servers, but we are all hoping ARM can succeed here,
right?

I still don't know what you mean by "better". There are several issues
now:

 1) This series, where WC doesn't trigger on new cores. Maybe 8x STR
    will fix it, but performance-wise it is no better than 4x STP.

 2) Userspace does ST4 to MMIO memory, and the VMM can't explode
    because of this. Replacing the ST4 with 8x STR is NOT better; that
    would be a big performance downside, especially for the quirky
    hi-silicon hardware.

 3) The other series changing the S2 so that WC can work in the VM.

> Mark did post such an example that has the potential of having that
> improvement. I'd suggest that you give it a go.

Mark's patch doesn't help this; I already wrote and evaluated his patch
last week. Unfortunately this needs to be done with explicit inline
assembly, as either STP or STR blocks. I don't know if the 8x STR is
workable or not.
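For reference, a minimal sketch of what such an explicit inline-assembly
block could look like for a 64-byte store. The function name and shape are
illustrative, not the actual mlx5 or __iowrite64_copy() code; the non-arm64
fallback exists only so the sketch is self-contained:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: emit a 64-byte doorbell write as 4x STP so the
 * compiler cannot split it into smaller stores that defeat
 * write-combining.  Off arm64, a plain copy stands in. */
static void wc_store_64b(volatile void *dst, const uint64_t *src)
{
#ifdef __aarch64__
	asm volatile(
		"stp %x[v0], %x[v1], [%[p]]\n\t"
		"stp %x[v2], %x[v3], [%[p], #16]\n\t"
		"stp %x[v4], %x[v5], [%[p], #32]\n\t"
		"stp %x[v6], %x[v7], [%[p], #48]\n\t"
		:
		: [p] "r"(dst),
		  [v0] "r"(src[0]), [v1] "r"(src[1]),
		  [v2] "r"(src[2]), [v3] "r"(src[3]),
		  [v4] "r"(src[4]), [v5] "r"(src[5]),
		  [v6] "r"(src[6]), [v7] "r"(src[7])
		: "memory");
#else
	/* Portable stand-in for non-arm64 builds of this sketch. */
	memcpy((void *)dst, src, 64);
#endif
}
```

The 8x STR variant would replace each STP with two plain `str`
instructions; whether new cores still merge that into a single 64-byte WC
transaction is exactly the open question.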
I need to get someone to test the 8x STR variant, but even if it is
workable the userspace IO for this HW will continue to use ST4.

So, regardless of the kernel decision, if someone is going to put this
HW into a VM then their VMM needs to do *something* to ensure that the
VMM does not malfunction when the VM issues STP/ST4 to the VFIO MMIO.
There are good choices for the VMM here: ensure it never has to process
a S2 VFIO MMIO fault, always resume and never emulate VFIO MMIO, or
correctly handle an emulated S2 fault from a STP/ST4 instruction via
instruction parsing.

Therefore we can assume that working VMMs will exist. Indeed, I would go
further and say that mlx5 HW in a VM must have a working VMM.

So the question is only how pessimistic the arch code for
__iowrite64_copy() should be. My view is that it is only used in a small
number of drivers, and if a VMM creates vPCI devices for those drivers
then the VMM should be expected to bring proper vMMIO support too. I do
not like the notion that all drivers using __iowrite64_copy() should
have sub-optimal bare metal performance because a VMM *might* exist that
has a problem.

> But your attitude of "who cares if it breaks as long as it works for
> me" is not something I can adhere to.

In my world failing to reach performance is a "break" as well. So you
have a server that is "broken" because its performance is degraded,
versus an unknown VMM that is "broken" because it wants to emulate IO
(without implementing instruction parsing) for a device with a driver
that uses __iowrite64_copy().

My server really does exist. I'm not so sure about the other case.

Jason
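P.S.: for anyone weighing the "instruction parsing" option above, a
minimal sketch of the userspace side. The struct, constant value, and
names here are illustrative stand-ins, not the real UAPI; with
KVM_CAP_ARM_NISV_TO_USER enabled, the actual layout (KVM_EXIT_ARM_NISV
and kvm_run's arm_nisv fields) comes from <linux/kvm.h>:

```c
#include <stdint.h>

/* Illustrative stand-ins for the KVM UAPI; the real definitions and
 * the actual exit-reason value live in <linux/kvm.h>. */
#define EXIT_ARM_NISV 28 /* placeholder for KVM_EXIT_ARM_NISV */

struct arm_nisv_info {
	uint64_t esr_iss;   /* syndrome bits KVM could not fully decode */
	uint64_t fault_ipa; /* guest-physical address of the access */
};

/* A non-syndrome-decodable stage-2 fault (e.g. STP or ST4 hitting
 * emulated MMIO) returns to userspace instead of killing the VM.  The
 * VMM must then fetch the instruction at the guest PC, decode it, and
 * replay the access against its device model -- abstracted here as a
 * caller-supplied callback. */
static int handle_nisv_exit(uint32_t exit_reason,
			    const struct arm_nisv_info *nisv,
			    int (*emulate)(uint64_t ipa, uint64_t esr_iss))
{
	if (exit_reason != EXIT_ARM_NISV)
		return -1; /* not ours to handle */
	return emulate(nisv->fault_ipa, nisv->esr_iss);
}

/* Sample emulator for the sketch: record the IPA and report success. */
static uint64_t last_ipa;
static int sample_emulate(uint64_t ipa, uint64_t esr_iss)
{
	(void)esr_iss;
	last_ipa = ipa;
	return 0;
}
```

The other two choices (never faulting on VFIO MMIO, or always resuming
without emulation) avoid this path entirely, which is why they are
simpler for a VMM that only passes devices through.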