Re: [PATCH rdma-next 1/2] arm64/io: add memcpy_toio_64

Jason Gunthorpe <jgg@xxxxxxxxxx> · Wed, 24 Jan 2024 21:29:24 -0400

On Wed, Jan 24, 2024 at 05:54:49PM +0000, Catalin Marinas wrote:
> On Wed, Jan 24, 2024 at 11:52:25AM -0400, Jason Gunthorpe wrote:
> > On Wed, Jan 24, 2024 at 01:32:22PM +0000, Marc Zyngier wrote:
> > > What I'm saying is that there are way to make it better without
> > > breaking your particular toy workload which, as important as it may be
> > > to *you*, doesn't cover everybody's use case.
> > 
> > Please, do we need the "toy" stuff? The industry is spending 10's of
> > billions of dollars right now to run "my workload". Currently not
> > widely on ARM servers, but we are all hoping ARM can succeed here,
> > right?
> > 
> > I still don't know what you mean by "better". There are several issues
> > now
> > 
> > 1) This series, where WC doesn't trigger on new cores. Maybe 8x STR
> >    will fix it, but it is not better performance wise than 4x STP.
> 
> It would be good to know. If the performance difference is significant,
> we can revisit. I'm not keen on using alternatives here without backing
> it up by numbers (do we even have a way to detect whether Linux is
> running natively or not? we may have to invent something).

I don't have a setup to measure performance, and mlx5 is not using it
in a performance path. The other drivers in the tree are, and I feel
bad about hobbling them.

> > 2) Userspace does ST4 to MMIO memory, and the VMM can't explode
> >    because of this. Replacing the ST4 with 8x STR is NOT better,
> >    that would be a big performance downside, especially for the
> >    quirky hi-silicon hardware.
> 
> I was hoping KVM injects an error into the guest rather than killing it
> but at a quick look I couldn't find it. The kvm_handle_guest_abort() ->
> io_mem_abort() ends up returning -ENOSYS while handle_trap_exceptions()
> only understands handled or not (like 1 or 0). Well, maybe I didn't look
> deep enough.

It looks to me like qemu turns on KVM_CAP_ARM_NISV_TO_USER and then,
when it gets a NISV, it always converts it to a data abort to the
guest. See kvm_arm_handle_dabt_nisv() in qemu. So it is just a
correctness issue, not a 'VM userspace can crash the VMM' security
problem.

The reason we've never seen this fault in any of our testing is that
the whole system is designed so that qemu backs any vMMIO space under
hot path use only with a VFIO memslot, i.e. it never drops the memslot
and forces emulation. (KVM has no trouble handling an S2 abort if a
memslot is present, obviously.)

VFIO IO emulation is used to cover corner cases and establish a slow
technical correctness. It is not a fast path; avoid it if you want any
sort of performance.

Thus, IMHO, doing IO emulation for VFIO that doesn't support all the
instructions actual existing SW uses to do IO is hard to justify. We
are already on a slow path that only exists for technical correctness,
it should be perfect. It is perfect on x86 because x86 KVM does SW
instruction decode and emulation. ARM could too, but doesn't.
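
For reference, here is a rough sketch of what a syndrome-only decode
looks like, i.e. emulation that trusts the fault syndrome and never
reads the instruction. This is not the KVM code; the struct and
function names are made up for illustration. The bit positions are the
ESR_EL2 data abort ISS fields (ISV, SAS, SRT, WnR) as I understand
them from the architecture. ISV is only set for simple single register
loads/stores, so an ST4 or STP lands in the "no valid syndrome" branch.

```c
#include <stdbool.h>
#include <stdint.h>

/* ESR_EL2 fields for a data abort (architectural bit positions). */
#define ESR_EC_SHIFT     26
#define ESR_EC_MASK      0x3fu
#define ESR_EC_DABT_LOW  0x24u        /* data abort from a lower EL */
#define ESR_ISV          (1u << 24)   /* instruction syndrome valid */
#define ESR_SAS_SHIFT    22           /* access size, log2(bytes) */
#define ESR_SRT_SHIFT    16           /* transfer register number */
#define ESR_WNR          (1u << 6)    /* write, not read */

/* Hypothetical container for a decoded MMIO access. */
struct mmio_access {
	bool is_write;
	unsigned int len;   /* bytes */
	unsigned int reg;   /* general purpose register index */
};

/*
 * Hypothetical helper: fill *out from the syndrome alone.
 * Returns false when the syndrome does not describe the access
 * (the NISV case), which is where SW instruction decode would be
 * needed to emulate the access.
 */
static bool decode_from_syndrome(uint32_t esr, struct mmio_access *out)
{
	if (((esr >> ESR_EC_SHIFT) & ESR_EC_MASK) != ESR_EC_DABT_LOW)
		return false;
	if (!(esr & ESR_ISV))
		return false;   /* e.g. ST4, STP: no valid syndrome */

	out->is_write = (esr & ESR_WNR) != 0;
	out->len = 1u << ((esr >> ESR_SAS_SHIFT) & 0x3u);
	out->reg = (esr >> ESR_SRT_SHIFT) & 0x1fu;
	return true;
}
```

An x86 style emulator would instead fetch and decode the faulting
instruction at the point where this returns false.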

To put it in a practical example, I predict that if someone steps
outside our "engineered" box and runs a 64k page size hypervisor
kernel with an mlx5 device that is not engineered for 64k page size,
they will get an MMIO BAR layout where the 64k page that covers the MSI
items overlaps with hot path addresses. The existing user space
stack could then issue ST4s to hot path addresses within that emulated
64k of vMMIO and explode. 4k page size hypervisors avoid this because
the typical mlx5 device has a BAR layout with a 4k granule in mind.
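
The page size arithmetic can be sketched like this. The two offsets
are made up for illustration and are NOT taken from any real mlx5 BAR
layout; they only show how two regions that sit in separate 4k pages
end up sharing one 64k page.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical BAR offsets, chosen only to illustrate the overlap. */
#define HYP_MSIX_TABLE_OFF  0x2000ULL  /* emulated (trapped) region */
#define HYP_DOORBELL_OFF    0x3000ULL  /* hot path region */

/*
 * Return true if two byte offsets within a BAR fall in the same page.
 * page_size must be a non-zero power of two.
 */
static bool same_page(uint64_t off_a, uint64_t off_b, uint64_t page_size)
{
	return (off_a / page_size) == (off_b / page_size);
}
```

With these assumed offsets, same_page(HYP_MSIX_TABLE_OFF,
HYP_DOORBELL_OFF, 4096) is false while the same call with 65536 is
true, so on a 64k hypervisor the hot path address falls inside the page
that has to be emulated.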

Jason