Re: MMIO/PIO dispatch file descriptors (ioregionfd) design discussion

On 2020/11/27 9:44 PM, Stefan Hajnoczi wrote:
On Fri, Nov 27, 2020 at 11:39:23AM +0800, Jason Wang wrote:
On 2020/11/26 8:36 PM, Stefan Hajnoczi wrote:
On Thu, Nov 26, 2020 at 11:37:30AM +0800, Jason Wang wrote:
On 2020/11/26 3:21 AM, Elena Afanasova wrote:
Hello,

I'm an Outreachy intern with QEMU and I’m working on implementing the
ioregionfd API in KVM.
So I’d like to resume the ioregionfd design discussion. The latest
version of the ioregionfd API document is provided below.

Overview
--------
ioregionfd is a KVM dispatch mechanism for handling MMIO/PIO accesses over a
file descriptor without returning from ioctl(KVM_RUN). This allows device
emulation to run in another task separate from the vCPU task.

This is achieved through KVM ioctls for registering MMIO/PIO regions and a
wire protocol that KVM uses to communicate with a task handling an MMIO/PIO
access.

The traditional ioctl(KVM_RUN) dispatch mechanism with device emulation in a
separate task looks like this:

     kvm.ko  <---ioctl(KVM_RUN)---> VMM vCPU task <---messages---> device task

ioregionfd improves performance by eliminating the need for the vCPU task to
forward MMIO/PIO exits to device emulation tasks:
I wonder in which cases we care about performance like this. (Note that
vhost-user has supported set|get_config() for a while.)
NVMe emulation needs this because ioeventfd cannot transfer the value
written to the doorbell. That's why QEMU's NVMe emulation doesn't
support IOThreads.

I think it depends on how many different values can be carried via the
doorbell. If it's not too many, we can use datamatch. Anyway, virtio supports
distinguishing the queue index via the value written to the doorbell.
There are too many values, and it's not the queue index. It's the ring index
of the latest request. If the ring size is 128, we need 128 ioeventfd
registrations, etc. It becomes a lot.
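
For illustration, a minimal sketch of what per-value datamatch would look
like with the existing KVM_IOEVENTFD ABI: one eventfd and one registration
per possible doorbell value. The doorbell address and ring size below are
placeholder assumptions, not values taken from the NVMe emulation:

    /* Sketch: per-value ioeventfd datamatch registration.
     * DOORBELL_ADDR and RING_SIZE are hypothetical placeholders. */
    #include <sys/eventfd.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    #define DOORBELL_ADDR 0xfe000000ULL  /* hypothetical MMIO doorbell */
    #define RING_SIZE     128

    static int register_doorbell_ioeventfds(int vm_fd, int fds[RING_SIZE])
    {
        for (int i = 0; i < RING_SIZE; i++) {
            fds[i] = eventfd(0, EFD_CLOEXEC);
            if (fds[i] < 0)
                return -1;

            struct kvm_ioeventfd io = {
                .datamatch = i,             /* fires only for this ring index */
                .addr      = DOORBELL_ADDR,
                .len       = 4,
                .fd        = fds[i],
                .flags     = KVM_IOEVENTFD_FLAG_DATAMATCH,
            };
            if (ioctl(vm_fd, KVM_IOEVENTFD, &io) < 0)
                return -1;
        }
        return 0;
    }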


I see, it's somewhat similar to the NOTIFICATION_DATA feature in virtio.



By the way, the long-term use case for ioregionfd is to allow vfio-user
device emulation processes to directly handle I/O accesses. Elena
benchmarked ioeventfd vs dispatching through QEMU and can share the
performance results. I think the number was around a 30+% improvement via
direct ioeventfd dispatch, so it will be important for high IOPS
devices (network and storage controllers).


That's amazing :)



KVM_CREATE_IOREGIONFD
---------------------
:Capability: KVM_CAP_IOREGIONFD
:Architectures: all
:Type: system ioctl
:Parameters: none
:Returns: an ioregionfd file descriptor, -1 on error

This ioctl creates a new ioregionfd and returns the file descriptor. The fd
can be used to handle MMIO/PIO accesses instead of returning from
ioctl(KVM_RUN) with KVM_EXIT_MMIO or KVM_EXIT_PIO. One or more MMIO or PIO
regions must be registered with KVM_SET_IOREGION in order to receive MMIO/PIO
accesses on the fd. An ioregionfd can be used with multiple VMs and its
lifecycle is not tied to a specific VM.

When the last file descriptor for an ioregionfd is closed, all regions
registered with KVM_SET_IOREGION are dropped and guest accesses to those
regions cause ioctl(KVM_RUN) to return again.
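
Based on the description above, the intended call sequence would look roughly
like the sketch below. KVM_CREATE_IOREGIONFD and KVM_SET_IOREGION are part of
this proposal and not yet in <linux/kvm.h>, so the constant is assumed here:

    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>   /* assumes the proposed KVM_CREATE_IOREGIONFD */

    static int create_ioregionfd(void)
    {
        int kvm_fd = open("/dev/kvm", O_RDWR | O_CLOEXEC);
        if (kvm_fd < 0)
            return -1;

        /* System ioctl: the returned fd is not tied to a specific VM. */
        int ioregion_fd = ioctl(kvm_fd, KVM_CREATE_IOREGIONFD, 0);

        close(kvm_fd);
        return ioregion_fd;  /* -1 if KVM_CAP_IOREGIONFD is unsupported */
    }

The returned fd would then be registered for one or more regions with
KVM_SET_IOREGION and handed to the device emulation task, which services
MMIO/PIO accesses on it.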
I may be missing something, but I don't see any special requirement for this
fd. The fd is just a transport for a protocol between KVM and a userspace
process. So instead of mandating a new type, it might be better to allow any
type of fd to be attached (e.g. a pipe or socket).
pipe(2) is unidirectional on Linux, so it won't work.

Can we accept two file descriptors to make it work?


mkfifo(3) seems usable but creates a node on a filesystem.

socketpair(2) would work, but brings in the network stack when it's not
needed. The advantage is that some future use case might want to direct
ioregionfd over a real socket to a remote host, which would be cool.

Do you have an idea of the performance difference of socketpair(2)
compared to a custom fd?

It should be slower than a custom fd, and a UNIX socket should be faster than
TIPC. Maybe we can have a custom fd, but it's better to leave the policy to
userspace:

1) KVM should not place any limitation on the fd it uses; the user takes the
risk if the fd is used wrongly, and the custom fd should be one of the choices
2) it's better not to have a virt-specific name (e.g. "KVM" or "ioregion")
Okay, it looks like there are things to investigate here.

Elena: My suggestion would be to start with the simplest option -
letting userspace pass in 1 file descriptor. You can investigate the
performance of socketpair(2)/fifo(7), 2 pipe fds, or a custom file
implementation later if time permits. That way the API has maximum
flexibility (userspace can decide on the file type).
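
As a concrete illustration of the "userspace decides the file type" approach,
a VMM could create a socket pair and keep one end for the device emulation
task. This is only a sketch of one of the options listed above, not a
recommendation from the thread:

    #include <sys/socket.h>

    /* Sketch: a connected socket pair as the ioregion transport.
     * Unlike pipe(2) it is bidirectional, and unlike mkfifo(3) it creates
     * no filesystem node. fds[0] would be handed to KVM when registering
     * the region and fds[1] used by the device emulation task. */
    static int create_ioregion_channel(int fds[2])
    {
        return socketpair(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0, fds);
    }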

Or I wonder whether we can attach an eBPF program when trapping MMIO/PIO and
allow it to decide how to proceed?
The eBPF program approach is interesting, but it would probably require
access to guest RAM and additional userspace state (e.g. device-specific
register values). I don't know the current status of Linux eBPF - is it
possible to access user memory (it could be swapped out)?


AFAIK it doesn't, but just to make sure I understand, is there any reason
that eBPF needs to access userspace memory here?

Thanks



Stefan




