Re: [RFC] vduse config write support

* Jason Wang <jasowang@xxxxxxxxxx> [2024-07-26 10:47:59]:

> On Wed, Jul 24, 2024 at 11:45 AM Srivatsa Vaddagiri
> <quic_svaddagi@xxxxxxxxxxx> wrote:
> >
> > Currently vduse does not seem to support configuration space writes
> > (vduse_vdpa_set_config does nothing). Is there any plan to lift that
> > limitation? I am aware of the discussions that took place here:
> >
> > https://patchwork.kernel.org/project/netdevbpf/patch/20210615141331.407-11-xieyongji@xxxxxxxxxxxxx/
> >
> > Perhaps writes can be supported *selectively* without violating the safety concerns
> > expressed in the email discussion above?
> 
> Adding more relevant people here.
> 
> It can probably be done case by case. The main reasons for avoiding
> config writes are:
> 
> 1) to prevent a buggy/malicious userspace from hanging the kernel driver forever
> 2) to prevent a buggy/malicious userspace device from breaking the semantics
> 
> Basically, it is the traditional trust model where the kernel doesn't
> trust userspace.
> 
> E.g current virtio-blk has the following codes:
> 
> static ssize_t
> cache_type_store(struct device *dev, struct device_attribute *attr,
>          const char *buf, size_t count)
> {
>         struct gendisk *disk = dev_to_disk(dev);
>         struct virtio_blk *vblk = disk->private_data;
>         struct virtio_device *vdev = vblk->vdev;
>         int i;
> 
>         BUG_ON(!virtio_has_feature(vblk->vdev, VIRTIO_BLK_F_CONFIG_WCE));
>         i = sysfs_match_string(virtblk_cache_types, buf);
>         if (i < 0)
>                 return i;
> 
>         virtio_cwrite8(vdev, offsetof(struct virtio_blk_config, wce), i);
>         virtblk_update_cache_mode(vdev);
>         return count;
> }
> 
> So basically the question is whether we make the config write a posted
> write or a non-posted one.
> 
> If we make the vduse config write a posted one, it means vduse doesn't
> need to wait for the userspace response. This is much safer, but it
> breaks the above code, which looks like it requires non-posted
> semantics. We need to filter out virtio-blk or change the code to have
> a read after the write:
> 
> while (virtio_cread8(vdev, offsetof(struct virtio_blk_config, wce)) & XXX)
>        /* sleep or other */
>
> But we may suffer from a userspace that never updates the wce bit,
> or maybe we can have a timeout there.
> 
> If we make the vduse config write a non-posted one, it means vduse
> needs to wait for the response from userspace. 

I was thinking only a read followed by a write would need to stall on a response
from userspace, but yes, we will have the same issue of dealing with a buggy
userspace. At least for PCI, I think there is provision to use the surprise-removal
facility, where the guest can be notified of a malfunctioning device. If userspace
(the vduse daemon) does not respond in time, can the surprise-removal event be
injected by the VMM (the hypervisor in our case)? The guest can then decide what
to do next.
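
Just to make my thinking a bit more concrete, here is a very rough sketch of
what a non-posted config write with a bounded wait could look like on the
vduse side. The VDUSE_SET_CONFIG request type and the struct layout are
hypothetical, vduse_dev_msg_sync() is my shorthand for the existing
request/reply path to the daemon, and the failure handling only marks where a
surprise-removal style event could be raised:

static void vduse_vdpa_set_config(struct vdpa_device *vdpa, unsigned int offset,
                                  const void *buf, unsigned int len)
{
        struct vduse_dev *dev = vdpa_to_vduse(vdpa);
        struct vduse_dev_msg msg = { 0 };

        if (offset > dev->config_size ||
            len > dev->config_size - offset)
                return;

        /* Hypothetical request asking the daemon to apply the write */
        msg.req.type = VDUSE_SET_CONFIG;
        msg.req.config.offset = offset;
        msg.req.config.length = len;
        /* Assumes the request can carry the whole write; real code must bound this */
        memcpy(msg.req.config.data, buf, len);

        /*
         * Bounded wait for the daemon's reply. On timeout, rather than
         * stalling the caller forever, this is where a surprise-removal
         * style event could be raised so the guest can react.
         */
        if (vduse_dev_msg_sync(dev, &msg))
                dev_warn(&vdpa->dev, "config write not acked by userspace\n");
}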

> It satisfies the
> above code's assumption, but it needs to deal with buggy userspace,
> which might be challenging. Technically, we can have a device
> emulation in the kernel but it looks like overkill for wce (or I don't
> know how it can mandate wce for userspace devices).
> 
> I feel it might make sense for other devices that only require posted
> write semantics.
> 
> >
> > We are thinking of using vduse for hypervisor-assisted virtio devices, which
> > may need config write support, hence this question.
> >
> > To provide more details, we have an untrusted host that spins off a protected
> > (confidential) guest VM on a Type-1 hypervisor (Gunyah). A VMM in the untrusted host
> > leads to a couple of issues:
> >
> > 1) Latency of (virtio) register access. The VMM can take too long to respond, with
> > the VCPU stalled all that while. I think vduse shares a similar concern, due to
> > which it maintains a cache of configuration registers inside the kernel.
> 
> Maybe you can give an example for this? We cache the configuration
> space to allow faster access to it.

Yes, for the same reason of faster access, we wish to have config information
cached in the hypervisor. In the case of VDUSE:

Guest read -> (VM exit) -> VHOST_VDPA_GET_CONFIG -> vduse_vdpa_get_config

Basically, a guest read terminates in the host kernel before the guest resumes.
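
For reference, the in-kernel side of that read is already just a copy from the
cached config. Roughly (simplified from my reading of vduse_vdpa_get_config(),
not verbatim):

static void vduse_vdpa_get_config(struct vdpa_device *vdpa, unsigned int offset,
                                  void *buf, unsigned int len)
{
        struct vduse_dev *dev = vdpa_to_vduse(vdpa);

        /* Serve the read from the in-kernel copy; no round trip to userspace */
        memset(buf, 0, len);
        if (offset > dev->config_size)
                return;
        if (len > dev->config_size - offset)
                len = dev->config_size - offset;
        memcpy(buf, dev->config + offset, len);
}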

In our case:

Guest read -> (VM exit) -> Hyp emulates read -> (VM resume)

So a guest read would terminate in the hypervisor itself before the guest is resumed.

Without this optimization, the guest VCPU would stall until the VMM in the host can
emulate the access, which can take long; this is especially a concern when the read
is issued in a hot path (e.g. an interrupt handler without MSI-X).

> > 2) For PCI pass-through devices, we are concerned about letting the VMM be in charge of
> > emulating the complete configuration space (how can the VM defend against invalid
> > attributes presented for pass-through devices)?
> 
> Virtio driver has been hardened for this, for example:
> 
> commit 72b5e8958738aaa453db5149e6ca3bcf416023b9
> Author: Jason Wang <jasowang@xxxxxxxxxx>
> Date:   Fri Jun 4 13:53:50 2021 +0800
> 
>     virtio-ring: store DMA metadata in desc_extra for split virtqueue
> 
> More hardening work is ongoing.

Any additional pointers you can share? I will go over them and get back.

> > I am aware of TDISP, but I think it
> > may not be available for some of the devices on our platform.
> >
> > One option we are considering is for the hypervisor to be in charge of virtio-PCI
> > bus emulation, allowing only select devices (with recognized features) to be
> > registered on the bus. The VMM would need to register devices/features with the
> > hypervisor, which would verify them (as per some policy) and present them to the VM on
> > the virtio-PCI bus it emulates. The protected VM should be shielded from
> > invalid device configuration information that it may otherwise read from a
> > compromised VMM.
> >
> > For virtio devices, the hypervisor would also service most register reads/writes
> > (to address concern #1), which implies it would need to cache a copy of the
> > device configuration information (similar to vduse).
> >
> > We think vduse can be leveraged here to initialize the hypervisor cache of
> > virtio registers. Basically, have a vdpa-gunyah driver registered on the vdpa
> > bus to which vduse devices are bound (rather than virtio-vdpa or vhost-vdpa).
> > The vdpa-gunyah driver can pull configuration information from vduse and pass it
> > on to the hypervisor. It will also help inject IRQs and pass on queue notifications
> > (using hypervisor-specific APIs).
> 
> Just to make sure I understand the design here, is vdpa-gunyah
> expected to have a dedicated uAPI other than vhost-vDPA? 

I didn't think vdpa-gunyah would need to provide any uAPI. The VDUSE daemon
functionality would be inside the VMM itself. For example, the virtio-block backend
in the VMM would create a VDUSE device and pass key configuration information to the
VDUSE kernel module. The vdpa device created subsequently (vdpa dev add ..)
will be bound to the vdpa-gunyah driver, which pulls the configuration information
and passes it on to the hypervisor, which will then emulate all further accesses from the VM.
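
A rough sketch of the probe path I have in mind is below. The vdpa helpers are
the existing ones; gunyah_virtio_set_config() is just a placeholder for the
hypervisor-specific call, and the IRQ/notification wiring is omitted:

static int vdpa_gunyah_probe(struct vdpa_device *vdpa)
{
        const struct vdpa_config_ops *ops = vdpa->config;
        size_t size = ops->get_config_size(vdpa);
        void *config;
        int ret;

        config = kzalloc(size, GFP_KERNEL);
        if (!config)
                return -ENOMEM;

        /* Pull the config from the parent (vduse) device ... */
        vdpa_get_config(vdpa, 0, config, size);

        /*
         * ... and hand it to the hypervisor, which caches it and emulates
         * further config-space accesses from the protected VM.
         */
        ret = gunyah_virtio_set_config(vdpa, config, size);

        kfree(config);
        return ret;
}

static void vdpa_gunyah_remove(struct vdpa_device *vdpa)
{
        /* tear down hypervisor-side state (omitted) */
}

static struct vdpa_driver vdpa_gunyah_driver = {
        .driver.name    = "vdpa_gunyah",
        .probe          = vdpa_gunyah_probe,
        .remove         = vdpa_gunyah_remove,
};
module_vdpa_driver(vdpa_gunyah_driver);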

> I wonder if there is any reason why vhost-vDPA can't be used here.

I think vhost-vDPA would imply that the VMM still tracks the device on its PCI/MMIO
transports and would be involved in any register read from the VM? We didn't want the
VMM to be involved in emulating the VM's register accesses; instead, the hypervisor
will emulate most of those accesses. I need to think more, but I am thinking the VMM
will not even track this device on its PCI/MMIO transports, but rather on a
different (vdpa?) transport.

> > We will, however, likely need vduse to support configuration writes (the guest VM
> > updating the configuration space, e.g. writing to the 'events_clear' field in the case of
> > virtio-gpu). Would vduse maintainers be willing to accept config_write support
> > for select devices/features (as long as the writes don't violate any safety
> > concerns we may have)?
> 
> I think so; looking at virtio_gpu_config_changed_work_func(), the
> events_clear write seems fine with posted semantics.
> 
> Maybe you can post an RFC to support config writing and let's start from there?
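
Right, that matches my reading of virtio_gpu_config_changed_work_func(): it
reads events_read, handles the event, then writes events_clear without ever
reading it back, so a posted write looks sufficient there. Roughly (simplified
from memory, not the exact upstream code):

static void virtio_gpu_config_changed_work_func(struct work_struct *work)
{
        struct virtio_gpu_device *vgdev =
                container_of(work, struct virtio_gpu_device, config_changed_work);
        u32 events_read, events_clear = 0;

        virtio_cread_le(vgdev->vdev, struct virtio_gpu_config,
                        events_read, &events_read);
        if (events_read & VIRTIO_GPU_EVENT_DISPLAY) {
                /* refresh display info, then mark the event for clearing */
                events_clear |= VIRTIO_GPU_EVENT_DISPLAY;
        }
        /* A posted write is fine: nothing here depends on reading it back */
        virtio_cwrite_le(vgdev->vdev, struct virtio_gpu_config,
                         events_clear, &events_clear);
}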

Ok thanks for your feedback!

- vatsa



