* Jason Wang <jasowang@xxxxxxxxxx> [2024-07-26 10:47:59]:

> On Wed, Jul 24, 2024 at 11:45 AM Srivatsa Vaddagiri
> <quic_svaddagi@xxxxxxxxxxx> wrote:
> >
> > Currently vduse does not seem to support configuration space writes
> > (vduse_vdpa_set_config does nothing). Is there any plan to lift that
> > limitation? I am aware of the discussions that took place here:
> >
> > https://patchwork.kernel.org/project/netdevbpf/patch/20210615141331.407-11-xieyongji@xxxxxxxxxxxxx/
> >
> > Perhaps writes can be supported *selectively* without violating the safety
> > concerns expressed in the above email discussion?
>
> Adding more relevant people here.
>
> It can probably be done case by case. The main reasons for avoiding
> config writes are
>
> 1) to prevent a buggy/malicious userspace from hanging the kernel driver forever
> 2) to prevent a buggy/malicious userspace device from breaking the semantics
>
> Basically, it is the traditional trust model where the kernel doesn't
> trust userspace.
>
> E.g. the current virtio-blk has the following code:
>
> static ssize_t
> cache_type_store(struct device *dev, struct device_attribute *attr,
>                  const char *buf, size_t count)
> {
>         struct gendisk *disk = dev_to_disk(dev);
>         struct virtio_blk *vblk = disk->private_data;
>         struct virtio_device *vdev = vblk->vdev;
>         int i;
>
>         BUG_ON(!virtio_has_feature(vblk->vdev, VIRTIO_BLK_F_CONFIG_WCE));
>         i = sysfs_match_string(virtblk_cache_types, buf);
>         if (i < 0)
>                 return i;
>
>         virtio_cwrite8(vdev, offsetof(struct virtio_blk_config, wce), i);
>         virtblk_update_cache_mode(vdev);
>         return count;
> }
>
> So basically the question is whether we make the config write a posted
> write or a non-posted one.
>
> If we make the vduse config write a posted one, it means vduse doesn't
> need to wait for the userspace response. This is much safer, but it
> breaks the above code, which looks like it requires non-posted
> semantics. We need to filter out virtio-blk or change the code to have
> a read after the write:
>
>         while (virtio_cread8(vdev, offsetof(struct virtio_blk_config, wce)) & XXX)
>                 /* sleep or other */
>
> But we may suffer from a userspace that never updates the wce bit,
> or maybe we can have a timeout there.
>
> If we make the vduse config write a non-posted one, it means vduse
> needs to wait for the response from userspace.

I was thinking that only a read following a write would need to stall on a
response from userspace, but yes, we will have the same issue of dealing with
a buggy userspace.
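To make the timeout idea concrete, something like the below is what I was
picturing on top of the existing cache_type_store() (just an untested sketch
from my side; the one-second timeout and the 10 ms poll interval are arbitrary
values picked for illustration):

static ssize_t
cache_type_store(struct device *dev, struct device_attribute *attr,
                 const char *buf, size_t count)
{
        struct gendisk *disk = dev_to_disk(dev);
        struct virtio_blk *vblk = disk->private_data;
        struct virtio_device *vdev = vblk->vdev;
        unsigned long timeout = jiffies + msecs_to_jiffies(1000); /* arbitrary */
        int i;

        BUG_ON(!virtio_has_feature(vblk->vdev, VIRTIO_BLK_F_CONFIG_WCE));
        i = sysfs_match_string(virtblk_cache_types, buf);
        if (i < 0)
                return i;

        /* Posted write: return without waiting for the vduse daemon */
        virtio_cwrite8(vdev, offsetof(struct virtio_blk_config, wce), i);

        /*
         * Poll the (cached) config until the device reflects the new value,
         * and bail out if a buggy/malicious userspace never updates it.
         * Sleeping is fine here since a sysfs store runs in process context.
         */
        while (virtio_cread8(vdev, offsetof(struct virtio_blk_config, wce)) != i) {
                if (time_after(jiffies, timeout))
                        return -ETIMEDOUT;
                msleep(10);
        }

        virtblk_update_cache_mode(vdev);
        return count;
}

That would keep the write itself posted while still bounding how long a
misbehaving vduse daemon can hold up the sysfs writer.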
At least for PCI, I think there is provision to use the surprise-removal
facility, where the guest can be notified of a malfunctioning device. In case
userspace (the vduse daemon) does not respond in time, can the surprise-removal
event be injected by the VMM (the hypervisor in our case)? The guest can then
decide what to do next.

> It satisfies the above code's assumption, but it needs to deal with the
> buggy userspace, which might be challenging. Technically, we can have a
> device emulation in the kernel, but it looks like overkill for wce (or I
> don't know how it can mandate wce for userspace devices).
>
> I feel it might make sense for other devices that only require posted
> write semantics.
>
> > We are thinking of using vduse for hypervisor-assisted virtio devices,
> > which may need config write support, hence this question.
> >
> > To provide more details, we have an untrusted host that spins off a
> > protected (confidential) guest VM on a Type-1 hypervisor (Gunyah). The VMM
> > in the untrusted host leads to a couple of issues:
> >
> > 1) Latency of (virtio) register access. The VMM can take too long to
> > respond, with the VCPU stalled all that while. I think vduse shares a
> > similar concern, due to which it maintains a cache of configuration
> > registers inside the kernel.
>
> Maybe you can give an example for this? We cache the configuration
> space to allow faster access to it.

Yes, for the same reason of faster access, we wish to have the config
information cached in the hypervisor. In case of VDUSE:

  Guest read -> (VM exit) -> VHOST_VDPA_GET_CONFIG -> vduse_vdpa_get_config

Basically, a guest read terminates in the host kernel before the guest
resumes. In our case:

  Guest read -> (VM exit) -> Hyp emulates read -> (VM resume)

So a guest read would terminate in the hypervisor itself before the guest is
resumed. Without this optimization, the guest VCPU would stall until the VMM
in the host emulates the access, which can take long. That is especially a
concern when the read is issued in a hot path (interrupt handler, w/o MSI-X).

> > 2) For PCI pass-through devices, we are concerned about letting the VMM be
> > in charge of emulating the complete configuration space (how can the VM
> > defend against invalid attributes presented for pass-through devices)?
>
> The virtio driver has been hardened for this, for example:
>
> commit 72b5e8958738aaa453db5149e6ca3bcf416023b9
> Author: Jason Wang <jasowang@xxxxxxxxxx>
> Date:   Fri Jun 4 13:53:50 2021 +0800
>
>     virtio-ring: store DMA metadata in desc_extra for split virtqueue
>
> More hardening work is ongoing.

Any additional pointers you can share? I will go over them and get back to you.

> > I am aware of TDISP, but I think it may not be available for some of the
> > devices on our platform.
> >
> > One option we are considering is for the hypervisor to be in charge of
> > virtio-PCI bus emulation, allowing only select devices (with recognized
> > features) to be registered on the bus. The VMM would need to register
> > devices/features with the hypervisor, which would verify them (as per some
> > policy) and present them to the VM on the virtio-PCI bus it emulates. The
> > protected VM should be shielded from invalid device configuration
> > information that it might otherwise read from a compromised VMM.
> >
> > For virtio devices, the hypervisor would also service most register
> > reads/writes (to address concern #1), which implies it would need to cache
> > a copy of the device configuration information (similar to vduse).
> >
> > We think vduse can be leveraged here to initialize the hypervisor's cache
> > of virtio registers. Basically, have a vdpa-gunyah driver registered on
> > the vdpa bus, to which vduse devices are bound (rather than virtio-vdpa or
> > vhost-vdpa). The vdpa-gunyah driver can pull configuration information
> > from vduse and pass it on to the hypervisor. It will also help inject IRQs
> > and pass on queue notifications (using hypervisor-specific APIs).
>
> Just to make sure I understand the design here, is vdpa-gunyah
> expected to have a dedicated uAPI other than vhost-vDPA?

I didn't think vdpa-gunyah would need to provide any uAPI. The VDUSE daemon
functionality would be inside the VMM itself. For example, the virtio-block
backend in the VMM would create a VDUSE device and pass the key configuration
information on to the VDUSE kernel module. The vdpa device created
subsequently (vdpa dev add ...) will be bound to the vdpa-gunyah driver, which
pulls the configuration information and passes it on to the hypervisor, which
will then emulate all further accesses from the VM. A rough sketch of the
probe path I am picturing is below.
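This is only meant to illustrate the flow; gunyah_hyp_register_virtio_dev()
is a made-up placeholder for the hypervisor-specific interface (nothing is
settled there yet), while the rest uses the existing vdpa bus-driver API:

#include <linux/module.h>
#include <linux/slab.h>
#include <linux/vdpa.h>

/*
 * Placeholder for the hypervisor-specific call that hands the cached config
 * space to Gunyah; name and signature are made up purely for illustration.
 */
static int gunyah_hyp_register_virtio_dev(struct vdpa_device *vdpa,
                                          u64 features, void *config,
                                          size_t config_size)
{
        return 0;       /* would issue the relevant Gunyah hypercall(s) here */
}

static int vdpa_gunyah_probe(struct vdpa_device *vdpa)
{
        const struct vdpa_config_ops *ops = vdpa->config;
        size_t config_size = ops->get_config_size(vdpa);
        u64 features = ops->get_device_features(vdpa);
        void *config;
        int ret;

        config = kzalloc(config_size, GFP_KERNEL);
        if (!config)
                return -ENOMEM;

        /* Pull the (VDUSE-maintained) config space once ... */
        vdpa_get_config(vdpa, 0, config, config_size);

        /*
         * ... and hand it to the hypervisor, which then emulates further
         * guest register accesses without bouncing to the VMM.
         */
        ret = gunyah_hyp_register_virtio_dev(vdpa, features, config,
                                             config_size);

        kfree(config);
        return ret;
}

static void vdpa_gunyah_remove(struct vdpa_device *vdpa)
{
        /* tear down hypervisor state, IRQ/notification plumbing, etc. */
}

static struct vdpa_driver vdpa_gunyah_driver = {
        .driver.name = "vdpa_gunyah",
        .probe       = vdpa_gunyah_probe,
        .remove      = vdpa_gunyah_remove,
};
module_vdpa_driver(vdpa_gunyah_driver);

MODULE_LICENSE("GPL");

Queue setup, IRQ injection and doorbell forwarding would be wired up through
additional hypervisor-specific calls; the above only shows how the
vduse-populated config would seed the hypervisor's cache.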
> Wondering if there is any reason why vhost-vDPA cannot be used here.

I think vhost-vDPA would imply that the VMM still tracks the device on its
PCI/MMIO transports and is involved in every register read from the VM? We
didn't want the VMM to be involved in emulating the VM's register accesses;
instead, the hypervisor will emulate most of those accesses. I need to think
about it more, but I am thinking the VMM will not even track this device on
its PCI/MMIO transports, but rather on a different (vdpa?) transport.

> > We will, however, likely need vduse to support configuration writes (the
> > guest VM updating the configuration space, for example writing to the
> > 'events_clear' field in case of virtio-gpu). Would the vduse maintainers
> > be willing to accept config_write support for select devices/features (as
> > long as the writes don't violate any safety concerns we may have)?
>
> I think so. Looking at virtio_gpu_config_changed_work_func(), events_clear
> seems to be fine with posted semantics.
>
> Maybe you can post an RFC to support config writing and let's start from
> there?

Ok, thanks for your feedback!

- vatsa
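P.S. For reference, the virtio-gpu flow being discussed is roughly as below (a
paraphrase of what virtio_gpu_config_changed_work_func() does, with a made-up
helper name rather than the driver's actual code). The driver reads
events_read, handles the event, and writes the handled bits to events_clear
without ever reading events_clear back, which is why a posted config write
looks sufficient there:

static void ack_display_event(struct virtio_device *vdev)
{
        u32 events_read, events_clear = 0;

        /* Find out which events the device has raised */
        virtio_cread_le(vdev, struct virtio_gpu_config,
                        events_read, &events_read);
        if (events_read & VIRTIO_GPU_EVENT_DISPLAY) {
                /* rescan displays, etc. */
                events_clear |= VIRTIO_GPU_EVENT_DISPLAY;
        }

        /*
         * events_clear is write-only from the driver's point of view; it is
         * never read back, so the write can safely be posted.
         */
        virtio_cwrite_le(vdev, struct virtio_gpu_config,
                         events_clear, &events_clear);
}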