On Fri, Sep 2, 2022 at 2:14 PM Michael S. Tsirkin <mst@xxxxxxxxxx> wrote: > > On Fri, Sep 02, 2022 at 02:03:22PM +0800, Jason Wang wrote: > > On Fri, Aug 26, 2022 at 2:24 PM Si-Wei Liu <si-wei.liu@xxxxxxxxxx> wrote: > > > > > > > > > > > > On 8/22/2022 8:26 PM, Jason Wang wrote: > > > > On Mon, Aug 22, 2022 at 1:08 PM Zhu, Lingshan <lingshan.zhu@xxxxxxxxx> wrote: > > > >> > > > >> > > > >> On 8/20/2022 4:55 PM, Si-Wei Liu wrote: > > > >>> > > > >>> On 8/18/2022 5:42 PM, Jason Wang wrote: > > > >>>> On Fri, Aug 19, 2022 at 7:20 AM Si-Wei Liu <si-wei.liu@xxxxxxxxxx> > > > >>>> wrote: > > > >>>>> > > > >>>>> On 8/17/2022 9:15 PM, Jason Wang wrote: > > > >>>>>> 在 2022/8/17 18:37, Michael S. Tsirkin 写道: > > > >>>>>>> On Wed, Aug 17, 2022 at 05:43:22PM +0800, Zhu, Lingshan wrote: > > > >>>>>>>> On 8/17/2022 5:39 PM, Michael S. Tsirkin wrote: > > > >>>>>>>>> On Wed, Aug 17, 2022 at 05:13:59PM +0800, Zhu, Lingshan wrote: > > > >>>>>>>>>> On 8/17/2022 4:55 PM, Michael S. Tsirkin wrote: > > > >>>>>>>>>>> On Wed, Aug 17, 2022 at 10:14:26AM +0800, Zhu, Lingshan wrote: > > > >>>>>>>>>>>> Yes it is a little messy, and we can not check _F_VERSION_1 > > > >>>>>>>>>>>> because of > > > >>>>>>>>>>>> transitional devices, so maybe this is the best we can do for > > > >>>>>>>>>>>> now > > > >>>>>>>>>>> I think vhost generally needs an API to declare config space > > > >>>>>>>>>>> endian-ness > > > >>>>>>>>>>> to kernel. vdpa can reuse that too then. > > > >>>>>>>>>> Yes, I remember you have mentioned some IOCTL to set the > > > >>>>>>>>>> endian-ness, > > > >>>>>>>>>> for vDPA, I think only the vendor driver knows the endian, > > > >>>>>>>>>> so we may need a new function vdpa_ops->get_endian(). > > > >>>>>>>>>> In the last thread, we say maybe it's better to add a comment for > > > >>>>>>>>>> now. > > > >>>>>>>>>> But if you think we should add a vdpa_ops->get_endian(), I can > > > >>>>>>>>>> work > > > >>>>>>>>>> on it for sure! > > > >>>>>>>>>> > > > >>>>>>>>>> Thanks > > > >>>>>>>>>> Zhu Lingshan > > > >>>>>>>>> I think QEMU has to set endian-ness. No one else knows. > > > >>>>>>>> Yes, for SW based vhost it is true. But for HW vDPA, only > > > >>>>>>>> the device & driver knows the endian, I think we can not > > > >>>>>>>> "set" a hardware's endian. > > > >>>>>>> QEMU knows the guest endian-ness and it knows that > > > >>>>>>> device is accessed through the legacy interface. > > > >>>>>>> It can accordingly send endian-ness to the kernel and > > > >>>>>>> kernel can propagate it to the driver. > > > >>>>>> I wonder if we can simply force LE and then Qemu can do the endian > > > >>>>>> conversion? > > > >>>>> convert from LE for config space fields only, or QEMU has to forcefully > > > >>>>> mediate and covert endianness for all device memory access including > > > >>>>> even the datapath (fields in descriptor and avail/used rings)? > > > >>>> Former. Actually, I want to force modern devices for vDPA when > > > >>>> developing the vDPA framework. But then we see requirements for > > > >>>> transitional or even legacy (e.g the Ali ENI parent). So it > > > >>>> complicates things a lot. > > > >>>> > > > >>>> I think several ideas has been proposed: > > > >>>> > > > >>>> 1) Your proposal of having a vDPA specific way for > > > >>>> modern/transitional/legacy awareness. This seems very clean since each > > > >>>> transport should have the ability to do that but it still requires > > > >>>> some kind of mediation for the case e.g running BE legacy guest on LE > > > >>>> host. > > > >>> In theory it seems like so, though practically I wonder if we can just > > > >>> forbid BE legacy driver from running on modern LE host. For those who > > > >>> care about legacy BE guest, they mostly like could and should talk to > > > >>> vendor to get native BE support to achieve hardware acceleration, > > > > The problem is the hardware still needs a way to know the endian of the guest? > > > Hardware doesn't need to know. VMM should know by judging from VERSION_1 > > > feature bit negotiation and legacy interface access (with new code), and > > > the target architecture endianness (the latter is existing QEMU code). > > > > > > > >>> few > > > >>> of them would count on QEMU in mediating or emulating the datapath > > > >>> (otherwise I don't see the benefit of adopting vDPA?). I still feel > > > >>> that not every hardware vendor has to offer backward compatibility > > > >>> (transitional device) with legacy interface/behavior (BE being just > > > >>> one), > > > > Probably, I agree it is a corner case, and dealing with transitional > > > > device for the following setups is very challenge for hardware: > > > > > > > > - driver without IOMMU_PLATFORM support, (requiring device to send > > > > translated request which have security implications) > > > Don't get better suggestion for this one, but I presume this is > > > something legacy guest users should be aware of ahead in term of > > > security implications. > > > > Probably but I think this assumption will prevent the device from > > being used in a production environment. > > > > > > > > > - BE legacy guest on LE host, (requiring device to have a way to know > > > > the endian) > > > Yes. device can tell apart with the help from VMM (judging by VERSION_1 > > > acknowledgement and if legacy interface is used during negotiation). > > > > > > > - device specific requirement (e.g modern virtio-net mandate minimal > > > > header length to contain mrg_rxbuf even if the device doesn't offer > > > > it) > > > This one seems to be spec mandated transitional interface requirement? > > > > Yes. > > > > > Which vDPA hardware vendor should take care of rather (if they do offer > > > a transitional device)? > > > > Right but this is not the only one. Section 7.4 summaries the > > transitional device conformance which is a very long list for vendor > > to follow. > > > > > > > > > > It is not obvious for the hardware vendor, so we may end up defecting > > > > in the implementation. Dealing with compatibility for the transitional > > > > devices is kind of a nightmare which there's no way for the spec to > > > > rule the behavior of legacy devices. > > > The compatibility detection part is tedious I agree. That's why I > > > suggested starting from the very minimal and practically feasible (for > > > e.g. on x86), but just don't prohibit the possibility to extend to big > > > endian or come up with quirk fixes for various cases in QEMU. > > > > This is somehow we've already been done, e.g ali ENI is limited to x86. > > > > > > > > > > > > >>> this is unlike the situation on software virtio device, which > > > >>> has legacy support since day one. I think we ever discussed it before: > > > >>> for those vDPA vendors who don't offer legacy guest support, maybe we > > > >>> should mandate some feature for e.g. VERSION_1, as these devices > > > >>> really don't offer functionality of the opposite side (!VERSION_1) > > > >>> during negotiation. > > > > I've tried something similar here (a global mandatory instead of per device). > > > > > > > > https://urldefense.com/v3/__https://lkml.org/lkml/2021/6/4/26__;!!ACWV5N9M2RV99hQ!NRQPfj5o9o3MKE12ze1zfXMC-9SqwOWqF26g8RrIyUDbUmwDIwl5WQCaNiDe6aZ2yR83j-NEqRXQNXcNyOo$ > > > > > > > > But for some reason, it is not applied by Michael. It would be a great > > > > relief if we support modern devices only. Maybe it's time to revisit > > > > this idea then we can introduce new backend features and then we can > > > > mandate VERSION_1 > > > Probably, mandating per-device should be fine I guess. > > > > > > > > > > >>> Having it said, perhaps we should also allow vendor device to > > > >>> implement only partial support for legacy. We can define "reversed" > > > >>> backend feature to denote some part of the legacy > > > >>> interface/functionality not getting implemented by device. For > > > >>> instance, VHOST_BACKEND_F_NO_BE_VRING, VHOST_BACKEND_F_NO_BE_CONFIG, > > > >>> VHOST_BACKEND_F_NO_ALIGNED_VRING, > > > >>> VHOST_BACKEND_NET_F_NO_WRITEABLE_MAC, and et al. Not all of these > > > >>> missing features for legacy would be easy for QEMU to make up for, so > > > >>> QEMU can selectively emulate those at its best when necessary and > > > >>> applicable. In other word, this design shouldn't prevent QEMU from > > > >>> making up for vendor device's partial legacy support. > > > > This looks too heavyweight since it tries to provide compatibility for > > > > legacy drivers. > > > That's just for the sake of extreme backward compatibility, but you can > > > say that's not even needed if we mandate transitional device to offer > > > all required interfaces for both legacy and modern guest. > > > > > > > Considering we've introduced modern devices for 5+ > > > > years, I'd rather: > > > > > > > > - Qemu to mediate the config space stuffs > > > > - Shadow virtqueue to mediate the datapath (AF_XDP told us shadow ring > > > > can perform very well if we do zero-copy). > > > This is one way to achieve, though not sure we should stick the only > > > hope to zero-copy, which IMHO may take a long way to realize and > > > optimize to where a simple datapath passthrough can easily get to (with > > > hardware acceleration of coz). > > > > Note that, current shadow virtqueue is zerocopy, Qemu just need to > > forward the descriptors. > > > > > > > > > > > > >>>> 2) Michael suggests using VHOST_SET_VRING_ENDIAN where it means we > > > >>>> need a new config ops for vDPA bus, but it doesn't solve the issue for > > > >>>> config space (at least from its name). We probably need a new ioctl > > > >>>> for both vring and config space. > > > >>> Yep adding a new ioctl makes things better, but I think the key is not > > > >>> the new ioctl. It's whether or not we should enforce every vDPA vendor > > > >>> driver to implement all transitional interfaces to be spec compliant. > > > > I think the answer is no since the spec allows transitional device. > > > > And we know things will be greatly simplified if vDPA support non > > > > transitional device only. > > > > > > > > So we can change the question to: > > > > > > > > 1) do we need (or is it too late) to enforce non transitional device? > > > We already have Alibaba ENI which is sort of a quasi-transitional > > > device, right? In the sense it doesn't advertise VERSION_1. I know the > > > other parts might not qualify it to be fully transitional, but code now > > > doesn't prohibit it from supporting VERSION_1 modern interface depending > > > on whatever future need. > > > > We can ask ENI developer for their future plan, it looks to me a > > legacy only device wouldn't be interested in the future. > > > > Zongyong, do you have the plan to implement device with VERSION_1 support? > > > > > > 2) if yes, can transitional device be mediate in an efficient way? > > > > > > > > For 1), it's probably too late but we can invent new vDPA features as > > > > you suggest to be non transitional. Then we can: > > > > > > > > 1.1) extend the netlink API to provision non-transitonal device > > > Define non-transitional: device could be either modern-only or legacy-only? > > > > According to the spec, non-transitional should be modern only. > > > > > > 1.2) work on the non-transtional device in the future > > > > 1.3) presenting transitional device via mediation > > > presenting transitional on top of a modern device with VERSION_1, right? > > > > Yes, I mean presenting/mediating a transitional device on top of a > > non-transitional device. > > > > > What if the hardware device can support legacy-compatible datapath > > > natively that doesn't need mediation? Can it be done with direct > > > datapath passthrough without svq involvement? > > > > I'd like to avoid supporting legacy-only device like ENI in the > > future. The major problem is that it's out of the spec thus the > > behaviour is defined by the code not the spec. > > > > > > > > > > > > > The previous transitional vDPA work as is, it's probably too late to > > > > fix all the issue we suffer. > > > What do you mean work as-is, > > > > See above, basically I mean the behaviour is defined by the vDPA code > > not (or can't) by the spec. > > > > For example, for virtio-pci, we have: > > > > legacy interface: BAR > > modern interface: capability > > > > So a transitional device can simple provide both of those interfaces: > > E.g for device configuration space, if it is accessed via legacy > > interface device know it needs to provide the config with native > > endian otherwise little endian when accessing via modern interface. > > > > For virtio-mmio, it looks to me it doesn't provide a way for > > transitional device. > > > > For vDPA, we actually don't define whether config_ops is a modern or > > legacy interface. This is very tricky for the transitional device > > since it tries to reuse the same interface for both legacy and modern > > which make it very hard to be correct. E.g: > > > > 1) VERSION_1 trick won't work, e.g the spec allows to read device > > configuration space before FEATURE_OK. So legacy driver may assume a > > native endian. > > I am trying to address this in the spec. There was a fairly narrow > window during which guests accessed config space before > writing out features. Yes before FEATURES_OK but I think > asking that guests send the features to device > before poking at config space that depends on those > features is reasonable. I'm not sure I get this. For transitional devices, we have legacy interfaces so legacy drivers should work anyhow even without the fix. > > Similarly, we can also change QEMU to send > features to vdpa on config access that happens before > FEATURES_OK. This only works when the guest driver sets a feature before config space accessing. And then vDPA presents LE if VERSION_1 is negotiated? But at the API level, vhost-vDPA still needs to handle the case when VHOST_VDPA_GET_CONFIG is called before VHOST_SET_FEATURES. Mandating a LE looks a better way then QEMU can mediate in the middle, e.g converting to BE when necessary. > > > > > 2) SET_VRING_ENDIAN doesn't fix all the issue, there's still a > > question what endian it is before SET_VRING_ENDIAN (or we need to > > support userspace without SET_VRING_ENDIAN) > > ... > > > > Things will be simplified if we mandate the config_ops as the modern > > interface and provide the necessary mediation in the hypervisor. > > > > > what's the nomenclature for that, > > > quasi-transitional or broken-transitional? > > > > If we invent new API to clarify the modern/legacy and focus on the > > modern in the future, we probably don't need a name? > > OK. What will that API be like? Maybe a bit in PROTOCOL_FEATURES? Something like this, a bit via VHOST_GET_BACKEND_FEATURES. Thanks > > > > and what are the outstanding > > > issues you anticipate remaining? > > > > Basically we need to check the conformance listed in section 7.4 of the spec. > > > > > If it is device specific or vendor > > > driver specific, let it be. But I wonder if there's any generic one that > > > has to be fixed in vdpa core to support a truly transitional model. > > > > Two set of config_ops? E.g > > > > legacy_config_ops > > modern_config_ops > > > > But I'm not sure whether or not it's worthwhile. > > > > > > > > > > For 2), the key part is the datapath mediation, we can use shadow virtqueue. > > > Sure. For our use case, we'd care more in providing transitional rather > > > than being non-transitional. So, one device fits all. > > > > > > Thanks for all the ideas. This discussion is really useful. > > > > Appreciate for the discussion. > > > > Thanks > > > > > > > > Best, > > > -Siwei > > > > > > > >>> If we allow them to reject the VHOST_SET_VRING_ENDIAN or > > > >>> VHOST_SET_CONFIG_ENDIAN call, what could we do? We would still end up > > > >>> with same situation of either fail the guest, or trying to > > > >>> mediate/emulate, right? > > > >>> > > > >>> Not to mention VHOST_SET_VRING_ENDIAN is rarely supported by vhost > > > >>> today - few distro kernel has CONFIG_VHOST_CROSS_ENDIAN_LEGACY enabled > > > >>> and QEMU just ignores the result. vhost doesn't necessarily depend on > > > >>> it to determine endianness it looks. > > > >> I would like to suggest to add two new config ops get/set_vq_endian() > > > >> and get/set_config_endian() for vDPA. This is used to: > > > >> a) support VHOST_GET/SET_VRING_ENDIAN as MST suggested, and add > > > >> VHOST_SET/GET_CONFIG_ENDIAN for vhost_vdpa. > > > >> If the device has not implemented interface to set its endianess, then > > > >> no matter success or failure of SET_ENDIAN, QEMU knows the endian-ness > > > >> anyway. > > > > How can Qemu know the endian in this way? And if it can, there's no > > > > need for the new API? > > > > > > > >> In this case, if the device endianess does not match the guest, > > > >> there needs a mediation layer or fail. > > > >> b) ops->get_config_endian() can always tell the endian-ness of the > > > >> device config space after the vendor driver probing the device. So we > > > >> can use this ops->get_config_endian() for > > > >> MTU, MAC and other fields handling in vdpa_dev_net_config_fill() and we > > > >> don't need to set_features in vdpa_get_config_unlocked(), so no race > > > >> conditions. > > > >> Every time ops->get_config() returned, we can tell the endian by > > > >> ops-config_>get_endian(), we don't need set_features(xxx, 0) if features > > > >> negotiation not done. > > > >> > > > >> The question is: Do we need two pairs of ioctls for both vq and config > > > >> space? Can config space endian-ness differ from the vqs? > > > >> c) do we need a new netlink attr telling the endian-ness to user space? > > > > Generally, I'm not sure this is a good design consider it provides neither: > > > > > > > > Compatibility with the virtio spec > > > > > > > > nor > > > > > > > > Compatibility with the existing vhost API (VHOST_SET_VRING_ENDIAN) > > > > > > > > Thanks > > > > > > > >> Thanks, > > > >> Zhu Lingshan > > > >>>> or > > > >>>> > > > >>>> 3) revisit the idea of forcing modern only device which may simplify > > > >>>> things a lot > > > >>> I am not actually against forcing modern only config space, given that > > > >>> it's not hard for either QEMU or individual driver to mediate or > > > >>> emulate, and for the most part it's not conflict with the goal of > > > >>> offload or acceleration with vDPA. But forcing LE ring layout IMO > > > >>> would just kill off the potential of a very good use case. Currently > > > >>> for our use case the priority for supporting 0.9.5 guest with vDPA is > > > >>> slightly lower compared to live migration, but it is still in our TODO > > > >>> list. > > > >>> > > > >>> Thanks, > > > >>> -Siwei > > > >>> > > > >>>> which way should we go? > > > >>>> > > > >>>>> I hope > > > >>>>> it's not the latter, otherwise it loses the point to use vDPA for > > > >>>>> datapath acceleration. > > > >>>>> > > > >>>>> Even if its the former, it's a little weird for vendor device to > > > >>>>> implement a LE config space with BE ring layout, although still > > > >>>>> possible... > > > >>>> Right. > > > >>>> > > > >>>> Thanks > > > >>>> > > > >>>>> -Siwei > > > >>>>>> Thanks > > > >>>>>> > > > >>>>>> > > > >>>>>>>> So if you think we should add a vdpa_ops->get_endian(), > > > >>>>>>>> I will drop these comments in the next version of > > > >>>>>>>> series, and work on a new patch for get_endian(). > > > >>>>>>>> > > > >>>>>>>> Thanks, > > > >>>>>>>> Zhu Lingshan > > > >>>>>>> Guests don't get endian-ness from devices so this seems pointless. > > > >>>>>>> > > > >