Re: [PATCH 2/2] vDPA: conditionally read fields in virtio-net dev

Jason Wang <jasowang@xxxxxxxxxx> · Mon, 5 Sep 2022 11:54:52 +0800

On Fri, Sep 2, 2022 at 2:14 PM Michael S. Tsirkin <mst@xxxxxxxxxx> wrote:
>
> On Fri, Sep 02, 2022 at 02:03:22PM +0800, Jason Wang wrote:
> > On Fri, Aug 26, 2022 at 2:24 PM Si-Wei Liu <si-wei.liu@xxxxxxxxxx> wrote:
> > >
> > >
> > >
> > > On 8/22/2022 8:26 PM, Jason Wang wrote:
> > > > On Mon, Aug 22, 2022 at 1:08 PM Zhu, Lingshan <lingshan.zhu@xxxxxxxxx> wrote:
> > > >>
> > > >>
> > > >> On 8/20/2022 4:55 PM, Si-Wei Liu wrote:
> > > >>>
> > > >>> On 8/18/2022 5:42 PM, Jason Wang wrote:
> > > >>>> On Fri, Aug 19, 2022 at 7:20 AM Si-Wei Liu <si-wei.liu@xxxxxxxxxx>
> > > >>>> wrote:
> > > >>>>>
> > > >>>>> On 8/17/2022 9:15 PM, Jason Wang wrote:
> > > >>>>>> 在 2022/8/17 18:37, Michael S. Tsirkin 写道:
> > > >>>>>>> On Wed, Aug 17, 2022 at 05:43:22PM +0800, Zhu, Lingshan wrote:
> > > >>>>>>>> On 8/17/2022 5:39 PM, Michael S. Tsirkin wrote:
> > > >>>>>>>>> On Wed, Aug 17, 2022 at 05:13:59PM +0800, Zhu, Lingshan wrote:
> > > >>>>>>>>>> On 8/17/2022 4:55 PM, Michael S. Tsirkin wrote:
> > > >>>>>>>>>>> On Wed, Aug 17, 2022 at 10:14:26AM +0800, Zhu, Lingshan wrote:
> > > >>>>>>>>>>>> Yes it is a little messy, and we can not check _F_VERSION_1
> > > >>>>>>>>>>>> because of
> > > >>>>>>>>>>>> transitional devices, so maybe this is the best we can do for
> > > >>>>>>>>>>>> now
> > > >>>>>>>>>>> I think vhost generally needs an API to declare config space
> > > >>>>>>>>>>> endian-ness
> > > >>>>>>>>>>> to kernel. vdpa can reuse that too then.
> > > >>>>>>>>>> Yes, I remember you have mentioned some IOCTL to set the
> > > >>>>>>>>>> endian-ness,
> > > >>>>>>>>>> for vDPA, I think only the vendor driver knows the endian,
> > > >>>>>>>>>> so we may need a new function vdpa_ops->get_endian().
> > > >>>>>>>>>> In the last thread, we say maybe it's better to add a comment for
> > > >>>>>>>>>> now.
> > > >>>>>>>>>> But if you think we should add a vdpa_ops->get_endian(), I can
> > > >>>>>>>>>> work
> > > >>>>>>>>>> on it for sure!
> > > >>>>>>>>>>
> > > >>>>>>>>>> Thanks
> > > >>>>>>>>>> Zhu Lingshan
> > > >>>>>>>>> I think QEMU has to set endian-ness. No one else knows.
> > > >>>>>>>> Yes, for SW based vhost it is true. But for HW vDPA, only
> > > >>>>>>>> the device & driver knows the endian, I think we can not
> > > >>>>>>>> "set" a hardware's endian.
> > > >>>>>>> QEMU knows the guest endian-ness and it knows that
> > > >>>>>>> device is accessed through the legacy interface.
> > > >>>>>>> It can accordingly send endian-ness to the kernel and
> > > >>>>>>> kernel can propagate it to the driver.
> > > >>>>>> I wonder if we can simply force LE and then Qemu can do the endian
> > > >>>>>> conversion?
> > > >>>>> convert from LE for config space fields only, or QEMU has to forcefully
> > > >>>>> mediate and covert endianness for all device memory access including
> > > >>>>> even the datapath (fields in descriptor and avail/used rings)?
> > > >>>> Former. Actually, I want to force modern devices for vDPA when
> > > >>>> developing the vDPA framework. But then we see requirements for
> > > >>>> transitional or even legacy (e.g the Ali ENI parent). So it
> > > >>>> complicates things a lot.
> > > >>>>
> > > >>>> I think several ideas has been proposed:
> > > >>>>
> > > >>>> 1) Your proposal of having a vDPA specific way for
> > > >>>> modern/transitional/legacy awareness. This seems very clean since each
> > > >>>> transport should have the ability to do that but it still requires
> > > >>>> some kind of mediation for the case e.g running BE legacy guest on LE
> > > >>>> host.
> > > >>> In theory it seems like so, though practically I wonder if we can just
> > > >>> forbid BE legacy driver from running on modern LE host. For those who
> > > >>> care about legacy BE guest, they mostly like could and should talk to
> > > >>> vendor to get native BE support to achieve hardware acceleration,
> > > > The problem is the hardware still needs a way to know the endian of the guest?
> > > Hardware doesn't need to know. VMM should know by judging from VERSION_1
> > > feature bit negotiation and legacy interface access (with new code), and
> > > the target architecture endianness (the latter is existing QEMU code).
> > > >
> > > >>> few
> > > >>> of them would count on QEMU in mediating or emulating the datapath
> > > >>> (otherwise I don't see the benefit of adopting vDPA?). I still feel
> > > >>> that not every hardware vendor has to offer backward compatibility
> > > >>> (transitional device) with legacy interface/behavior (BE being just
> > > >>> one),
> > > > Probably, I agree it is a corner case, and dealing with transitional
> > > > device for the following setups is very challenge for hardware:
> > > >
> > > > - driver without IOMMU_PLATFORM support, (requiring device to send
> > > > translated request which have security implications)
> > > Don't get better suggestion for this one, but I presume this is
> > > something legacy guest users should be aware of ahead in term of
> > > security implications.
> >
> > Probably but I think this assumption will prevent the device from
> > being used in a production environment.
> >
> > >
> > > > - BE legacy guest on LE host, (requiring device to have a way to know
> > > > the endian)
> > > Yes. device can tell apart with the help from VMM (judging by VERSION_1
> > > acknowledgement and if legacy interface is used during negotiation).
> > >
> > > > - device specific requirement (e.g modern virtio-net mandate minimal
> > > > header length to contain mrg_rxbuf even if the device doesn't offer
> > > > it)
> > > This one seems to be spec mandated transitional interface requirement?
> >
> > Yes.
> >
> > > Which vDPA hardware vendor should take care of rather (if they do offer
> > > a transitional device)?
> >
> > Right but this is not the only one. Section 7.4 summaries the
> > transitional device conformance which is a very long list for vendor
> > to follow.
> >
> > > >
> > > > It is not obvious for the hardware vendor, so we may end up defecting
> > > > in the implementation. Dealing with compatibility for the transitional
> > > > devices is kind of a nightmare which there's no way for the spec to
> > > > rule the behavior of legacy devices.
> > > The compatibility detection part is tedious I agree. That's why I
> > > suggested starting from the very minimal and practically feasible (for
> > > e.g. on x86), but just don't prohibit the possibility to extend to big
> > > endian or come up with quirk fixes for various cases in QEMU.
> >
> > This is somehow we've already been done, e.g ali ENI is limited to x86.
> >
> > >
> > > >
> > > >>>   this is unlike the situation on software virtio device, which
> > > >>> has legacy support since day one. I think we ever discussed it before:
> > > >>> for those vDPA vendors who don't offer legacy guest support, maybe we
> > > >>> should mandate some feature for e.g. VERSION_1, as these devices
> > > >>> really don't offer functionality of the opposite side (!VERSION_1)
> > > >>> during negotiation.
> > > > I've tried something similar here (a global mandatory instead of per device).
> > > >
> > > > https://urldefense.com/v3/__https://lkml.org/lkml/2021/6/4/26__;!!ACWV5N9M2RV99hQ!NRQPfj5o9o3MKE12ze1zfXMC-9SqwOWqF26g8RrIyUDbUmwDIwl5WQCaNiDe6aZ2yR83j-NEqRXQNXcNyOo$
> > > >
> > > > But for some reason, it is not applied by Michael. It would be a great
> > > > relief if we support modern devices only. Maybe it's time to revisit
> > > > this idea then we can introduce new backend features and then we can
> > > > mandate VERSION_1
> > > Probably, mandating per-device should be fine I guess.
> > >
> > > >
> > > >>> Having it said, perhaps we should also allow vendor device to
> > > >>> implement only partial support for legacy. We can define "reversed"
> > > >>> backend feature to denote some part of the legacy
> > > >>> interface/functionality not getting implemented by device. For
> > > >>> instance, VHOST_BACKEND_F_NO_BE_VRING, VHOST_BACKEND_F_NO_BE_CONFIG,
> > > >>> VHOST_BACKEND_F_NO_ALIGNED_VRING,
> > > >>> VHOST_BACKEND_NET_F_NO_WRITEABLE_MAC, and et al. Not all of these
> > > >>> missing features for legacy would be easy for QEMU to make up for, so
> > > >>> QEMU can selectively emulate those at its best when necessary and
> > > >>> applicable. In other word, this design shouldn't prevent QEMU from
> > > >>> making up for vendor device's partial legacy support.
> > > > This looks too heavyweight since it tries to provide compatibility for
> > > > legacy drivers.
> > > That's just for the sake of extreme backward compatibility, but you can
> > > say that's not even needed if we mandate transitional device to offer
> > > all required interfaces for both legacy and modern guest.
> > >
> > > >   Considering we've introduced modern devices for 5+
> > > > years, I'd rather:
> > > >
> > > > - Qemu to mediate the config space stuffs
> > > > - Shadow virtqueue to mediate the datapath (AF_XDP told us shadow ring
> > > > can perform very well if we do zero-copy).
> > > This is one way to achieve, though not sure we should stick the only
> > > hope to zero-copy, which IMHO may take a long way to realize and
> > > optimize to where a simple datapath passthrough can easily get to (with
> > > hardware acceleration of coz).
> >
> > Note that, current shadow virtqueue is zerocopy, Qemu just need to
> > forward the descriptors.
> >
> > >
> > > >
> > > >>>> 2) Michael suggests using VHOST_SET_VRING_ENDIAN where it means we
> > > >>>> need a new config ops for vDPA bus, but it doesn't solve the issue for
> > > >>>> config space (at least from its name). We probably need a new ioctl
> > > >>>> for both vring and config space.
> > > >>> Yep adding a new ioctl makes things better, but I think the key is not
> > > >>> the new ioctl. It's whether or not we should enforce every vDPA vendor
> > > >>> driver to implement all transitional interfaces to be spec compliant.
> > > > I think the answer is no since the spec allows transitional device.
> > > > And we know things will be greatly simplified if vDPA support non
> > > > transitional device only.
> > > >
> > > > So we can change the question to:
> > > >
> > > > 1) do we need (or is it too late) to enforce non transitional device?
> > > We already have Alibaba ENI which is sort of a quasi-transitional
> > > device, right? In the sense it doesn't advertise VERSION_1. I know the
> > > other parts might not qualify it to be fully transitional, but code now
> > > doesn't prohibit it from supporting VERSION_1 modern interface depending
> > > on whatever future need.
> >
> > We can ask ENI developer for their future plan, it looks to me a
> > legacy only device wouldn't be interested in the future.
> >
> > Zongyong, do you have the plan to implement device with VERSION_1 support?
> >
> > > > 2) if yes, can transitional device be mediate in an efficient way?
> > > >
> > > > For 1), it's probably too late but we can invent new vDPA features as
> > > > you suggest to be non transitional. Then we can:
> > > >
> > > > 1.1) extend the netlink API to provision non-transitonal device
> > > Define non-transitional: device could be either modern-only or legacy-only?
> >
> > According to the spec, non-transitional should be modern only.
> >
> > > > 1.2) work on the non-transtional device in the future
> > > > 1.3) presenting transitional device via mediation
> > > presenting transitional on top of a modern device with VERSION_1, right?
> >
> > Yes, I mean presenting/mediating a transitional device on top of a
> > non-transitional device.
> >
> > > What if the hardware device can support legacy-compatible datapath
> > > natively that doesn't need mediation? Can it be done with direct
> > > datapath passthrough without svq involvement?
> >
> > I'd like to avoid supporting legacy-only device like ENI in the
> > future. The major problem is that it's out of the spec thus the
> > behaviour is defined by the code not the spec.
> >
> > >
> > > >
> > > > The previous transitional vDPA work as is, it's probably too late to
> > > > fix all the issue we suffer.
> > > What do you mean work as-is,
> >
> > See above, basically I mean the behaviour is defined by the vDPA code
> > not (or can't) by the spec.
> >
> > For example, for virtio-pci, we have:
> >
> > legacy interface: BAR
> > modern interface: capability
> >
> > So a transitional device can simple provide both of those interfaces:
> > E.g for device configuration space, if it is accessed via legacy
> > interface device know it needs to provide the config with native
> > endian otherwise little endian when accessing via modern interface.
> >
> > For virtio-mmio, it looks to me it doesn't provide a way for
> > transitional device.
> >
> > For vDPA, we actually don't define whether config_ops is a modern or
> > legacy interface. This is very tricky for the transitional device
> > since it tries to reuse the same interface for both legacy and modern
> > which make it very hard to be correct. E.g:
> >
> > 1) VERSION_1 trick won't work, e.g the spec allows to read device
> > configuration space before FEATURE_OK. So legacy driver may assume a
> > native endian.
>
> I am trying to address this in the spec. There was a fairly narrow
> window during which guests accessed config space before
> writing out features. Yes before FEATURES_OK but I think
> asking that guests send the features to device
> before poking at config space that depends on those
> features is reasonable.

I'm not sure I get this. For transitional devices, we have legacy
interfaces so legacy drivers should work anyhow even without the fix.

>
> Similarly, we can also change QEMU to send
> features to vdpa on config access that happens before
> FEATURES_OK.

This only works when the guest driver sets a feature before config
space accessing. And then vDPA presents LE if VERSION_1 is negotiated?
But at the API level, vhost-vDPA still needs to handle the case when
VHOST_VDPA_GET_CONFIG is called before VHOST_SET_FEATURES.

Mandating a LE looks a better way then QEMU can mediate in the middle,
e.g converting to BE when necessary.

>
>
>
> > 2) SET_VRING_ENDIAN doesn't fix all the issue, there's still a
> > question what endian it is before SET_VRING_ENDIAN (or we need to
> > support userspace without SET_VRING_ENDIAN)
> > ...
> >
> > Things will be simplified if we mandate the config_ops as the modern
> > interface and provide the necessary mediation in the hypervisor.
> >
> > > what's the nomenclature for that,
> > > quasi-transitional or broken-transitional?
> >
> > If we invent new API to clarify the modern/legacy and focus on the
> > modern in the future, we probably don't need a name?
>
> OK. What will that API be like? Maybe a bit in PROTOCOL_FEATURES?

Something like this, a bit via VHOST_GET_BACKEND_FEATURES.

Thanks

>
> > > and what are the outstanding
> > > issues you anticipate remaining?
> >
> > Basically we need to check the conformance listed in section 7.4 of the spec.
> >
> > > If it is device specific or vendor
> > > driver specific, let it be. But I wonder if there's any generic one that
> > > has to be fixed in vdpa core to support a truly transitional model.
> >
> > Two set of config_ops? E.g
> >
> > legacy_config_ops
> > modern_config_ops
> >
> > But I'm not sure whether or not it's worthwhile.
> >
> > > >
> > > > For 2), the key part is the datapath mediation, we can use shadow virtqueue.
> > > Sure. For our use case, we'd care more in providing transitional rather
> > > than being non-transitional. So, one device fits all.
> > >
> > > Thanks for all the ideas. This discussion is really useful.
> >
> > Appreciate for the discussion.
> >
> > Thanks
> >
> > >
> > > Best,
> > > -Siwei
> > > >
> > > >>> If we allow them to reject the VHOST_SET_VRING_ENDIAN  or
> > > >>> VHOST_SET_CONFIG_ENDIAN call, what could we do? We would still end up
> > > >>> with same situation of either fail the guest, or trying to
> > > >>> mediate/emulate, right?
> > > >>>
> > > >>> Not to mention VHOST_SET_VRING_ENDIAN is rarely supported by vhost
> > > >>> today - few distro kernel has CONFIG_VHOST_CROSS_ENDIAN_LEGACY enabled
> > > >>> and QEMU just ignores the result. vhost doesn't necessarily depend on
> > > >>> it to determine endianness it looks.
> > > >> I would like to suggest to add two new config ops get/set_vq_endian()
> > > >> and get/set_config_endian() for vDPA. This is used to:
> > > >> a) support VHOST_GET/SET_VRING_ENDIAN as MST suggested, and add
> > > >> VHOST_SET/GET_CONFIG_ENDIAN for vhost_vdpa.
> > > >> If the device has not implemented interface to set its endianess, then
> > > >> no matter success or failure of SET_ENDIAN, QEMU knows the endian-ness
> > > >> anyway.
> > > > How can Qemu know the endian in this way? And if it can, there's no
> > > > need for the new API?
> > > >
> > > >> In this case, if the device endianess does not match the guest,
> > > >> there needs a mediation layer or fail.
> > > >> b) ops->get_config_endian() can always tell the endian-ness of the
> > > >> device config space after the vendor driver probing the device. So we
> > > >> can use this ops->get_config_endian() for
> > > >> MTU, MAC and other fields handling in vdpa_dev_net_config_fill() and we
> > > >> don't need to set_features in vdpa_get_config_unlocked(), so no race
> > > >> conditions.
> > > >> Every time ops->get_config() returned, we can tell the endian by
> > > >> ops-config_>get_endian(), we don't need set_features(xxx, 0) if features
> > > >> negotiation not done.
> > > >>
> > > >> The question is: Do we need two pairs of ioctls for both vq and config
> > > >> space? Can config space endian-ness differ from the vqs?
> > > >> c) do we need a new netlink attr telling the endian-ness to user space?
> > > > Generally, I'm not sure this is a good design consider it provides neither:
> > > >
> > > > Compatibility with the virtio spec
> > > >
> > > > nor
> > > >
> > > > Compatibility with the existing vhost API (VHOST_SET_VRING_ENDIAN)
> > > >
> > > > Thanks
> > > >
> > > >> Thanks,
> > > >> Zhu Lingshan
> > > >>>> or
> > > >>>>
> > > >>>> 3) revisit the idea of forcing modern only device which may simplify
> > > >>>> things a lot
> > > >>> I am not actually against forcing modern only config space, given that
> > > >>> it's not hard for either QEMU or individual driver to mediate or
> > > >>> emulate, and for the most part it's not conflict with the goal of
> > > >>> offload or acceleration with vDPA. But forcing LE ring layout IMO
> > > >>> would just kill off the potential of a very good use case. Currently
> > > >>> for our use case the priority for supporting 0.9.5 guest with vDPA is
> > > >>> slightly lower compared to live migration, but it is still in our TODO
> > > >>> list.
> > > >>>
> > > >>> Thanks,
> > > >>> -Siwei
> > > >>>
> > > >>>> which way should we go?
> > > >>>>
> > > >>>>> I hope
> > > >>>>> it's not the latter, otherwise it loses the point to use vDPA for
> > > >>>>> datapath acceleration.
> > > >>>>>
> > > >>>>> Even if its the former, it's a little weird for vendor device to
> > > >>>>> implement a LE config space with BE ring layout, although still
> > > >>>>> possible...
> > > >>>> Right.
> > > >>>>
> > > >>>> Thanks
> > > >>>>
> > > >>>>> -Siwei
> > > >>>>>> Thanks
> > > >>>>>>
> > > >>>>>>
> > > >>>>>>>> So if you think we should add a vdpa_ops->get_endian(),
> > > >>>>>>>> I will drop these comments in the next version of
> > > >>>>>>>> series, and work on a new patch for get_endian().
> > > >>>>>>>>
> > > >>>>>>>> Thanks,
> > > >>>>>>>> Zhu Lingshan
> > > >>>>>>> Guests don't get endian-ness from devices so this seems pointless.
> > > >>>>>>>
> > >
>