Re: [PATCH V2 2/3] vdpa_sim_net: support feature provisioning

Jason Wang <jasowang@xxxxxxxxxx> · Tue, 18 Oct 2022 15:45:39 +0800

在 2022/10/18 02:43, Si-Wei Liu 写道:


On 10/13/2022 12:10 AM, Jason Wang wrote:

在 2022/10/7 08:35, Si-Wei Liu 写道:


On 9/28/2022 9:55 PM, Jason Wang wrote:
On Tue, Sep 27, 2022 at 5:41 PM Si-Wei Liu <si-wei.liu@xxxxxxxxxx> 
wrote:


On 9/26/2022 8:59 PM, Jason Wang wrote:

On Tue, Sep 27, 2022 at 9:02 AM Si-Wei Liu <si-wei.liu@xxxxxxxxxx> 
wrote:


On 9/26/2022 12:11 AM, Jason Wang wrote:

On Sat, Sep 24, 2022 at 4:01 AM Si-Wei Liu <si-wei.liu@xxxxxxxxxx> 
wrote:


On 9/21/2022 7:43 PM, Jason Wang wrote:

This patch implements features provisioning for vdpa_sim_net.

1) validating the provisioned features to be a subset of the parent
     features.
2) clearing the features that is not wanted by the userspace

For example:

# vdpa mgmtdev show
vdpasim_net:
    supported_classes net
    max_supported_vqs 3
    dev_features MTU MAC CTRL_VQ CTRL_MAC_ADDR ANY_LAYOUT 
VERSION_1 ACCESS_PLATFORM

Sighs, not to blame any one and it's perhaps too late, but this
"dev_features" attr in "mgmtdev show" command output should have been
called "supported_features" in the first place.

Not sure I get this, but I guess this is the negotiated features 
actually.

Actually no, that is why I said the name is a bit confusing and 
"supported_features" might sound better.

You're right, it's an mgmtdev show actually.

This attribute in the parent device (mgmtdev) denotes the real 
device capability for what virtio features can be supported by the 
parent device. Any unprivileged user can check into this field to 
know parent device's capability without having to create a child 
vDPA device at all. The features that child vDPA device may 
support should be a subset of, or at most up to what the parent 
device offers. For e.g. the vdpa device dev1 you created below can 
expose less or equal device_features bit than 0x308820028 (MTU MAC 
CTRL_VQ CTRL_MAC_ADDR ANY_LAYOUT VERSION_1 ACCESS_PLATFORM), but 
shouldn't be no more than what the parent device can actually 
support.

Yes, I didn't see anything wrong with "dev_features",

Yep, it didn't appear to me anything wrong either at first sight, 
then I gave my R-b on the series introduced this attribute. But 
it's not a perfect name, either, on the other hand. Parav later 
pointed out that the corresponding enum definition for this 
attribute should follow pre-existing naming convention that we 
should perhaps do 
s/VDPA_ATTR_DEV_SUPPORTED_FEATURES/VDPA_ATTR_MGMTDEV_SUPPORTED_FEATURES/ 
to get it renamed, as this is a mgmtdev level attribute, which I 
agree. Now that with the upcoming "device_features" attribute 
(vdpa dev level) from this series, it's subject to another 
confusions between these two similar names, but actually would 
represent things at different level. While all other attributes in 
"mgmtdev dev show" seem to be aligned with the "supported_" 
prefix, e.g. supported_classes, max_supported_vqs, from which I 
think the stance of device is already implied through "mgmtdev" in 
the command. For the perspective of clarify and easy distinction, 
"supported_features" seems to be a better name than "dev_features".
See another reply, I think I get your point,

1) VDPA_ATTR_VDPA_DEV_SUPPORTED_FEATURES (lingshan's series) and
VDPA_ATTR_VDPA_DEV_FEATURES should be equivalent and unified to a
single attribute.
2) A better name to "supported_features" should be fine, patches 
are welcomed

  it aligns to the
virtio spec which means the features could be used to create a vdpa
device. But if everyone agree on the renaming, I'm fine.

Never mind, if it's late don't have to bother.



I think Ling Shan is working on reporting both negotiated features
with the device features.

Does it imply this series is connected to another work in 
parallel? Is it possible to add a reference in the cover letter?

I'm not sure, I remember Ling Shan did some work to not block the
config show in this commit:

commit a34bed37fc9d3da319bb75dfbf02a7d3e95e12de
Author: Zhu Lingshan <lingshan.zhu@xxxxxxxxx>
Date:   Fri Jul 22 19:53:07 2022 +0800

     vDPA: !FEATURES_OK should not block querying device config space

We need some changes in the vdpa tool to show device_features
unconditionally in the "dev config show" command.

That's true, I think I ever pointed it out to Lingshan before, 
that it's not needed to bother exposing those config space fields 
in "dev config show" output, if the only intent is for live 
migration of device features between nodes. For vDPA live 
migration, what cares most is those configuration parameters 
specified on vdpa creation, and userspace VMM (QEMU) is supposed 
to take care of saving and restoring live device states. I think 
it's easier to extend "vdpa dev show" output to include 
device_features and other config params as well, rather than count 
on validity of various config space fields.
Probably, but for the migration it's more about the ability of the
mgmtdev instead of the vDPA device itself I think.
If picking the appropriate device for migration is what it is 
concerned, it's subject to the capability that mgmtdev offers. 
That's true, for sure.

On the other hand, mgmt software would also need to take care of 
reconstructing the destination device with the same configuration as 
that at the source side. For example, a mgmtdev on source supports 
features A, B, C, D,  and destination mgmtdev supports features B, 
C, D, E. When vdpa device on the source is initially created with 
features B and C but without feature D (noted: creation with a 
subset of mgmtdev features was already supported before, and this 
series just makes it more explicit), the mgmt software is supposed 
to equally create a device with features B and C on dest, even 
though the destination may support feature D that the mgmtdev on 
both sides can support. My point is, we should have some flexibility 
on the mgmt software implementation that allows the mgmt software to 
easily tell apart the exact features (i.e. B and C in the above 
example) and the exact configuration a specific vdpa device was 
originally created with, via some simple query command. Having mgmt 
software to remember the vdpa creation args could work, but on the 
other hand it'd be nice to allow lightweight software implementation 
without having to persist the list of vdpa args and make vdpa tool 
more self-contained.

I will post a patch (series) shortly to demonstrate this idea. 
Basically, there's no actual need to mess around with those config 
space values for live migration. It was not built for that purpose.


Ok, let's see.




https://lore.kernel.org/virtualization/454bdf1b-daa1-aa67-2b8c-bc15351c1851@xxxxxxxxxx/ 


It's not just insufficient, but sometimes is incorrect to create 
vDPA device using the config space fields.  For instance, MAC 
address in config space can be changed temporarily (until device 
reset) via ctrl_vq VIRTIO_NET_CTRL_MAC_ADDR_SET command. It's 
incorrect to create vDPA using the MAC address shown in the config 
space.
I think it's still a must for create the mac with the exact mac 
address:

1) VIRTIO_NET_F_CTRL_MAC is not a must
2) there's no way for us to know whether or not the mac has been 
changed
Noted I think here we are still talking about VERSION_1 device which 
is spec conforming. So far as I understand the spec, if the 
VIRTIO_NET_F_CTRL_MAC feature is not negotiated, there's no way for 
driver to change the default MAC address.


For 1.0 device yes.



Even if we want to simulate or support a legacy device model, when 
MAC address is changed by legacy driver somehow, QEMU should be able 
to detect this and issue a vdpa ioctl call to change the MAC address 
filter underneath. I don't see that it ever happens in today's code, 
so I presume the only possibility this may work is that the specific 
vDPA device may have an internal learning bridge that adapts to what 
MAC address the driver actually sends. 


This is not true AFAIK. E.g when switchdev is enabled for mlx5 parent.
Hmmm, I guess you mean switchdev mode with external learning bridge 
software e.g. Open vSwitch? It's conceptionally the same with device 
internal learning bridge, no?


I was told by Eli that when switchdev is enabled there will be no 
learning bridge. I might be wrong, cc Eli for more comments.



OK, though the point was that QEMU should anyway notify the backend of 
the mac address change for vDPA driver to apply the new MAC filter, 
similar to the way how CTRL_MAC ctrl_vq command is doing.


Yes.


It should not blindly assume that the every vDPA hardware may have 
underlying learning bridge construct, being internal or external. 
Basically it's not a universal assumption on the existence of learning 
bridge, this won't work for any other vDPA NIC without switchdev or 
learning bridge.



In this case, since the device doesn't care, recreate with the MAC 
in use is not needed, and technically it is even incorrect. In data 
centers or cloud environment, MAC address is usually controlled and 
managed by some central entity/service. If a driver can dominate the 
MAC address in use by deliberately overriding the default MAC and 
bypassing the central rule via live migration, that'd be a more 
severe security issue to address in the first place.


There used to be a discussion to allow trust and spoof check as what 
SR-IOV did. For safety, we can filter out CTRL_MAC right now. But I 
think it's something out of the scope for this discussion.
Sorry I don't get what you mean here, but I guess we may talk about 
different thing here (it seems you talked about trusted model that 
allows any MAC address, but it's orthogonal to programming MAC address 
filter to the underlying hardware as far as I understand).


Probably, what I meant is that, most SRIOV vendor allow to forbid change 
VF mac addresses and other filter for safety.




But I still don't get what's wrong with have the same mac address 
provisioned in both src and dst.
It looks like we may have misunderstood each other - that's exactly 
the point I want to make. The persistent mac address provisioned in 
dst by the mgmt software should stay the same with that on src, which 
is the default mac address rather than whatever other (temporary) mac 
address the VM might be using at the time of migration switchover at 
the source.


Yes, I think we are at the same page now.



It is the model used currently (e.g libvirt will guarantee the same 
mac in both src and dst). The mgmt software can then guarantee that 
the mac was fetched from the centralized manager.
Right, this is exactly the way our mgmt software works.

And we can't depend purely on the migration since in some case it 
can't work: e.g src MTU 4000 dst MTU 1500, migration will fail and 
mgmt stack need to provision an 4000 to work.
This seems like a bug of libvirt. For our case, we strictly prohibit 
unequal MTU on src & dst to work. Even migrating from MTU 1500 to 
4000, it effectively changes the underlying behavior for packet 
dropping and network setup at which maximum size the packet should be 
allowed when entering the VM.


Yes, this means the management layer should guarantee exact the same 
attribute for vDPA provisioned in both source and destination.





3) migration code can restore the mac during restore

So exactly the same mac address is still required. (This is the same
as we are doing for live migration on software virtio)

  Another example, if the source vDPA device has MAC address table 
size limit of 100, then in the destination we should pick parent 
device with size limit no smaller than that, and create vDPA on 
remote node matching the exact same size. There's nothing config 
space field can assist here.
Two ways:

1) mgmtdev should show the mac table size so mgmt layer can provision
the mac table size
2) If the mac table exceeds what is supported in the destination, it
needs to enable the all uni in this case.
Yep, so no field in the config space can help with these two 
solutions, right? MAC table size is not showing up there. Whether 
the parent device supports ALLUNI via VIRTIO_NET_F_CTRL_RX is not 
there, either. (showing them in the 'vdpa mgmtdev show' output is 
the right thing IMHO).


One example further, in the future, if we are going to introduce 
mandatory feature (for e.g. VERSION_1, RING_PACKED) that the 
device is unable to support the opposite case, the destination 
device should be created with equally same mandatory device 
features, which only vDPA creation parameters should matter. While 
I can't think of a case that the mgmt software or live migration 
tool would have to count on config space fields only.
Yes, in this case we need to introduce new netlink attributes for both
getting mandatory features from the management device and provisioning
those manadating features.
True, management device level thing again, not related to anything 
in the config space.





1) provision vDPA device with all features that are supported by the
     net simulator

# vdpa dev add name dev1 mgmtdev vdpasim_net
# vdpa dev config show
dev1: mac 00:00:00:00:00:00 link up link_announce false mtu 1500
    negotiated_features MTU MAC CTRL_VQ CTRL_MAC_ADDR VERSION_1 
ACCESS_PLATFORM

Maybe not in this patch, but for completeness for the whole series,
could we also add device_features to the output?

Lingshan, could you please share your thoughts or patch on this?

Noted here the device_features argument specified during vdpa 
creation is introduced by this series itself, it somehow slightly 
changed the original semantics of what device_features used to be.

I'm not sure I get this, we don't support device_features in the past
and it is used to provision device features to the vDPA device which
seems to be fine.

Before this change, only look at the dev_features in "mgmtdev 
show" and remember creation parameters is sufficient to get to all 
needed info for creating vDPA at destination.
Note that even with the same vendor, mgmtdev may support different 
features.

After this change, dev_features in "mgmtdev show" becomes less 
relevant, as it would need to remember vdpa creation parameters 
plus the device_features attribute. While this series allows cross 
vendor live migration, it would complicate the implementation of 
mgmt software, on the other hand.
To allow cross vendor live migration I couldn't find a better way.
The idea itself is great, I think, though the CLI interface may have 
some space for improvement. For example, user has to supply the 
heximal value consisting of each feature bit, which is a bit 
challenging for normal users who are not super familiar with each 
virtio feature. On the other hand, there could be ambiguity against 
other vdpa create option, e.g. users may do "vdpa dev add name vdpa0 
mgmtdev ... mtu 1500 device_features 0x300020000" (no F_MTU feature 
bit in device_features) that needs special validation in the code.


We can accept e.g XML in the future I think.
Regardless of XML being a good interface or not for end users, but I 
don't see how it relates to the issue here i.e. conflict/ambiguity or 
extra validation in the (kernel) code.


It's only about choosing a suitable interface between vdpa tool and 
mangment layer instead of solely depending on the command line arguments 
since we may support a lot of different features in the future.






How about we follow the CPU flags model or QEMU virtio-net-pci args 
to define property representing each feature bit? I think the 
current convention for each "vdpa dev add" option implies that the 
corresponding feature bit will be enabled once specified in 
creation. Similarly we can introduce additional option representing 
each feature bit, along with a new features_default property 
denoting the initial value for device_feature bits:

# vdpa mgmtdev show
vdpasim_net:
  supported_classes net
  max_supported_vqs 3
  dev_features MTU MAC CTRL_VQ CTRL_MAC_ADDR ANY_LAYOUT VERSION_1 
ACCESS_PLATFORM
# vdpa dev add name dev1 mgmtdev vdpasim_net features_default off \
                          csum on guest_csum on mtu 2000 ctrl_vq on 
version1 on access_platform on
# vdpa dev show
dev1: type network mgmtdev vdpasim_net vendor_id 0 max_vqs 3 
max_vq_size 256
   features_default off mtu 2000
   device_features CSUM GUEST_CSUM MTU CTRL_VQ VERSION_1 
ACCESS_PLATFORM

If the features_default property is left unspecified or with the 
"inherited" value, it would just inherit all of the 
supported_features from mgmtdev (which is already the case of today).

# vdpa dev add name dev1 mgmtdev vdpasim_net features_default inherited
# vdpa dev show
dev1: type network mgmtdev vdpasim_net vendor_id 0 max_vqs 3 
max_vq_size 256
  features_default inherited
  device_features MTU MAC CTRL_VQ CTRL_MAC_ADDR ANY_LAYOUT VERSION_1 
ACCESS_PLATFORM

I can definitely help implement this model if you find it fits.


I prefer XML since it could be reused and we may exceed 64bit 
limitation in the future. But we can hear from others.
I don't see how this can be re-used by QEMU as QMP is not 
taking/exporting XML. Are we talking about libvirt here, which happens 
to be one amongst many others? My personal feeling is that not quite a 
lot of human end users (rather than management software or script) 
today would prefer using XML. Instead, like any other iproute utility, 
json seems to a more popular interface for script and mgmt software to 
consume, which vdpa tool supports natively already.

I think basically we can support two set of CLI interfaces, one 
friendly enough for human users, and another scriptable and parseable 
by management software users. IMO for now we should start to focus on 
the human oriented CLI design first. Whether XML v.s. json being an 
ideal interface for managment software would be another discussion 
that would need more inputs from the broader range of extended 
community, which is not worth distracting from.


That's fine, is the end user is more about a program but a human, json 
should be better.

Thanks








When simply look at the "vdpa dev config show" output, I cannot 
really
tell the actual device_features that was used in vdpa creation. 
For e.g.
there is a missing feature ANY_LAYOUT from negotiated_features 
compared
with supported_features in mgmtdev, but the orchestration software
couldn't tell if the vdpa device on destination host should be 
created
with or without the ANY_LAYOUT feature.

I think VERSION_1 implies ANY_LAYOUT.

Right, ANY_LAYOUT is a bad example. A good example might be that, 
I knew the parent mgmtdev on migration source node supports 
CTRL_MAC_ADDR, but I don't find it in negotiated_features.

I think we should use the features that we got from "mgmtdev show"
instead of "negotiated features".

That was how it's supposed to work previously, but with this 
series, I think the newly introduced device_features will be 
needed instead of the one in "mgmtdev show".
Just to clarify, there won't be a device_features in mgmtdev show
since it is device specific, each individual device can have its own
device features which are subset of what is supported by the mgmtdev.
Yep.


On the migration destination node, the parent device does support 
all features as the source offers, including CTRL_MAC_ADDR. What 
device features you would expect the mgmt software to create 
destination vdpa device with, if not otherwise requiring mgmt 
software to remember all the arguments on device creation?
So the provisioning in the destination should use exactly the same
device_feautres as what the vdpa device has in the source. But before
this, management layer should guarantee to provision a vDPA device
whose device_features can be supported on the destination in order to
allow live migration.
Exactly.

So in this example, we need use "dev_features" so we get exact the
same features after and operation as either src or dst.

If the device_features vDPA created with at the source doesn't 
include CTRL_MAC_ADDR even though parent supports it, then the 
vDPA to be created at the destination shouldn't come with 
CTRL_MAC_ADDR either, regardless of whether or not CTRL_MAC_ADDR 
is present in destination "mgmtdev show".

However, if just taking look at negotiated_features, some mgmt 
software implementations which don't persist the creation 
parameters can't get the device features a certain vDPA device at 
the source node was created with.


SOURCE# vdpa mgmtdev show
vdpasim_net:
    supported_classes net
    max_supported_vqs 3
    dev_features MTU MAC CTRL_VQ CTRL_MAC_ADDR ANY_LAYOUT 
VERSION_1 ACCESS_PLATFORM
SOURCE# vdpa dev config show
dev1: mac 00:00:00:00:00:00 link up link_announce false mtu 1500
    negotiated_features MTU MAC CTRL_VQ VERSION_1 ACCESS_PLATFORM

DESTINATION# vdpa mgmtdev show
vdpasim_net:
    supported_classes net
    max_supported_vqs 3
    dev_features MTU MAC CTRL_VQ CTRL_MAC_ADDR ANY_LAYOUT 
VERSION_1 ACCESS_PLATFORM

  But it should be sufficient to
use features_src & feature_dst in this case. Actually, it should work
similar as to the cpu flags, the management software should introduce
the concept of cluster which means the maximal set of common features
is calculated and provisioned during device creation to allow
migration among the nodes inside the cluster.

Yes, this is one way mgmt software may implement, but I am not 
sure if it's the only way. For e.g. for cpu flags, mgmt software 
can infer the guest cpus features in use from all qemu command 
line arguments and host cpu features/capability, which doesn't 
need to remember creation arguments and is easy to recover from 
failure without having to make the VM config persistent in data 
store. I thought it would be great if vdpa CLI design could offer 
the same.

One minor difference is that we have cpu model abstraction, so we can
have things like:

./qemu-system-x86_64 -cpu EPYC

Which implies the cpu features/flags where vDPA doesn't have. But
consider it's just a 64bit (or 128 in the future), it doesn't 
seems to
be too complex for the management to know, we probably need to start
from this and then we can try to introduce some generation/model 
after
it is agreed on most of the vendors.

What you refer to is the so-called named model for CPU flags. I 
think it's a good addition to have some generation or named model 
defined for vDPA. But I don't get the point for how it relates to 
exposing the actual value of device features? Are you saying in 
this case you'd rather expose the model name than the actual value 
of feature bits? Well, I think we can expose both in different 
fields when there's really such a need.
It's something like:

vdpa dev add name dev1 mgmtdev vdpasim_net device_features 
VDPA_NET_MODEL_1

and VDPA_NET_MODEL_1 implies some feature sets.

Not sure if this could be very useful for virtio devices, given we 
don't have a determined set of virtio features unlike CPU 
generation/model, but I think even with the features_default 
property we can still achieve that.

-Siwei


Yes.
Let me get some time to implement and post the relevant patches 
(mostly in iproute) later. Basically this would supplement those 
config attributes introduced through the 'vdpa dev show' series below 
[1] with provisioned device features to reference:

[1] 
https://lore.kernel.org/virtualization/1665793690-28120-1-git-send-email-si-wei.liu@xxxxxxxxxx/


Thanks,
-Siwei


Thanks




BTW with regard to the cpu model in mgmt software implementation, 
the one implemented in libvirt is a mixed "Host model" [1] with 
taking advantage of QEMU named model and exposing additional 
individual CPU features that gets close to what host CPU offers.
So creating vDPA device without "device_features" is somehow the host
model, it will have all features that is supported by the parent.

I think this implies that mgmt software should have to understand 
what the model name really means in terms of individual CPU 
features, so having feature bit value exposed will just do more 
help if vDPA goes the same way.
Exactly.

Thanks


Regards,
-Siwei

[1] 
https://qemu-project.gitlab.io/qemu/system/qemu-cpu-models.html#two-ways-to-configure-cpu-models-with-qemu-kvm


Thanks

Thanks,
-Siwei


Thanks

Thanks,
-Siwei


2) provision vDPA device with a subset of the features

# vdpa dev add name dev1 mgmtdev vdpasim_net device_features 
0x300020000
# vdpa dev config show
dev1: mac 00:00:00:00:00:00 link up link_announce false mtu 1500
    negotiated_features CTRL_VQ VERSION_1 ACCESS_PLATFORM

Reviewed-by: Eli Cohen <elic@xxxxxxxxxx>
Signed-off-by: Jason Wang <jasowang@xxxxxxxxxx>
---
   drivers/vdpa/vdpa_sim/vdpa_sim_net.c | 11 ++++++++++-
   1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim_net.c 
b/drivers/vdpa/vdpa_sim/vdpa_sim_net.c
index 886449e88502..a9ba02be378b 100644
--- a/drivers/vdpa/vdpa_sim/vdpa_sim_net.c
+++ b/drivers/vdpa/vdpa_sim/vdpa_sim_net.c
@@ -254,6 +254,14 @@ static int vdpasim_net_dev_add(struct 
vdpa_mgmt_dev *mdev, const char *name,
       dev_attr.work_fn = vdpasim_net_work;
       dev_attr.buffer_size = PAGE_SIZE;

+     if (config->mask & BIT_ULL(VDPA_ATTR_DEV_FEATURES)) {
+             if (config->device_features &
+                 ~dev_attr.supported_features)
+                     return -EINVAL;
+             dev_attr.supported_features &=
+                      config->device_features;
+     }
+
       simdev = vdpasim_create(&dev_attr);
       if (IS_ERR(simdev))
               return PTR_ERR(simdev);
@@ -294,7 +302,8 @@ static struct vdpa_mgmt_dev mgmt_dev = {
       .id_table = id_table,
       .ops = &vdpasim_net_mgmtdev_ops,
       .config_attr_mask = (1 << VDPA_ATTR_DEV_NET_CFG_MACADDR |
-                          1 << VDPA_ATTR_DEV_NET_CFG_MTU),
+                          1 << VDPA_ATTR_DEV_NET_CFG_MTU |
+                          1 << VDPA_ATTR_DEV_FEATURES),
       .max_supported_vqs = VDPASIM_NET_VQ_NUM,
       .supported_features = VDPASIM_NET_FEATURES,
   };







_______________________________________________
Virtualization mailing list
Virtualization@xxxxxxxxxxxxxxxxxxxxxxxxxx
https://lists.linuxfoundation.org/mailman/listinfo/virtualization