On Thu, Oct 19, 2023 at 7:21 AM Si-Wei Liu <si-wei.liu@xxxxxxxxxx> wrote: > > > > On 10/18/2023 4:14 AM, Eugenio Perez Martin wrote: > > On Wed, Oct 18, 2023 at 10:44 AM Si-Wei Liu <si-wei.liu@xxxxxxxxxx> wrote: > >> > >> > >> On 10/17/2023 10:27 PM, Jason Wang wrote: > >>> On Wed, Oct 18, 2023 at 12:36 PM Si-Wei Liu <si-wei.liu@xxxxxxxxxx> wrote: > >>>> > >>>> On 10/16/2023 7:35 PM, Jason Wang wrote: > >>>>> On Tue, Oct 17, 2023 at 4:30 AM Si-Wei Liu <si-wei.liu@xxxxxxxxxx> wrote: > >>>>>> On 10/16/2023 4:28 AM, Eugenio Perez Martin wrote: > >>>>>>> On Mon, Oct 16, 2023 at 8:33 AM Jason Wang <jasowang@xxxxxxxxxx> wrote: > >>>>>>>> On Fri, Oct 13, 2023 at 3:36 PM Si-Wei Liu <si-wei.liu@xxxxxxxxxx> wrote: > >>>>>>>>> On 10/12/2023 8:01 PM, Jason Wang wrote: > >>>>>>>>>> On Tue, Oct 10, 2023 at 5:05 PM Si-Wei Liu <si-wei.liu@xxxxxxxxxx> wrote: > >>>>>>>>>>> Devices with on-chip IOMMU or vendor specific IOTLB implementation > >>>>>>>>>>> may need to restore iotlb mapping to the initial or default state > >>>>>>>>>>> using the .reset_map op, as it's desirable for some parent devices > >>>>>>>>>>> to solely manipulate mappings by its own, independent of virtio device > >>>>>>>>>>> state. For instance, device reset does not cause mapping go away on > >>>>>>>>>>> such IOTLB model in need of persistent mapping. Before vhost-vdpa > >>>>>>>>>>> is going away, give them a chance to reset iotlb back to the initial > >>>>>>>>>>> state in vhost_vdpa_cleanup(). > >>>>>>>>>>> > >>>>>>>>>>> Signed-off-by: Si-Wei Liu <si-wei.liu@xxxxxxxxxx> > >>>>>>>>>>> --- > >>>>>>>>>>> drivers/vhost/vdpa.c | 16 ++++++++++++++++ > >>>>>>>>>>> 1 file changed, 16 insertions(+) > >>>>>>>>>>> > >>>>>>>>>>> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c > >>>>>>>>>>> index 851535f..a3f8160 100644 > >>>>>>>>>>> --- a/drivers/vhost/vdpa.c > >>>>>>>>>>> +++ b/drivers/vhost/vdpa.c > >>>>>>>>>>> @@ -131,6 +131,15 @@ static struct vhost_vdpa_as *vhost_vdpa_find_alloc_as(struct vhost_vdpa *v, > >>>>>>>>>>> return vhost_vdpa_alloc_as(v, asid); > >>>>>>>>>>> } > >>>>>>>>>>> > >>>>>>>>>>> +static void vhost_vdpa_reset_map(struct vhost_vdpa *v, u32 asid) > >>>>>>>>>>> +{ > >>>>>>>>>>> + struct vdpa_device *vdpa = v->vdpa; > >>>>>>>>>>> + const struct vdpa_config_ops *ops = vdpa->config; > >>>>>>>>>>> + > >>>>>>>>>>> + if (ops->reset_map) > >>>>>>>>>>> + ops->reset_map(vdpa, asid); > >>>>>>>>>>> +} > >>>>>>>>>>> + > >>>>>>>>>>> static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) > >>>>>>>>>>> { > >>>>>>>>>>> struct vhost_vdpa_as *as = asid_to_as(v, asid); > >>>>>>>>>>> @@ -140,6 +149,13 @@ static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) > >>>>>>>>>>> > >>>>>>>>>>> hlist_del(&as->hash_link); > >>>>>>>>>>> vhost_vdpa_iotlb_unmap(v, &as->iotlb, 0ULL, 0ULL - 1, asid); > >>>>>>>>>>> + /* > >>>>>>>>>>> + * Devices with vendor specific IOMMU may need to restore > >>>>>>>>>>> + * iotlb to the initial or default state which is not done > >>>>>>>>>>> + * through device reset, as the IOTLB mapping manipulation > >>>>>>>>>>> + * could be decoupled from the virtio device life cycle. > >>>>>>>>>>> + */ > >>>>>>>>>> Should we do this according to whether IOTLB_PRESIST is set? > >>>>>>>>> Well, in theory this seems like so but it's unnecessary code change > >>>>>>>>> actually, as that is the way how vDPA parent behind platform IOMMU works > >>>>>>>>> today, and userspace doesn't break as of today. :) > >>>>>>>> Well, this is one question I've ever asked before. You have explained > >>>>>>>> that one of the reason that we don't break userspace is that they may > >>>>>>>> couple IOTLB reset with vDPA reset as well. One example is the Qemu. > >>>>>>>> > >>>>>>>>> As explained in previous threads [1][2], when IOTLB_PERSIST is not set > >>>>>>>>> it doesn't necessarily mean the iotlb will definitely be destroyed > >>>>>>>>> across reset (think about the platform IOMMU case), so userspace today > >>>>>>>>> is already tolerating enough with either good or bad IOMMU. > >>>>> I'm confused, how to define tolerating here? > >>>> Tolerating defined as QEMU has to proactively unmap before reset just to > >>>> workaround the driver bug (on-chip maps out of sync), unconditionally > >>>> for platform or on-chip. While we all know it doesn't have to do so for > >>>> platform IOMMU, though userspace has no means to distinguish. That said, > >>>> userspace is sacrificing reset time performance on platform IOMMU setup > >>>> just for working around buggy implementation in the other setup. > >>> Ok, so what you actually mean is that userspace can tolerate the "bug" > >>> with the performance penalty. > >> Right. > >>> > >>>>> For example, if it has tolerance, why bother? > >>>> I'm not sure I get the question. But I think userspace is compromising > >>>> because of buggy implementation in a few drivers doesn't mean we should > >>>> uniformly enforce such behavior for all set_map/dma_map implementations. > >>> This is not my point. I meant, we can fix we need a negotiation in > >>> order to let some "buggy" old user space to survive from the changes. > >> Userspace is no buggy today, how to define "buggy"? Userspace with > >> tolerance could survive just fine no matter if this negotiation or buggy > >> driver behavior emulation is around or not. If any userspace doesn't > >> tolerate, it can work still fine on good on-chip IOMMU or platform > >> IOMMU, no matter if the negotiation is around or not. > >>>>>>>> This code of > >>>>>>>>> not checking IOTLB_PERSIST being set is intentional, there's no point to > >>>>>>>>> emulate bad IOMMU behavior even for older userspace (with improper > >>>>>>>>> emulation to be done it would result in even worse performance). > >>>>> I can easily imagine a case: > >>>>> > >>>>> The old Qemu that works only with a setup like mlx5_vdpa. > >>>> Noted, seems to me there's no such case of a userspace implementation > >>>> that only works with mlx5_vdpa or its friends, but doesn't work with the > >>>> others e.g. platform IOMMU, or well behaving on-chip IOMMU > >>>> implementations. > >>> It's not hard to think of a case where: > >>> > >>> 1) the environment has mlx5_vdpa only > >>> 2) kernel doc can't have endless details, so when developing > >>> application, the author notice IOTLB is cleared during reset > >> I get it, but my question was that, even if the author had noticed IOTLB > >> is cleared during reset, does he care or not to make IOTLB back working > >> again? My point is that, if this old setup is supposed to "work" on > >> mlx5_vdpa, then the developer must come up with sort of "quirk" to > >> recover the IOTLB to make it back to working state again after the > >> reset. It will be more justified to come up with the proper fix for > >> compatibility/emulation only until we know what should be expected to > >> work and through which possible means to making it back to work, rather > >> than blindly emulate the buggy behavior solely based on a few driver's > >> own implementation. I'm pretty sure there are multiple ways to implement > >> the buggy reset behavior in the driver, does it mean we have to emulate > >> various corrupted mapping states in the individual on-chip iommu itself? > >> How is it able to help the developer user if we are able to replicate > >> the same corrupted mapping state in the on-chip iommu after reset, any > >> real-life user only cares about mapping being corrupted in the same way, > >> rather than cares more about the quirk sequence or work around to get > >> iotlb maps out of the broken state? > >> > >> Only if the userspace is like a test facility to expect some test case > >> to fail on mlx5_vdpa after reset -- I assume that is not real-life user > >> at all. > >>>> The Unmap+remap trick around vdpa reset works totally > >>>> fine for platform IOMMU, except with sub-optimal performance. Other than > >>>> this trick, I cannot easily think of other means or iotlb message > >>>> sequence for userspace to recover the bogus state and make iotlb back to > >>>> work again after reset. > >>> Yes for sure, but we can't audit every user space, no? > >> We don't have to, as userspace here has no bug at all. The bug exists in > >> the driver not in userspace. Real life userspace app only cares about > >> making things work not asserting something must be broken. > >>>> Are we talking about hypnosis that has no real > >>>> basis to exist in the real world? > >>> Instead of trying to answer these hard questions, I would go another > >>> way. That is, stick to the old behaviour when IOTLB_PRESISIT is not > >>> set by the backend. This is much easier. > >> Please be noted the old (broken) behavior can vary between different > >> parent driver implementations. It's driver's specific own problem, if > >> there are N ways to for driver to implement buggy .reset, do we have to > >> emulate N flavors of different vdpa reset behavior? > >> > >>>>> If we do > >>>>> this without a negotiation, IOTLB will not be clear but the Qemu will > >>>>> try to re-program the IOTLB after reset. Which will break? > >>>>> > >>>>> 1) stick the exact old behaviour with just one line of check > >>>> It's not just one line of check here, the old behavior emulation has to > >>>> be done as Eugenio illustrated in the other email. > >>> For vhost-vDPA it's just > >>> > >>> if (IOTLB_PERSIST is acked by userspace) > >>> reset_map() > >>> > >>> For parent, it's somehow similar: > >>> > >>> during .reset() > >>> > >>> if (IOTLB_PERSIST is not acked by userspace) > >>> reset_vendor_mappings() > >>> > >>> Anything I missed here? > >> First, the ideal fix would be to leave this reset_vendor_mappings() > >> emulation code on the individual driver itself, which already has the > >> broken behavior. But today there's no backend feature negotiation > >> between vhost-vdpa and the parent driver. Do we want to send down the > >> acked_backend_features to parent drivers? > >> > > What if we add a module parameter to both mlx5 and vdpa_sim to keep > > the old behavior? Let's call it clean_iotlb_on_reset for now. > > > > In my opinion we can leave it off by default, so these userspace apps > > can get back to the previous behavior. It would be ideal if we set a > > deprecation date for it though. > > > > This way new backends, whether they implement .set_map or not, will > > have correct behavior. > > > > Would that work? > Great idea, this definitely will work! With this module parameter, > individual driver still keeps the possibility to revert to previous > buggy behavior were to unbreak old userspace, code can be obsoleted > independently per each driver's specific use case and need, and we don't > necessarily overload vdpa core with too much unwarranted compatibility > code. Thank you so much for the great suggestion, I disagree, module parameters have been proved to be very hard for management. And what I don't understand here is, once a module parameter can work, why not just replace it with the checking of IOTLB_PRESIST? Thanks > I will post a v3. > > Thanks, > -Siwei > > > > > Thanks! > > > >> Second, IOTLB_PERSIST is needed but not sufficient. Due to lack of > >> backend feature negotiation in parent driver, if vhost-vdpa has to > >> provide the old-behaviour emulation for compatibility on driver's > >> behalf, it needs to be done per-driver basis. There could be good > >> on-chip or vendor IOMMU implementation which doesn't clear the IOTLB in > >> .reset, and vendor specific IOMMU doesn't have to provide .reset_map, we > >> should allow these good driver implementations rather than > >> unconditionally stick to some specific problematic behavior for every > >> other good driver. Then we need a set of device flags (backend_features > >> bit again?) to indicate the specific driver needs upper layer's help on > >> old-behaviour emulation. > >> > >> Last but not least, I'm not sure how to properly emulate > >> reset_vendor_mappings() from vhost-vdpa layer. If a vendor driver has no > >> .reset_map op implemented, or if .reset_map has a slightly different > >> implementation than what it used to reset the iotlb in the .reset op, > >> then this either becomes effectively dead code if no one ends up using, > >> or the vhost-vdpa emulation is helpless and limited in scope, unable to > >> cover all the cases. > >> > >> > >>>> In addition, the > >>>> emulation has to limit to those buggy drivers as I don't feel this > >>>> emulation should apply uniformly to all future set_map/dma_map > >>>> implementations. > >>> Unfortunately, it's a must to stick to ABI. > >> How come this brokenness in mlx5_vdpa becomes ABI in any sort for future > >> on-chip IOMMU drivers? They might not even exist yet. Even if it's > >> concerning ABI it's limited to mlx5_vdpa and the existing drivers, right? > >> > >>> I agree it's a mess but we don't have a better choice. > >> Well, it's your call, I can implement as you wish but the unwarranted > >> code has to be maintained forever. Particularly without knowing if > >> there's really such a use case in real life, and no one in future might > >> dare to remove the code without knowing what it can be used for. > >> > >>> Or we can fail the probe if userspace > >>> doesn't ack this feature. > >> Fail probing is even worse choice that is introducing intrusive breakage > >> to the userspace. > >>>>> 2) audit all the possible cases to avoid a one line of code > >>>>> > >>>>> 1) seems much easier than 2) > >>>> You see it's more than just one line of code, and I'm uncertain if the > >>>> additional complexity is warranted or necessary, particularly if added > >>>> this piece of compatibility code will linger for quite a long time. > >>> This is a must as long as it can be noticed by userspace. Doing > >>> something conservative makes more sense to me. > >>> > >>>> Instead of adding hypothetical code change for no specific good reason > >>>> and no real use case, > >>> It's not adding something new or new behaviours, it's just making the > >>> IOTLB reset conditional based on vDPA reset. > >>> > >>>> I'd like to add the code when we find out a > >>>> specific use case that may get impacted or already being affected, > >>> It doesn't conflict with what you proposed here. Old behaviours have > >>> their users, no? > >> We don't know the use case how to make thing work instead of make thing > >> break, that is the problem. We have no way to test if old-behaviour > >> preserving code really works as expected. If there's no such user in > >> practice, it ends up with dead code no one dares to remove. > >>>> then > >>>> we will have good understanding how to code up the fix and emulate > >>>> properly for compatibility, while not affecting other good implementations. > >>> The issue is, even if we can't find a userspace now. It doesn't mean > >>> we can't have one in the future. Then it might be too late or too > >>> tricky to fix them. We had a lot of lessons in the past. > >> I am not sure the same situation "too late to fix" or "too tricky to > >> fix" applies here. Usually this means there's some well established > >> pattern for e.g. API, ABI or long standing de-factor behavior that can't > >> be broken or adjust if trying to fix something up. But here we're > >> guarded by a flag (IOTLB_PERSIST) and without it the behavior is totally > >> ruled by implementation. > >> > >> Regards, > >> -Siwei > >> > >>> Thanks > >>> > >>>> Thanks, > >>>> -Siwe/i/ > >>>> > >>>>>>>> For two reasons: > >>>>>>>> > >>>>>>>> 1) backend features need acked by userspace this is by design > >>>>>>>> 2) keep the odd behaviour seems to be more safe as we can't audit > >>>>>>>> every userspace program > >>>>>>>> > >>>>>>> The old behavior (without flag ack) cannot be trusted already, as: > >>>>> Possibly but the point is to unbreak userspace no matter how weird the > >>>>> behaviour we've ever had. > >>>>> > >>>>>>> * Devices using platform IOMMU (in other words, not implementing > >>>>>>> neither .set_map nor .dma_map) does not unmap memory at virtio reset. > >>>>>>> * Devices that implement .set_map or .dma_map (vdpa_sim, mlx5) do > >>>>>>> reset IOTLB, but in their parent ops (vdpasim_do_reset, prune_iotlb > >>>>>>> called from mlx5_vdpa_reset). With vdpa_sim patch removing the reset, > >>>>>>> now all backends work the same as far as I know., which was (and is) > >>>>>>> the way devices using the platform IOMMU works. > >>>>>>> > >>>>>>> The difference in behavior did not matter as QEMU unmaps all the > >>>>>>> memory unregistering the memory listener at vhost_vdpa_dev_start(..., > >>>>>>> started = false), > >>>>>> Exactly. It's not just QEMU, but any (older) userspace manipulates > >>>>>> mappings through the vhost-vdpa iotlb interface has to unmap all > >>>>>> mappings to workaround the vdpa parent driver bug. > >>>>> Just to clarify, from userspace, it's the (odd) behaviour of the current uAPI. > >>>>> > >>>>>> If they don't do > >>>>>> explicit unmap, it would cause state inconsistency between vhost-vdpa > >>>>>> and parent driver, then old mappings can't be restored, and new mapping > >>>>>> can be added to iotlb after vDPA reset. There's no point to preserve > >>>>>> this broken and inconsistent behavior between vhost-vdpa and parent > >>>>>> driver, as userspace doesn't care at all! > >>>>> It's a userspace notice change so we can't fix it silently: > >>>>> > >>>>> https://lkml.org/lkml/2012/12/23/75 > >>>>> > >>>>> Another example which is related to vhost-vDPA: > >>>>> > >>>>> https://lore.kernel.org/netdev/20230927140544.205088-1-eric.auger@xxxxxxxxxx/T/ > >>>>> > >>>>> Thanks > >>>>> > >>>>>>> but the backend acknowledging this feature flag > >>>>>>> allows QEMU to make sure it is safe to skip this unmap & map in the > >>>>>>> case of vhost stop & start cycle. > >>>>>>> > >>>>>>> In that sense, this feature flag is actually a signal for userspace to > >>>>>>> know that the bug has been solved. > >>>>>> Right, I couldn't say it better than you do, thanks! The feature flag is > >>>>>> more of an unusual means to indicating kernel bug having been fixed, > >>>>>> rather than introduce a new feature or new kernel behavior ending up in > >>>>>> change of userspace's expectation. > >>>>>> > >>>>>>> Not offering it indicates that > >>>>>>> userspace cannot trust the kernel will retain the maps. > >>>>>>> > >>>>>>> Si-Wei or Dragos, please correct me if I've missed something. Feel > >>>>>>> free to use the text in case you find more clear in doc or patch log. > >>>>>> Sure, will do, thank you! Will post v2 adding these to the log. > >>>>>> > >>>>>> Thanks, > >>>>>> -Siwei > >>>>>> > >>>>>> > >>>>>> > >>>>>>> Thanks! > >>>>>>> > >>>>>>>> Thanks > >>>>>>>> > >>>>>>>>> I think > >>>>>>>>> the purpose of the IOTLB_PERSIST flag is just to give userspace 100% > >>>>>>>>> certainty of persistent iotlb mapping not getting lost across vdpa reset. > >>>>>>>>> > >>>>>>>>> Thanks, > >>>>>>>>> -Siwei > >>>>>>>>> > >>>>>>>>> [1] > >>>>>>>>> https://lore.kernel.org/virtualization/9f118fc9-4f6f-dd67-a291-be78152e47fd@xxxxxxxxxx/ > >>>>>>>>> [2] > >>>>>>>>> https://lore.kernel.org/virtualization/3364adfd-1eb7-8bce-41f9-bfe5473f1f2e@xxxxxxxxxx/ > >>>>>>>>>> Otherwise > >>>>>>>>>> we may break old userspace. > >>>>>>>>>> > >>>>>>>>>> Thanks > >>>>>>>>>> > >>>>>>>>>>> + vhost_vdpa_reset_map(v, asid); > >>>>>>>>>>> kfree(as); > >>>>>>>>>>> > >>>>>>>>>>> return 0; > >>>>>>>>>>> -- > >>>>>>>>>>> 1.8.3.1 > >>>>>>>>>>> > _______________________________________________ Virtualization mailing list Virtualization@xxxxxxxxxxxxxxxxxxxxxxxxxx https://lists.linuxfoundation.org/mailman/listinfo/virtualization