On Tue, 16 Nov 2021 21:48:31 -0400 Jason Gunthorpe <jgg@xxxxxxxxxx> wrote: > On Tue, Nov 16, 2021 at 02:10:31PM -0700, Alex Williamson wrote: > > On Tue, 16 Nov 2021 15:25:05 -0400 > > Jason Gunthorpe <jgg@xxxxxxxxxx> wrote: > > > > > On Tue, Nov 16, 2021 at 10:57:36AM -0700, Alex Williamson wrote: > > > > > > > > I think userspace should decide if it wants to use mlx5 built in or > > > > > the system IOMMU to do dirty tracking. > > > > > > > > What information does userspace use to inform such a decision? > > > > > > Kernel can't know which approach performs better. Operators should > > > benchmark and make a choice for their deployment HW. Maybe device > > > tracking severely impacts device performance or vice versa. > > > > I'm all for keeping policy decisions out of the kernel, but it's pretty > > absurd to expect a userspace operator to benchmark various combination > > and wire various knobs through the user interface for this. > > Huh? This is the standard operating procedure in netdev land, mm and > others. Max performance requires tunables. And here we really do care > alot about migration time. I've seen the mm complexity supporting the > normal vCPU migration to know this is a high priority for many people. Per previous reply: "Maybe start with what uAPI visible knobs really make sense and provide a benefit for per-device dirty bitmap tuning and how a device agnostic userspace like QEMU is going to make intelligent decisions about those knobs." > > It seems to me that the kernel, ie. the vfio variant driver, *can* > > know the best default. > > If the kernel can have an algorithm to find the best default then qemu > can implement the same algorithm too. I'm also skeptical about this > claim the best is knowable. QEMU is device agnostic and has no visibility to the system IOMMU capabilities. > > > Kernel doesn't easily know what userspace has done, maybe one device > > > supports migration driver dirty tracking and one device does not. > > > > And that's exactly why the current type1 implementation exposes the > > least common denominator to userspace, ie. pinned pages only if all > > devices in the container have enabled this degree of granularity. > > The current implementation forces no dirty tracking. That is a policy > choice that denies userspace the option to run one device with dirty > track and simply suspend the other. "Forces no dirty tracking"? I think you're suggesting that if a device within an IOMMU context doesn't support dirty tracking that we're forced to always report all mappings within that context as dirty, because for some reason we're not allowed to evolve how devices can interact with a shared dirty context, but we are allowed to invent a new per-device uAPI? > > > So, I would like to see userspace control most of the policy aspects, > > > including the dirty track provider. > > > > This sounds like device specific migration parameter tuning via a > > devlink interface to me, tbh. How would you propose a generic > > vfio/iommufd interface to tune this sort of thing? > > As I said, if each page track provider has its own interface it is > straightforward to make policy in userspace. The only tunable beyond > that is the dirty page tracking granularity. That is already > provided by userspace, but not as an parameter during START. > > I don't see why we'd need something as complicated as devlink just > yet. The approaches aren't that different. It seems like you want userspace to be able to select which devices to get dirty page info from while I'm suggesting that we can develop ways that the user can manage how the device interacts with a shared dirty page state. Per-device dirty page state doesn't preclude a shared dirty state, we still expect that any IOMMU interface is the source of truth when we have multiple devices sharing an IOMMU context. What if we have a scenario where devices optionally have per device dirty page tracking. A) mlx5 w/ device level dirty page tracking, vGPU w/ page pinning vs B) mlx5 w/ device level dirty page tracking, NIC with system IOMMU dirty page tracking Compare and contrast in which case the shared IOMMU context dirty page tracking reports both device's versus only the devices without per-device tracking. Is this obvious to both userspace and the shared dirty page tracking if it's the same or different in both cases? > > > How does userspace know if dirty tracking works or not? All I see > > > VFIO_IOMMU_DIRTY_PAGES_FLAG_START unconditionally allocs some bitmaps. > > > > IIRC, it's always supported by type1. In the worst case we always > > report all mapped pages as dirty. > > Again this denies userspace a policy choice. Why do dirty tracking > gyrations if they don't work? Just directly suspend the devices and > then copy. As above, it seems like you're freezing one interface and allowing the other to evolve. We can create ways that a device without dirty tracking can be marked as not generating DMA within a shared dirty page context. If maybe what you're really trying to get at all along is visibility to the per-device dirty page contribution, that does sound interesting. But I also don't know how that interacts with system IOMMU based reporting. Does the IOMMU report to the driver that reports to userspace the per-device dirty pages? > > > I'm surprised it doesn't check that only NO_IOMMU's devices are > > > attached to the container and refuse to dirty track otherwise - since > > > it doesn't work.. > > > > No-IOMMU doesn't use type1, the ioctl returns errno. > > Sorry, I mistyped that, I ment emulated iommu, as HCH has called it: > > vfio_register_emulated_iommu_dev() Ok, so that's essentially the vfio IOMMU driver just keeping track of userspace mappings in order to provide page pinning, where a device supporting page pinning is our first step in more fine grained dirty tracking. Type1 keeps track of pages pinned across all devices sharing the IOMMU container, thus we have container based dirty page reporting. > > > When we mix dirty track with pre-copy, the progression seems to be: > > > > > > DITRY TRACKING | RUNNING > > > Copy every page to the remote > > > DT | SAVING | RUNNING > > > Copy pre-copy migration data to the remote > > > SAVING | NDMA | RUNNING > > > Read and clear dirty track device bitmap > > > DT | SAVING | RUNNING > > > Copy new dirtied data > > > (maybe loop back to NDMA a few times?) > > > SAVING | NDMA | RUNNING > > > P2P grace state > > > 0 > > > Read the dirty track and copy data > > > Read and send the migration state > > > > > > Can we do something so complex using only SAVING? > > > > I'm not demanding that triggering device dirty tracking on saving is > > how this must be done, I'm only stating that's an idea that was > > discussed. If we need more complicated triggers between the IOMMU and > > device, let's define those, but I don't see that doing so negates the > > benefits of aggregated dirty bitmaps in the IOMMU context. > > Okay. As far as your request to document things as we seem them > upcoming I belive we should have some idea how dirty tracking control > fits in. I agree that is not related to how the bitmap is reported. We > will continue to think about dirty tracking as not connected to > SAVING. Ok. > > > This creates inefficiencies in the kernel, we copy from the mlx5 > > > formed data structure to new memory in the iommu through a very > > > ineffficent API and then again we do an ioctl to copy it once more and > > > throw all the extra work away. It does not seem good for something > > > where we want performance. > > > > So maybe the dirty bitmaps for the IOMMU context need to be exposed to > > and directly modifiable by the drivers using atomic bitmap ops. Maybe > > those same bitmaps can be mmap'd to userspace. These bitmaps are not > > insignificant, do we want every driver managing their own copies? > > If we look at mlx5 for example there is no choice. The dirty log is in > some device specific format and is not sharable. We must allocate > memory to work with it. > > What I don't need is the bitmap memory in the iommu container, that is > all useless for mlx5. > > So, down this path we need some API for the iommu context to not > allocate its memory at all and refer to storage from the tracking > provider for cases where that makes sense. That depends whether there are other devices in the context and if the container dirty context is meant to include all devices or if the driver is opt'ing out of the shared tracking... if that's a thing. Alternatively, drivers could register callbacks to report their dirty pages into the shared tracking for ranges requested by the user. We need to figure out how per-device tracking an system IOMMU tracking get along if that's where we're headed. > > > Userspace has to do this anyhow if it has configurations with multiple > > > containers. For instance because it was forced to split the containers > > > due to one device not supporting NDMA. > > > > Huh? When did that become a requirement? > > It is not a requirement, it is something userspace can do, if it > wants. And we talked about this, if NDMA isn't supported the P2P can't > work, and a way to enforce that is to not include P2P in the IOMMU > mapping. Except if you have an asymmetric NDMA you may want an > asymmetric IOMMU mapping too where the NDMA devices can do P2P and the > others don't. That is two containers and two dirty bitmaps. I'm not sure how I was supposed to infer that use case from "...forced to split the containers due to one device not support NDMA". That's certainly an option, but not a requirement. In fact, if QEMU were to do something like that, then we have some devices that can do p2p and some devices that cannot... all or none seems like a much more deterministic choice for QEMU. How do guest drivers currently test p2p? > Who knows, it isn't the job of the kernel to make these choices, the > kernel just provides tools. Agreed, but I don't see that userspace choosing to use separate contexts either negates the value of the kernel aggregating dirty pages within a container or clearly makes the case for per-device dirty pages. > > I feel like there are a lot of excuses listed here, but nothing that > > really demands a per device interface, > > "excuses" and "NIH" is a bit rude Alex. > > From my side, you are asking for a lot work from us (who else?) to > define and implement a wack of missing kernel functionality. > > I think it is very reasonable to ask what the return to the community > is for this work. "makes more sense to me" is not something that > is really compelling. > > So, I'd rather you tell me concretely why doing this work, in this > way, is a good idea. On my end, we have a defined dirty page tracking interface at the IOMMU context with some, admittedly rough, plans on how we intend to evolve that interface for both device level and IOMMU level tracking feeding into a shared dirty page interfaces. The example scenarios presented mostly seem to have solutions within that design framework if we put a little effort into it. There are unanswered questions down either path, but one of those paths is the path we previously chose. The greater burden is to switch paths, so rather I need to understand why this is the better path, with due diligence to explore what the same looks like with the current design. It's only finally here in the thread that we're seeing some of the mlx5 implementation details that might favor a per-device solution, hints that per device page granularity might be useful, and maybe that exposing per-device dirty page footprints to userspace is underlying this change of course. > > > If the migration driver says it supports tracking, then it only tracks > > > DMA from that device. > > > > I don't see what this buys us. Userspace is only going to do a > > migration if all devices support the per device migration region. At > > that point we need the best representation of the dirty bitmap we can > > provide per IOMMU context. It makes sense to me to aggregate per > > device dirtying into that one context. > > Again, there are policy choices userspace can make, like just > suspending the device that doesn't do dirty tracking and continuing to > dirty track the ones that do. > > This might be very logical if a non-dirty tracking device is something > like IDXD that will just cause higher request latency and the dirty > tracking is mlx5 that will cause externally visable artifacts. > > My point is *we don't know* what people will want. > > I also think you are too focused on a one-size solution that fits into > a qemu sort of enteprise product. While I can agree with your position > relative to an enterprise style customer, NVIDIA has different > customers, many that have their own customized VMMs that are tuned for > their HW choices. For these customers I do like to see that the kernel > allows options, and I don't think it is appropriate to be so > dismissive of this perspective. So provide the justification I asked for previously and quoted above, what are the things we want to be able to tune that cannot be done through reasonable extensions of the current design? I'm not trying to be dismissive, I'm lacking facts and evidence of due diligence that the current design is incapable of meeting our goals. > > > What I see is a lot of questions and limitations with this > > > approach. If we stick to funneling everything through the iommu then > > > answering the questions seem to create a large amount of kernel > > > work. Enough to ask if it is worthwhile.. > > > > If we need a common userspace IOMMU subsystem like IOMMUfd that can > > handle driver page pinning, IOMMU faults, and dirty tracking, why does > > it suddenly become an unbearable burden to allow other means besides > > page pinning for a driver to relay DMA page writes? > > I can see some concrete reasons for iommufd, like it allows this code > to be shared between mutliple subsystems that need it. Its design > supports more functionality than the vfio container can. > > Here, I don't quite see it. If userspace does > > for (i = 0; i != num_trackers; i++) > ioctl(merge dirty bitmap, i, &bitmap) > > Or > ioctl(read diry bitmap, &bitmap) > > Hardly seems decisive. On the same order as "makes more sense to me" ;) > What bothers me is the overhead and kernel > complexity. > > If we split them it looks quite simple: > - new ioctl on vfio device to read & clear dirty bitmap > & extension flag to show this is supported > - DIRTY_TRACK migration state bit to start/stop Is this another device_state bit? > Simple logic that read is only possible in NDMA/!RUNNING > - new ioctl on vfio device to report dirty tracking capability flags: > - internal dirty track supported > - real dirty track through attached container supported > (only mdev can set this today) How does system IOMMU dirty tracking work? > Scenarios: > > a. If userspace has a mlx5 and a mdev then it does a for loop as above to > start and read the dirty bitmaps. > > b. If userspace has only mlx5 it reads one bitmap > > c. If userspace has mlx5 and some other PCI device it can activate > mlx5 and leave the container alone. Suspend the PCI device early. > Or directly give up on dirty track > > For funneling through the container ioctls.. Humm: > - Still track per device/container connection if internal/external > dirty track is supported. Add an new ioctl so userspace can have > this info for (c) > - For external dirty track have some ops callbacks for start/stop, > read bitmap and clear. (So, no migration state flag?) > - On start make the policy choice if the external/internal will be > used, then negotiate some uniform tracking size and iterate over > all externals to call start > - On read.. to avoid overheads iterate over the internal bitmap > and read ops on all external bitmaps and or them together > then copy to user. Just ignore NDMA and rely on userspace to > do it right? > - Clear iterates and zeros bitmaps > - Change logic to not allocate tracking bitmaps if no mdevs I don't understand what's being described here, I'm glad an attempt is being made to see what this might look like with the current interface, but at the same time the outline seems biased towards a complicated portrayal. It's a relatively simple progression, with no outside information the container exposes all mapped pages as dirty. Individual devices in the container can expose themselves as enlightened in various ways, initially we mark devices that pin pages as enlightened and when all devices in the container are enlightened we can reduce the reported bitmap. As devices add their own dirty tracking we can add interfaces to allow devices to register their status, mark dirty pages, and potentially poll devices to update their dirty pages against ranges requested by the user. A device that's suspended, for example in PCI D3 state, might mark itself as becoming enlightened and report no page dirtying, a device ioctl might allow a more explicit instruction for the same. As IOMMU dirty page tracking comes online, the IOMMU can essentially enlighten all the devices attached to the context. > a. works the same, kernel turns on both trackers > b. works the same, kernel turns on only mlx5 > c. Hum. We have to disable the 'no tracker report everything as > dirty' feature somehow so we can access only the mlx5 tracker > without forcing evey page seen as dirty. Needs a details Doesn't seem as complicated in my rendition. > And figure out how this relates to the ongoing iommufd project (which, > BTW, we have now invested a lot in too). > > I'm not as convinced as you are that the 2nd is obviously better, and > on principle I don't like avoidable policy choices baked in the > kernel. Userspace has the same degree of policy decision in my version, what's missing from my version is userspace visibility to the per device dirty footprint and the newly mentioned desire for per-device page granularity. > And I don't see that it makes userspace really much simpler anyhow. On > the other hand it looks like a lot more kernel work.. There's kernel work either way. > > OTOH, aggregating these features in the IOMMU reduces both overhead > > of per device bitmaps and user operations to create their own > > consolidated view. > > I don't understand this, funneling will not reduce overhead, at best > with some work we can almost break even by not allocating the SW > bitmaps. The moment we have more than one device that requires a bitmap, where the device doesn't have the visibility of the extent of the bitmap, we introduce both space and time overhead versus a shared bitmap that can be pre-allocated. What's being asked for here is a change from the current plan of record, that entails work and justification, including tangible benefits versus a fair exploration of how the same might work in the current design. Anyway, as Connie mentioned off-list, we're deep into a comment thread that's becoming more and more tangential to the original patch, I'm not sure how many relevant contributors are still following this, and it's probably best to propose a change in course to dirty bitmap tracking in a new thread. Thanks, Alex