On Tue, Nov 16, 2021 at 10:57:36AM -0700, Alex Williamson wrote:

> > I think userspace should decide if it wants to use mlx5 built in or
> > the system IOMMU to do dirty tracking.
>
> What information does userspace use to inform such a decision?

Kernel can't know which approach performs better. Operators should
benchmark and make a choice for their deployment HW. Maybe device
tracking severely impacts device performance or vice versa.

Kernel doesn't easily know what userspace has done, maybe one device
supports migration driver dirty tracking and one device does not.

Is user space going to use a system IOMMU for both devices?

Is it going to put the simple device in NDMA early and continue to
dirty track to shutdown the other devices?

> Ultimately userspace just wants the finest granularity of tracking,
> shouldn't that guide our decisions which to provide?

At least for mlx5 there is going to be some trade-off curve of device
performance, dirty tracking page size, and working set.

Even "lower is better" is not necessarily true. After overheads on a
400GB RDMA NIC there is not such a big difference between doing a 4k
and 16k scatter transfer. The CPU work to process all the extra bitmap
data may not be a net win compared to block transfer times.

Conversely, someone doing 1G TCP transfers probably cares a lot about
minimizing block size.

Overall, I think there is far too much up in the air and unmeasured to
firmly commit the kernel to a fixed policy.

So, I would like to see userspace control most of the policy aspects,
including the dirty track provider.

> I believe the intended progression of dirty tracking is that by default
> all mapped ranges are dirty.  If the device supports page pinning, then
> we reduce the set of dirty pages to those pages which are pinned.  A
> device that doesn't otherwise need page pinning, such as a fully IOMMU

How does userspace know if dirty tracking works or not? All I see is
that VFIO_IOMMU_DIRTY_PAGES_FLAG_START unconditionally allocs some
bitmaps.
I'm surprised it doesn't check that only NO_IOMMU devices are attached
to the container and refuse to dirty track otherwise - since it
doesn't work..

> backed device, would use gratuitous page pinning triggered by the
> _SAVING state activation on the device.  It sounds like mlx5 could use
> this existing support today.

How does mlx5 know if it should turn on its dirty page tracking on
SAVING or if the system IOMMU covers it? Or for some reason userspace
doesn't want dirty tracking but is doing pre-copy?

When we mix dirty track with pre-copy, the progression seems to be:

  DIRTY TRACKING | RUNNING
     Copy every page to the remote
  DT | SAVING | RUNNING
     Copy pre-copy migration data to the remote
  SAVING | NDMA | RUNNING
     Read and clear dirty track device bitmap
  DT | SAVING | RUNNING
     Copy new dirtied data
     (maybe loop back to NDMA a few times?)
  SAVING | NDMA | RUNNING
     P2P grace state
  0
     Read the dirty track and copy data
     Read and send the migration state

Can we do something so complex using only SAVING?

.. and along the lines of the above how do we mix in NDMA to the iommu
container, and how does it work if only some devices support NDMA?

> We had also discussed variants to page pinning that might be more
> useful as device dirty page support improves.  For example calls to
> mark pages dirty once rather than the perpetual dirtying of pinned
> pages, calls to pin pages for read vs write, etc.  We didn't dive much
> into system IOMMU dirtying, but presumably we'd have a fault handler
> triggered if a page is written by the device and go from there.

Would be interesting to know for sure what current IOMMU HW has
done. I'm supposing the easiest implementation is to write a dirty bit
to the IO PTE the same as the CPU writes a dirty bit to the normal PTE.

> > In light of all this I'm wondering if device dirty tracking should
> > exist as new ioctls on the device FD and reserve the type1 code to
> > only work the IOMMU dirty tracking.
> >
> Our existing model is working towards the IOMMU, ie. container,
> interface aggregating dirty page context.

This creates inefficiencies in the kernel: we copy from the mlx5
formed data structure to new memory in the iommu through a very
inefficient API, and then again we do an ioctl to copy it once more
and throw all the extra work away. It does not seem good for something
where we want performance.

> For example when page pinning is used, it's only when all devices
> within the container are using page pinning that we can report the
> pinned subset as dirty.  Otherwise userspace needs to poll each
> device, which I suppose enables your idea that userspace decides
> which source to use, but why?

Efficiency, and user selectable policy.

Userspace can just allocate an all zeros bitmap and feed it to each of
the providers in the kernel using a 'or in your dirty' semantic.

No redundant kernel data marshaling, userspace gets to decide which
tracking provider to use, and it is simple to implement in the kernel.

Userspace has to do this anyhow if it has configurations with multiple
containers. For instance because it was forced to split the containers
due to one device not supporting NDMA.

> Does the IOMMU dirty page tracking exclude devices if the user
> queries the device separately?

What makes sense to me is multiple tracking providers. Each can be
turned on and off.

If the container tracking provider says it supports tracking then it
means it can track DMA from every device it is connected to (unlike
today?). eg by using IOMMU HW that naturally does this, or by having
only NO_IOMMU devices.

If the migration driver says it supports tracking, then it only tracks
DMA from that device.

> How would it know?  What's the advantage?  It seems like this
> creates too many support paths that all need to converge on the same
> answer.  Consolidating DMA dirty page tracking to the DMA mapping
> interface for all devices within a DMA context makes more sense to
> me.

What I see is a lot of questions and limitations with this
approach. If we stick to funneling everything through the iommu then
answering the questions seems to create a large amount of kernel
work. Enough to ask if it is worthwhile..

.. and then we have to ask how does this all work in IOMMUFD where it
is not so reasonable to tightly couple the migration driver and the
IOAS and I get more questions :)

Jason
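
To make the 'or in your dirty' semantic mentioned above a bit more
concrete, here is a rough userspace-style sketch in plain C. It is only
a model: struct dirty_provider, provider_mark_dirty(),
provider_read_and_clear() and collect_dirty() are hypothetical names
made up for illustration, not an existing VFIO or IOMMUFD interface,
and the fixed 256-page range is an arbitrary assumption. The point it
tries to show is that each provider only ORs bits into a bitmap the
caller owns, so the caller can mix any set of providers without the
kernel having to merge their data.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NPAGES       256              /* assumed size of the tracked range */
#define BITMAP_WORDS (NPAGES / 64)

/* Hypothetical tracking provider (system IOMMU, migration driver, ...). */
struct dirty_provider {
	const char *name;
	uint64_t dirty[BITMAP_WORDS]; /* stand-in for HW/driver dirty state */
};

/* Simulate a DMA write to 'page' being observed by this provider. */
static void provider_mark_dirty(struct dirty_provider *p, size_t page)
{
	p->dirty[page / 64] |= UINT64_C(1) << (page % 64);
}

/*
 * "OR in your dirty": add this provider's bits to the caller's bitmap
 * without clearing bits set by other providers, then reset the
 * provider's own state (read-and-clear).
 */
static void provider_read_and_clear(struct dirty_provider *p,
				    uint64_t *bitmap, size_t words)
{
	for (size_t i = 0; i < words; i++) {
		bitmap[i] |= p->dirty[i];
		p->dirty[i] = 0;
	}
}

/*
 * Caller-side policy: start from an all zeros bitmap, feed it to every
 * provider the caller chose to enable, and return the number of dirty
 * pages found.
 */
static size_t collect_dirty(struct dirty_provider **provs, size_t n,
			    uint64_t *bitmap, size_t words)
{
	size_t count = 0;

	memset(bitmap, 0, words * sizeof(*bitmap));
	for (size_t i = 0; i < n; i++)
		provider_read_and_clear(provs[i], bitmap, words);

	for (size_t i = 0; i < words; i++)
		for (uint64_t w = bitmap[i]; w; w &= w - 1)
			count++;
	return count;
}
```

In this model a page reported by two providers is counted once, and
leaving a provider out of the array passed to collect_dirty() is how
the caller would express its choice of tracking provider.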