On 7/26/2024 3:20 PM, David Hildenbrand wrote:
> On 26.07.24 08:20, Chenyi Qiang wrote:
>>
>>
>> On 7/25/2024 10:04 PM, David Hildenbrand wrote:
>>>> Open
>>>> ====
>>>> Implementing a RamDiscardManager to notify VFIO of page conversions
>>>> causes changes in semantics: private memory is treated as discarded (or
>>>> hot-removed) memory. This isn't aligned with the expectation of current
>>>> RamDiscardManager users (e.g. VFIO or live migration) who really
>>>> expect that discarded memory is hot-removed and thus can be skipped
>>>> when
>>>> the users are processing guest memory. Treating private memory as
>>>> discarded won't work in future if VFIO or live migration needs to
>>>> handle
>>>> private memory. e.g. VFIO may need to map private memory to support
>>>> Trusted IO and live migration for confidential VMs need to migrate
>>>> private memory.
>>>
>>> "VFIO may need to map private memory to support Trusted IO"
>>>
>>> I've been told that the way we handle shared memory won't be the way
>>> this is going to work with guest_memfd. KVM will coordinate directly
>>> with VFIO or $whatever and update the IOMMU tables itself right in the
>>> kernel; the pages are pinned/owned by guest_memfd, so that will just
>>> work. So I don't consider that currently a concern. guest_memfd private
>>> memory is not mapped into user page tables and as it currently seems it
>>> never will be.
>>
>> That's correct. AFAIK, some TEE IO solution like TDX Connect would let
>> kernel coordinate and update private mapping in IOMMU tables. Here, It
>> mentions that VFIO "may" need map private memory. I want to make this
>> more generic to account for potential future TEE IO solutions that may
>> require such functionality. :)
>
> Careful to not over-enginner something that is not even real or
> close-to-be-real yet, though. :) Nobody really knows who that will look
> like, besides that we know for Intel that we won't need that.

OK, thanks for the reminder!

>
>>
>>>
>>> Similarly: live migration. We cannot simply migrate that memory the
>>> traditional way. We even have to track the dirty state differently.
>>>
>>> So IMHO, treating both memory as discarded == don't touch it the usual
>>> way might actually be a feature not a bug ;)
>>
>> Do you mean treating the private memory in both VFIO and live migration
>> as discarded? That is what this patch series does. And as you mentioned,
>> these RDM users cannot follow the traditional RDM way. Because of this,
>> we also considered whether we should use RDM or a more generic mechanism
>> like notifier_list below.
>
> Yes, the shared memory is logically discarded. At the same time we
> *might* get private memory effectively populated. See my reply to Kevin
> that there might be ways of having shared vs. private populate/discard
> in the future, if required. Just some idea, though.
>
>>
>>>
>>>>
>>>> There are two possible ways to mitigate the semantics changes.
>>>> 1. Develop a new mechanism to notify the page conversions between
>>>> private and shared. For example, utilize the notifier_list in QEMU.
>>>> VFIO
>>>> registers its own handler and gets notified upon page conversions. This
>>>> is a clean approach which only touches the notifier workflow. A
>>>> challenge is that for device hotplug, existing shared memory should be
>>>> mapped in IOMMU. This will need additional changes.
>>>>
>>>> 2. Extend the existing RamDiscardManager interface to manage not only
>>>> the discarded/populated status of guest memory but also the
>>>> shared/private status.
>>>> RamDiscardManager users like VFIO will be
>>>> notified with one more argument indicating what change is happening and
>>>> can take action accordingly. It also has challenges e.g. QEMU allows
>>>> only one RamDiscardManager, how to support virtio-mem for confidential
>>>> VMs would be a problem. And some APIs like .is_populated() exposed by
>>>> RamDiscardManager are meaningless to shared/private memory. So they may
>>>> need some adjustments.
>>>
>>> Think of all of that in terms of "shared memory is populated, private
>>> memory is some inaccessible stuff that needs very special way and other
>>> means for device assignment, live migration, etc.". Then it actually
>>> quite makes sense to use of RamDiscardManager (AFAIKS :) ).
>>
>> Yes, such notification mechanism is what we want. But for the users of
>> RDM, it would require additional change accordingly. Current users just
>> skip inaccessible stuff, but in private memory case, it can't be simply
>> skipped. Maybe renaming RamDiscardManager to RamStateManager is more
>> accurate then. :)
>
> Current users must skip it, yes. How private memory would have to be
> handled, and who would handle it, is rather unclear.
>
> Again, maybe we'd want separate RamDiscardManager for private and shared
> memory (after all, these are two separate memory backends).

We also considered distinguishing the populate and discard operations
for private and shared memory separately. As in method 2 above, we
mentioned adding a new argument to indicate which memory attribute the
operation applies to. The two ideas seem similar.

>
> Not sure that "RamStateManager" terminology would be reasonable in that
> approach.
>
>>
>>>
>>>>
>>>> Testing
>>>> =======
>>>> This patch series is tested based on the internal TDX KVM/QEMU tree.
>>>>
>>>> To facilitate shared device assignment with the NIC, employ the legacy
>>>> type1 VFIO with the QEMU command:
>>>>
>>>> qemu-system-x86_64 [...]
>>>>  -device vfio-pci,host=XX:XX.X
>>>>
>>>> The parameter of dma_entry_limit needs to be adjusted. For example, a
>>>> 16GB guest needs to adjust the parameter like
>>>> vfio_iommu_type1.dma_entry_limit=4194304.
>>>
>>> But here you note the biggest real issue I see (not related to
>>> RAMDiscardManager, but that we have to prepare for conversion of each
>>> possible private page to shared and back): we need a single IOMMU
>>> mapping for each 4 KiB page.
>>>
>>> Doesn't that mean that we limit shared memory to 4194304*4096 == 16 GiB.
>>> Does it even scale then?
>>
>> The entry limitation needs to be increased as the guest memory size
>> increases. For this issue, are you concerned that having too many
>> entries might bring some performance issue? Maybe we could introduce
>> some PV mechanism to coordinate with guest to convert memory only in 2M
>> granularity. This may help mitigate the problem.
>
> I've had this talk with Intel, because the 4K granularity is a pain. I
> was told that ship has sailed ... and we have to cope with random 4K
> conversions :(
>
> The many mappings will likely add both memory and runtime overheads in
> the kernel. But we only know once we measure.

In the normal case, the main runtime overhead comes from
private<->shared flips in SWIOTLB, which defaults to 6% of memory with
a maximum of 1 GB. I think this overhead is acceptable. In non-default
cases, e.g. dynamically allocated DMA buffers, the runtime overhead
will increase. As for the memory overhead, it is indeed unavoidable.
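
To put rough numbers on the scaling concern (illustrative arithmetic
only, assuming the worst case where every shared 4 KiB page ends up
with its own type1 mapping): 4194304 entries * 4 KiB = 16 GiB of shared
memory, so covering 1 TiB of shared memory would already take
268435456 entries, plus the kernel bookkeeping that goes with them.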
Will these performance issues be a deal breaker for enabling shared
device assignment in this way?

>
> Key point is that even 4194304 "only" allows for 16 GiB. Imagine 1 TiB
> of shared memory :/
>
>>
>>>
>>>
>>> There is the alternative of having in-place private/shared conversion
>>> when we also let guest_memfd manage some shared memory. It has plenty of
>>> downsides, but for the problem at hand it would mean that we don't
>>> discard on shared/private conversion.
>>>
>>> But whenever we want to convert memory shared->private we would
>>> similarly have to from IOMMU page tables via VFIO. (the in-place
>>> conversion will only be allowed if any additional references on a page
>>> are gone -- when it is inaccessible by userspace/kernel).
>>
>> I'm not clear about this in-place private/shared conversion. Can you
>> elaborate a little bit? It seems this alternative changes private and
>> shared management in current guest_memfd?
>
> Yes, there have been discussions about that, also in the context of
> supporting huge pages while allowing for the guest to still convert
> individual 4K chunks ...
>
> A summary is here [1]. Likely more things will be covered at Linux
> Plumbers.
>
>
> [1]
> https://lore.kernel.org/kvm/20240712232937.2861788-1-ackerleytng@xxxxxxxxxx/
>

Thanks for sharing.
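
P.S. Just to make "method 2" above a bit more concrete, here is a
rough, untested sketch of what an attribute-aware notification could
look like. The names below (RamState, RamStateListener,
notify_state_change) are made up for illustration -- this is not the
existing QEMU RamDiscardListener API, only the shape of the extra
information the listener would receive:

  #include <stdint.h>

  /* Which state a range of guest memory is in / moving to. */
  typedef enum {
      RAM_STATE_SHARED,     /* accessible; VFIO keeps IOMMU mappings  */
      RAM_STATE_PRIVATE,    /* guest_memfd private; invisible to VFIO */
      RAM_STATE_DISCARDED,  /* truly unplugged (virtio-mem style)     */
  } RamState;

  typedef struct RamStateListener RamStateListener;

  struct RamStateListener {
      /* Called when [offset, offset + size) changes state. */
      int (*notify_state_change)(RamStateListener *rsl,
                                 uint64_t offset, uint64_t size,
                                 RamState old_state, RamState new_state);
  };

A VFIO listener would map on transitions to RAM_STATE_SHARED and unmap
on transitions away from it, while live migration could hook the same
callback to keep its view of accessible memory in sync -- roughly the
"RamStateManager" idea mentioned above.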