Re: [RFC PATCH 0/6] Enable shared device assignment

On 7/26/2024 3:20 PM, David Hildenbrand wrote:
> On 26.07.24 08:20, Chenyi Qiang wrote:
>>
>>
>> On 7/25/2024 10:04 PM, David Hildenbrand wrote:
>>>> Open
>>>> ====
>>>> Implementing a RamDiscardManager to notify VFIO of page conversions
>>>> causes a change in semantics: private memory is treated as discarded
>>>> (or hot-removed) memory. This isn't aligned with the expectation of
>>>> current RamDiscardManager users (e.g. VFIO or live migration), who
>>>> really expect that discarded memory is hot-removed and thus can be
>>>> skipped when they are processing guest memory. Treating private memory
>>>> as discarded won't work in the future if VFIO or live migration needs
>>>> to handle private memory, e.g. VFIO may need to map private memory to
>>>> support Trusted IO, and live migration for confidential VMs needs to
>>>> migrate private memory.
>>>
>>> "VFIO may need to map private memory to support Trusted IO"
>>>
>>> I've been told that the way we handle shared memory won't be the way
>>> this is going to work with guest_memfd. KVM will coordinate directly
>>> with VFIO or $whatever and update the IOMMU tables itself right in the
>>> kernel; the pages are pinned/owned by guest_memfd, so that will just
>>> work. So I don't consider that currently a concern. guest_memfd private
>>> memory is not mapped into user page tables and as it currently seems it
>>> never will be.
>>
>> That's correct. AFAIK, some TEE IO solutions like TDX Connect would let
>> the kernel coordinate and update the private mappings in the IOMMU
>> tables. Here, it mentions that VFIO "may" need to map private memory. I
>> want to make this more generic to account for potential future TEE IO
>> solutions that may require such functionality. :)
> 
> Careful not to over-engineer something that is not even real or
> close-to-real yet, though. :) Nobody really knows what that will look
> like, besides that we know for Intel that we won't need that.

OK, thanks for the reminder!

> 
>>
>>>
>>> Similarly: live migration. We cannot simply migrate that memory the
>>> traditional way. We even have to track the dirty state differently.
>>>
>>> So IMHO, treating that memory as discarded in both cases == don't touch
>>> it the usual way, might actually be a feature, not a bug ;)
>>
>> Do you mean treating the private memory in both VFIO and live migration
>> as discarded? That is what this patch series does. And as you mentioned,
>> these RDM users cannot follow the traditional RDM way. Because of this,
>> we also considered whether we should use RDM or a more generic mechanism
>> like notifier_list below.
> 
> Yes, the shared memory is logically discarded. At the same time we
> *might* get private memory effectively populated. See my reply to Kevin
> that there might be ways of having shared vs. private populate/discard
> in the future, if required. Just some idea, though.
> 
>>
>>>
>>>>
>>>> There are two possible ways to mitigate the change in semantics.
>>>> 1. Develop a new mechanism to notify about page conversions between
>>>> private and shared, for example by utilizing the notifier_list in
>>>> QEMU. VFIO registers its own handler and gets notified upon page
>>>> conversions. This is a clean approach which only touches the notifier
>>>> workflow. A challenge is that for device hotplug, existing shared
>>>> memory should be mapped in the IOMMU. This will need additional
>>>> changes.
>>>>
>>>> 2. Extend the existing RamDiscardManager interface to manage not only
>>>> the discarded/populated status of guest memory but also the
>>>> shared/private status. RamDiscardManager users like VFIO will be
>>>> notified with one more argument indicating what change is happening
>>>> and can take action accordingly. This also has challenges, e.g. QEMU
>>>> allows only one RamDiscardManager, so how to support virtio-mem for
>>>> confidential VMs would be a problem. And some APIs like
>>>> .is_populated() exposed by RamDiscardManager are meaningless for
>>>> shared/private memory, so they may need some adjustment.
>>>
>>> Think of all of that in terms of "shared memory is populated, private
>>> memory is some inaccessible stuff that needs a very special way and
>>> other means for device assignment, live migration, etc.". Then it
>>> actually makes quite a lot of sense to use RamDiscardManager (AFAIKS :) ).
>>
>> Yes, such a notification mechanism is what we want. But for the users of
>> RDM, it would require additional changes accordingly. Current users just
>> skip the inaccessible stuff, but in the private memory case, it can't
>> simply be skipped. Maybe renaming RamDiscardManager to RamStateManager
>> would be more accurate then. :)
> 
> Current users must skip it, yes. How private memory would have to be
> handled, and who would handle it, is rather unclear.
> 
> Again, maybe we'd want separate RamDiscardManagers for private and shared
> memory (after all, these are two separate memory backends).

We also considered distinguishing the populate and discard operations for
private and shared memory separately. As in option 2 above, we mentioned
adding a new argument to indicate the memory attribute to operate on.
That seems to be a similar idea.
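
To make this more concrete, below is a rough sketch of the kind of
attribute-aware notification we have in mind. All names in it are made up
for illustration (it is not the real QEMU RamDiscardManager or VFIO code);
it only shows the shape of the extra shared/private argument:

/*
 * Illustrative sketch only -- hypothetical names, not the actual QEMU API.
 * It shows how a listener could receive the shared/private attribute as an
 * extra argument and react differently for each state.
 */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

typedef enum MemoryAttribute {
    MEM_ATTR_SHARED,    /* accessible; should be mapped in the IOMMU   */
    MEM_ATTR_PRIVATE,   /* guest_memfd-backed; not mapped in userspace */
} MemoryAttribute;

typedef struct StateChangeListener StateChangeListener;
struct StateChangeListener {
    void (*notify_convert)(StateChangeListener *scl, uint64_t offset,
                           uint64_t size, MemoryAttribute to);
};

/* What a VFIO-like user might do on a conversion notification. */
static void vfio_notify_convert(StateChangeListener *scl, uint64_t offset,
                                uint64_t size, MemoryAttribute to)
{
    (void)scl;  /* unused in this toy example */

    if (to == MEM_ATTR_SHARED) {
        printf("DMA map   [0x%" PRIx64 ", +0x%" PRIx64 ")\n", offset, size);
    } else {
        printf("DMA unmap [0x%" PRIx64 ", +0x%" PRIx64 ")\n", offset, size);
    }
}

int main(void)
{
    StateChangeListener vfio = { .notify_convert = vfio_notify_convert };

    /* Guest converts a 2M range to shared and later back to private. */
    vfio.notify_convert(&vfio, 0x80000000ULL, 0x200000ULL, MEM_ATTR_SHARED);
    vfio.notify_convert(&vfio, 0x80000000ULL, 0x200000ULL, MEM_ATTR_PRIVATE);
    return 0;
}

With something like this, the same listener could map on private->shared
conversion and unmap on shared->private conversion, instead of only
reacting to populate/discard.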

> 
> Not sure that "RamStateManager" terminology would be reasonable in that
> approach.
> 
>>
>>>
>>>>
>>>> Testing
>>>> =======
>>>> This patch series is tested based on the internal TDX KVM/QEMU tree.
>>>>
>>>> To facilitate shared device assignment with the NIC, employ the legacy
>>>> type1 VFIO with the QEMU command:
>>>>
>>>> qemu-system-x86_64 [...]
>>>>       -device vfio-pci,host=XX:XX.X
>>>>
>>>> The dma_entry_limit parameter needs to be adjusted. For example, a
>>>> 16GB guest needs to adjust the parameter to something like
>>>> vfio_iommu_type1.dma_entry_limit=4194304.
>>>
>>> But here you note the biggest real issue I see (not related to
>>> RamDiscardManager, but that we have to prepare for conversion of each
>>> possible private page to shared and back): we need a separate IOMMU
>>> mapping for each 4 KiB page.
>>>
>>> Doesn't that mean that we limit shared memory to 4194304*4096 == 16 GiB?
>>> Does it even scale then?
>>
>> The entry limit needs to be increased as the guest memory size
>> increases. For this issue, are you concerned that having too many
>> entries might bring some performance issues? Maybe we could introduce
>> some PV mechanism to coordinate with the guest to convert memory only at
>> 2M granularity. This may help mitigate the problem.
> 
> I've had this talk with Intel, because the 4K granularity is a pain. I
> was told that ship has sailed ... and we have to cope with random 4K
> conversions :(
> 
> The many mappings will likely add both memory and runtime overheads in
> the kernel. But we'll only know once we measure.

In the normal case, the main runtime overhead comes from private<->shared
flips in SWIOTLB, which defaults to 6% of memory with a maximum of
1 GByte. I think this overhead is acceptable. In non-default cases, e.g.
dynamically allocated DMA buffers, the runtime overhead will increase. As
for the memory overhead, it is indeed unavoidable.

Will these performance issues be a deal breaker for enabling shared
device assignment in this way?
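
For reference, a rough back-of-the-envelope calculation of the numbers
involved (a hypothetical worst case which assumes all guest memory could be
shared at once and one IOMMU mapping per conversion granule; the real
numbers depend on how much memory the guest actually converts):

/* Back-of-the-envelope numbers only; real costs depend on the kernel. */
#include <stdio.h>
#include <stdint.h>

#define KIB (1024ULL)
#define MIB (1024 * KIB)
#define GIB (1024 * MIB)
#define TIB (1024 * GIB)

int main(void)
{
    uint64_t guest_sizes[] = { 16 * GIB, 256 * GIB, 1 * TIB };
    size_t i;

    for (i = 0; i < sizeof(guest_sizes) / sizeof(guest_sizes[0]); i++) {
        uint64_t size = guest_sizes[i];
        /* SWIOTLB default for CoCo guests: 6% of memory, capped at 1 GByte. */
        uint64_t swiotlb = size * 6 / 100;

        if (swiotlb > 1 * GIB) {
            swiotlb = 1 * GIB;
        }
        printf("guest %4llu GiB: swiotlb %4llu MiB, worst-case mappings: "
               "%10llu @ 4K, %8llu @ 2M\n",
               (unsigned long long)(size / GIB),
               (unsigned long long)(swiotlb / MIB),
               (unsigned long long)(size / (4 * KIB)),
               (unsigned long long)(size / (2 * MIB)));
    }
    return 0;
}

At 4K granularity the mapping count grows quickly with guest size (about
268 million mappings for a 1 TiB guest in this worst case), while 2M
granularity would reduce it by a factor of 512, which is why the PV
coordination idea above might help.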

> 
> Key point is that even 4194304 "only" allows for 16 GiB. Imagine 1 TiB
> of shared memory :/
> 
>>
>>>
>>>
>>> There is the alternative of having in-place private/shared conversion
>>> when we also let guest_memfd manage some shared memory. It has plenty
>>> of downsides, but for the problem at hand it would mean that we don't
>>> discard on shared/private conversion.
>>>
>>> But whenever we want to convert memory shared->private we would
>>> similarly have to remove it from the IOMMU page tables via VFIO. (The
>>> in-place conversion will only be allowed if any additional references
>>> on a page are gone -- when it is inaccessible by userspace/kernel.)
>>
>> I'm not clear about this in-place private/shared conversion. Can you
>> elaborate a little bit? It seems this alternative changes the private
>> and shared memory management in the current guest_memfd?
> 
> Yes, there have been discussions about that, also in the context of
> supporting huge pages while allowing for the guest to still convert
> individual 4K chunks ...
> 
> A summary is here [1]. Likely more things will be covered at Linux
> Plumbers.
> 
> 
> [1]
> https://lore.kernel.org/kvm/20240712232937.2861788-1-ackerleytng@xxxxxxxxxx/
> 

Thanks for sharing.




