> From: Peter Xu
> Sent: Friday, April 24, 2020 10:20 PM
>
> On Fri, Apr 24, 2020 at 06:01:46AM +0000, Tian, Kevin wrote:
> > > From: Peter Xu <peterx@xxxxxxxxxx>
> > > Sent: Thursday, April 23, 2020 11:23 PM
> > >
> > > On Thu, Apr 23, 2020 at 06:28:43AM +0000, Tian, Kevin wrote:
> > > > > From: Peter Xu <peterx@xxxxxxxxxx>
> > > > > Sent: Thursday, April 23, 2020 2:52 AM
> > > > >
> > > > > Hi,
> > > > >
> > > > > TL;DR: I'm thinking whether we should record the pure GPA/GFN instead of
> > > > > the (slot_id, slot_offset) tuple for dirty pages in the kvm dirty ring, to
> > > > > unbind kvm_dirty_gfn from memslots.
> > > > >
> > > > > (A slightly longer version starts...)
> > > > >
> > > > > The problem is that binding dirty tracking operations to KVM memslots is a
> > > > > restriction that needs synchronization with memslot changes, which further
> > > > > needs synchronization across all the vcpus because they're the consumers
> > > > > of memslots.  E.g., when we remove a memory slot, we need to flush all the
> > > > > dirty bits correctly before we do the removal of the memslot.  That's
> > > > > actually a known defect for QEMU/KVM [1] (I bet it could be a defect for
> > > > > many other hypervisors...) right now with current dirty logging.
> > > > > Meanwhile, even if we fix it, that procedure does not scale at all, and is
> > > > > error prone to deadlocks.
> > > > >
> > > > > Here memory removal is really a (still corner-cased but relatively)
> > > > > important scenario to think about for dirty logging compared to memory
> > > > > additions & moves, because memory addition will always have no initial
> > > > > dirty page, and we don't really move RAM a lot (or do we ever?!) for a
> > > > > general VM use case.
> > > > >
> > > > > Then I went a step back to think about why we need this dirty bit
> > > > > information after all if the memslot is going to be removed.
> > > > >
> > > > > There're two cases:
> > > > >
> > > > > - When the memslot is going to be removed forever, then the dirty
> > > > >   information is indeed meaningless and can be dropped, and,
> > > > >
> > > > > - When the memslot is going to be removed but quickly added back with a
> > > > >   changed size, then we need to keep those dirty bits because it's just a
> > > > >   common way to e.g. punch an MMIO hole in an existing RAM region (here
> > > > >   I'd confess I feel like using "slot_id" to identify a memslot is really
> > > > >   unfriendly syscall design for things like "hole punching" in the RAM
> > > > >   address space...  However such a "hole punch" operation is really needed
> > > > >   even for a common guest, for either system reboots or device hotplugs,
> > > > >   etc.).
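Just to make sure I read the proposal right, the contrast is roughly the one
sketched below.  This is only a sketch: the first struct is my understanding of
what the dirty ring series records today (the exact field names/types in the
series may differ), and the second, GPA-based one is purely illustrative, not
an existing KVM ABI.

#include <linux/types.h>

/*
 * Today's entry (roughly): a dirty page is identified relative to a
 * memslot, so consuming the ring requires a stable view of the memslot
 * layout while it is drained.
 */
struct kvm_dirty_gfn {
        __u32 flags;
        __u32 slot;     /* memslot id */
        __u64 offset;   /* page offset within that memslot */
};

/*
 * Hypothetical GPA-based entry (illustrative only): the entry stands on
 * its own across memslot add/remove, at the cost of userspace having to
 * resolve GFN -> RAM block by itself.
 */
struct kvm_dirty_gfn_gpa {
        __u64 flags;
        __u64 gfn;      /* guest frame number, i.e. GPA >> PAGE_SHIFT */
};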
> > > > why would device hotplug punch a hole in an existing RAM region?
> > >
> > > I thought it could happen because I used to trace the KVM ioctls and see
> > > the memslot changes during driver loading.  But later when I tried to
> > > hotplug a
> >
> > Is there more detail on why driver loading may lead to memslot changes?
>
> E.g., I can observe these after Linux loads and before the prompt, on the
> simplest VM with default devices on:
>
> 41874@1587736345.192636:kvm_set_user_memory Slot#3 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@1587736345.192760:kvm_set_user_memory Slot#65539 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@1587736345.193884:kvm_set_user_memory Slot#3 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41874@1587736345.193956:kvm_set_user_memory Slot#65539 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41874@1587736345.195788:kvm_set_user_memory Slot#3 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@1587736345.195838:kvm_set_user_memory Slot#65539 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@1587736345.196769:kvm_set_user_memory Slot#3 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41874@1587736345.196827:kvm_set_user_memory Slot#65539 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41874@1587736345.197787:kvm_set_user_memory Slot#3 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@1587736345.197832:kvm_set_user_memory Slot#65539 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@1587736345.198777:kvm_set_user_memory Slot#3 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41874@1587736345.198836:kvm_set_user_memory Slot#65539 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41874@1587736345.200491:kvm_set_user_memory Slot#3 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@1587736345.200537:kvm_set_user_memory Slot#65539 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@1587736345.201592:kvm_set_user_memory Slot#3 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41874@1587736345.201649:kvm_set_user_memory Slot#65539 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41874@1587736345.202415:kvm_set_user_memory Slot#3 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@1587736345.202461:kvm_set_user_memory Slot#65539 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@1587736345.203169:kvm_set_user_memory Slot#3 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41874@1587736345.203225:kvm_set_user_memory Slot#65539 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41874@1587736345.204037:kvm_set_user_memory Slot#3 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@1587736345.204083:kvm_set_user_memory Slot#65539 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@1587736345.204983:kvm_set_user_memory Slot#3 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41874@1587736345.205041:kvm_set_user_memory Slot#65539 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41874@1587736345.205940:kvm_set_user_memory Slot#3 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@1587736345.206022:kvm_set_user_memory Slot#65539 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@1587736345.206981:kvm_set_user_memory Slot#3 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41874@1587736345.207038:kvm_set_user_memory Slot#65539 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41875@1587736351.141052:kvm_set_user_memory Slot#9 flags=0x1 gpa=0xa0000 size=0x10000 ua=0x7fadf6800000 ret=0
>
> After a careful look, I noticed it's only the VGA device, mostly turning
> slot 3 off & on.  Frankly speaking I don't know why it happens to do so.
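If it helps to decode the trace: each pair of lines is one
KVM_SET_USER_MEMORY_REGION call, alternating between deleting slot 3
(size=0x0) and adding it back with dirty logging enabled (flags=0x1, i.e.
KVM_MEM_LOG_DIRTY_PAGES).  A rough, untested sketch of one such pair from
userspace -- the vm_fd and the literal slot/addresses are just lifted from
the trace for illustration:

#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int toggle_vga_slot(int vm_fd)
{
        struct kvm_userspace_memory_region region;

        memset(&region, 0, sizeof(region));
        region.slot            = 3;
        region.guest_phys_addr = 0xfd000000;
        region.userspace_addr  = 0x7fadf6800000ULL;

        /* "flags=0x0 size=0x0": delete the slot. */
        region.memory_size = 0;
        if (ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region) < 0)
                return -1;

        /* "flags=0x1 size=0x1000000": add it back with dirty logging on. */
        region.flags       = KVM_MEM_LOG_DIRTY_PAGES;
        region.memory_size = 0x1000000;
        return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
}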
> > > device I do see that it won't...  The new MMIO regions are added only
> > > into 0xfe000000 for a virtio-net:
> > >
> > > 00000000fe000000-00000000fe000fff (prio 0, i/o): virtio-pci-common
> > > 00000000fe001000-00000000fe001fff (prio 0, i/o): virtio-pci-isr
> > > 00000000fe002000-00000000fe002fff (prio 0, i/o): virtio-pci-device
> > > 00000000fe003000-00000000fe003fff (prio 0, i/o): virtio-pci-notify
> > > 00000000fe840000-00000000fe84002f (prio 0, i/o): msix-table
> > > 00000000fe840800-00000000fe840807 (prio 0, i/o): msix-pba
> > >
> > > Does it mean that device plugging is guaranteed to not trigger RAM changes?
> >
> > I'd think so.  Otherwise, from the guest's p.o.v., any device hotplug would
> > imply doing a memory hot-unplug first, and that would be a bad design.
>
> Right, that's what I was confused about.  Then maybe you're right. :)
>
> > > I am really curious about what cases we need to consider in which we need
> > > to keep the dirty bits for a memory removal, and if system reset is the
> > > only case, then it could be even easier (because we might be able to avoid
> > > the sync in memory removal but do that once in a sys reset hook)...
> >
> > Possibly memory hot-unplug, as allowed by recent virtio-mem?
>
> That should belong to the case where dirty bits do not matter at all after
> the removal, right?  I would be mostly curious about when we (1) remove a
> memory slot, and at the same time (2) still care about the dirty bits of
> that slot.

I remember one case.  The PCIe spec defines a resizable BAR capability,
allowing hardware to communicate supported resource sizes and software to
set an optimal size back to hardware.  If such a capability is presented to
the guest and the BAR is backed by memory, it's possible to observe a
removal-and-add-back scenario.  However, in that case the spec requires
software to clear the memory space enable bit in the command register
before changing the BAR size, so I suppose such a thing should happen in
the boot phase, where dirty bits related to the old BAR size don't matter.

> I'll see whether I can remove the dirty bit sync in kvm_set_phys_mem(),
> which I think is really nasty.
>
> > btw VFIO faces a similar problem when unmapping a DMA range (e.g. when
> > vIOMMU is enabled) in the dirty log phase.  There could be some dirty bits
> > which are not retrieved when the unmapping happens.  VFIO chooses to return
> > the dirty bits in a buffer passed in the unmapping parameters.  Can the
> > memslot interface do a similar thing by allowing userspace to specify a
> > buffer pointer to hold whatever dirty pages are recorded for the slot that
> > is being removed?
>
> Yes, I think we can, but it may not be necessary.  Actually, IMHO CPU
> accesses to pages are slightly different from device DMAs, in that we can
> do this sequence to collect the dirty bits of a slot safely:
>
> - mark the slot as READONLY
> - KVM_GET_DIRTY_LOG on the slot
> - finally remove the slot
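If I read that sequence right, it maps to roughly the ioctls below.  This is
an untested sketch only: vm_fd and the slot geometry are placeholders, and
KVM may not allow toggling KVM_MEM_READONLY on an existing slot in place, so
step 1 might itself need a delete-and-recreate (or a KVM-side change).

#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int drain_dirty_then_remove(int vm_fd, __u32 slot, __u64 gpa,
                                   __u64 size, __u64 uaddr)
{
        struct kvm_userspace_memory_region region = {
                .slot            = slot,
                .flags           = KVM_MEM_LOG_DIRTY_PAGES | KVM_MEM_READONLY,
                .guest_phys_addr = gpa,
                .memory_size     = size,
                .userspace_addr  = uaddr,
        };
        struct kvm_dirty_log log = { .slot = slot };
        int ret;

        /* 1) mark the slot as READONLY: further guest writes exit to
         *    userspace instead of dirtying the slot behind our back. */
        ret = ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
        if (ret < 0)
                return ret;

        /* 2) KVM_GET_DIRTY_LOG on the slot: one bit per 4K page. */
        log.dirty_bitmap = calloc(1, (size / 4096 + 7) / 8);
        if (!log.dirty_bitmap)
                return -1;
        ret = ioctl(vm_fd, KVM_GET_DIRTY_LOG, &log);
        /* ... consume log.dirty_bitmap before freeing it ... */
        free(log.dirty_bitmap);
        if (ret < 0)
                return ret;

        /* 3) finally remove the slot (memory_size == 0 deletes it). */
        region.memory_size = 0;
        return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
}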
> I guess VFIO cannot do that, because there's no way to really "mark the
> region as read-only" for a device: DMA could still happen, and the DMA
> would then fail when writing to a read-only slot.
>
> On the KVM/CPU side, after we mark the slot as READONLY the CPU writes will
> page fault and fall back to QEMU userspace, and QEMU will take care of the
> writes (so those writes could be slow but still work).  So even if we mark
> it READONLY it won't fail the writes, they just fall back to QEMU.

Yes, you are right.

> Btw, since we're discussing VFIO dirty logging across memory removal... the
> unmapping DMA range you're talking about needs to be added back later,
> right?

Pure memory removal doesn't need an add-back.  Or are you specifically
referring to the removal-and-add-back case?

> Then what if the device does DMA after the removal but before it's added
> back?

Then it just gets an IOMMU page fault, but I guess I didn't get your real
question...

> I think this is a more general question, not just for dirty logging but
> also for when dirty logging is not enabled - I never understood how this
> could be fixed with existing facilities.

Thanks,
Kevin