Hi Sean and Marc,
On 10/28/22 2:30 AM, Marc Zyngier wrote:
On Thu, 27 Oct 2022 18:44:51 +0100,
Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
On Thu, Oct 27, 2022, Marc Zyngier wrote:
On Tue, 25 Oct 2022 18:47:12 +0100, Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
[...]
And ideally such bugs would be detected without relying on userspace
enabling dirty logging, e.g. the Hyper-V bug lurked for quite some
time and was only found when mark_page_dirty_in_slot() started
WARNing.
I'm ok if arm64 wants to let userspace shoot itself in the foot with
the ITS, but I'm not ok dropping the protections in the common
mark_page_dirty_in_slot().
One somewhat gross idea would be to let architectures override the
"there must be a running vCPU" rule, e.g. arm64 could toggle a flag
in kvm->arch in its kvm_write_guest_lock() to note that an expected
write without a vCPU is in-progress:
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 8c5c69ba47a7..d1da8914f749 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3297,7 +3297,10 @@ void mark_page_dirty_in_slot(struct kvm *kvm,
 	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
 
 #ifdef CONFIG_HAVE_KVM_DIRTY_RING
-	if (WARN_ON_ONCE(!vcpu) || WARN_ON_ONCE(vcpu->kvm != kvm))
+	if (!kvm_arch_allow_write_without_running_vcpu(kvm) && WARN_ON_ONCE(!vcpu))
+		return;
+
+	if (WARN_ON_ONCE(vcpu && vcpu->kvm != kvm))
 		return;
 #endif
 
@@ -3305,10 +3308,10 @@ void mark_page_dirty_in_slot(struct kvm *kvm,
 		unsigned long rel_gfn = gfn - memslot->base_gfn;
 		u32 slot = (memslot->as_id << 16) | memslot->id;
 
-		if (kvm->dirty_ring_size)
+		if (kvm->dirty_ring_size && vcpu)
 			kvm_dirty_ring_push(&vcpu->dirty_ring,
 					    slot, rel_gfn);
-		else
+		else if (memslot->dirty_bitmap)
 			set_bit_le(rel_gfn, memslot->dirty_bitmap);
 	}
 }
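On the arm64 side, that could then be as simple as something like the
below, though the flag name and the helper here are made up purely for
illustration (completely untested):

bool kvm_arch_allow_write_without_running_vcpu(struct kvm *kvm)
{
	/* Hypothetical flag, set only around expected vCPU-less writes. */
	return READ_ONCE(kvm->arch.vcpu_less_write_in_progress);
}

static int its_write_guest_without_vcpu(struct kvm *kvm, gpa_t gpa,
					const void *data, unsigned long len)
{
	int ret;

	/* The ITS table save path already runs with no vCPUs running. */
	WRITE_ONCE(kvm->arch.vcpu_less_write_in_progress, true);
	ret = kvm_write_guest(kvm, gpa, data, len);
	WRITE_ONCE(kvm->arch.vcpu_less_write_in_progress, false);

	return ret;
}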
I think this is equally wrong. Writes occur from both CPUs and devices
*concurrently*, and I don't see why KVM should keep ignoring this
pretty obvious fact.
Yes, your patch papers over the problem, and it can probably work if
the kvm->arch flag only gets set in the ITS saving code, which is
already exclusive of vcpus running.
But in the long run, with dirty bits being collected from the IOMMU
page tables or directly from devices, we will need a way to reconcile
the dirty tracking. The above doesn't quite cut it, unfortunately.
Oooh, are you referring to IOMMU page tables and devices _in the
guest_? E.g. if KVM itself were to emulate a vIOMMU, then KVM would
be responsible for updating dirty bits in the vIOMMU page tables.
No. I'm talking about the *physical* IOMMU, which is (with the correct
architecture revision and feature set) capable of providing its own
set of dirty bits, on a per-device, per-PTE basis. Once we enable
that, we'll need to be able to sink these bits into the bitmap and
provide a unified view of the dirty state to userspace.
Not that it really matters, but do we actually expect KVM to ever
emulate a vIOMMU? On x86 at least, in-kernel acceleration of vIOMMU
emulation seems more like VFIO territory.
I don't expect KVM/arm64 to fully emulate an IOMMU, but at least to
eventually provide the required filtering to enable a stage-1 SMMU to
be passed to a guest. This is the sort of thing pKVM needs to
implement for the host anyway, and going the extra mile to support
arbitrary guests outside of the pKVM context isn't much more work.
Regardless, I don't think the above idea makes it any more difficult
to support in-KVM emulation of non-CPU stuff, which IIUC is the ITS
case. I 100% agree that the above is a hack, but that's largely due
to the use of kvm_get_running_vcpu().
That I agree.
A slightly different alternative would be to have a completely separate
API for writing guest memory without an associated vCPU. I.e. start
building up proper device emulation support. Then the vCPU-based
APIs could yell if a vCPU isn't provided (or there is no running
vCPU in the current mess). And the device-based API could be
provided if and only if the architecture actually supports emulating
writes from devices, i.e. x86 would not opt in and so would not even
have access to the API.
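E.g. the device-originated dirty path could end up looking vaguely like
this (names invented here, purely to sketch the shape):

/*
 * Sketch only: a dirty-marking path that is explicitly device-originated,
 * so it never consults kvm_get_running_vcpu() and only sinks into the
 * bitmap.  Architectures that don't emulate device writes (e.g. x86)
 * simply wouldn't provide it.
 */
void kvm_device_mark_page_dirty(struct kvm *kvm, gfn_t gfn)
{
	struct kvm_memory_slot *memslot = gfn_to_memslot(kvm, gfn);

	if (memslot && memslot->dirty_bitmap)
		set_bit_le(gfn - memslot->base_gfn, memslot->dirty_bitmap);
}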
Which is what I was putting under the "major surgery" label in my
previous email.
Anyhow, for the purpose of unblocking Gavin's series, I suggest
adopting your per-arch opt-out suggestion as a stop-gap measure, and we
will then be able to bike-shed for weeks on what the shape of the
device-originated memory dirtying API should be.
It's really 'major surgery', and I would like to make sure I fully understand
'a completely separate API for writing guest memory without an associated vCPU'
before I start working on v7 of this series.
There are 7 functions and 2 macros involved, listed below. I assume Sean is
suggesting that we add another argument, perhaps named 'has_vcpu', to these
functions and macros? Sean, could you please confirm?
If I'm understanding correctly, a 'has_vcpu' argument would be added to these
functions and macros. All call sites would pass 'true' for it, except those in
vgic/its, which would pass 'false'. Please correct me if I'm wrong (see the
sketch after the list below).
int kvm_write_guest_page(struct kvm *kvm, gfn_t gfn, const void *data,
                         int offset, int len);
int kvm_write_guest(struct kvm *kvm, gpa_t gpa, const void *data,
                    unsigned long len);
int kvm_write_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
                           void *data, unsigned long len);
int kvm_write_guest_offset_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
                                  void *data, unsigned int offset,
                                  unsigned long len);
void kvm_vcpu_mark_page_dirty(struct kvm_vcpu *vcpu, gfn_t gfn);
void mark_page_dirty(struct kvm *kvm, gfn_t gfn);
void mark_page_dirty_in_slot(struct kvm *kvm, const struct kvm_memory_slot *memslot, gfn_t gfn);
#define __kvm_put_guest(kvm, gfn, offset, v)
#define kvm_put_guest(kvm, gpa, v)
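In other words, something like the below, just to illustrate my reading of
the proposal (the parameter name and the helper are mine):

/* The writers would grow an extra argument ... */
int kvm_write_guest(struct kvm *kvm, gpa_t gpa, const void *data,
		    unsigned long len, bool has_vcpu);

/*
 * ... and the vgic/its table save path, which runs without any vCPU
 * loaded, would be the only caller passing 'false'.
 */
static int vgic_its_write_entry(struct kvm *kvm, gpa_t gpa,
				const void *entry, unsigned long esz)
{
	return kvm_write_guest(kvm, gpa, entry, esz, false);
}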
Thanks,
Gavin