On Tue, Nov 30, 2021 at 8:04 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
>
> On Tue, Nov 30, 2021 at 04:17:01PM -0800, David Matlack wrote:
> > On Tue, Nov 30, 2021 at 4:16 PM David Matlack <dmatlack@xxxxxxxxxx> wrote:
> > >
> > > On Fri, Nov 26, 2021 at 4:17 AM Peter Xu <peterx@xxxxxxxxxx> wrote:
> > > >
> > > > On Fri, Nov 19, 2021 at 11:57:57PM +0000, David Matlack wrote:
> > > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > > > index 6768ef9c0891..4e78ef2dd352 100644
> > > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > > > @@ -1448,6 +1448,12 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
> > > > >                 gfn_t start = slot->base_gfn + gfn_offset + __ffs(mask);
> > > > >                 gfn_t end = slot->base_gfn + gfn_offset + __fls(mask);
> > > > >
> > > > > +               /*
> > > > > +                * Try to proactively split any large pages down to 4KB so that
> > > > > +                * vCPUs don't have to take write-protection faults.
> > > > > +                */
> > > > > +               kvm_mmu_try_split_large_pages(kvm, slot, start, end, PG_LEVEL_4K);
> > > > > +
> > > > >                 kvm_mmu_slot_gfn_write_protect(kvm, slot, start, PG_LEVEL_2M);
> > > > >
> > > > >                 /* Cross two large pages? */
> > > > Is it intended to try split every time even if we could have split it already?
> > > > As I remember Paolo mentioned that we can skip split if it's not the 1st
> > > > CLEAR_LOG on the same range, and IIUC that makes sense.
> > > >
> > > > But indeed I don't see a trivial way to know whether this is the first clear of
> > > > this range. Maybe we can maintain "how many huge pages are there under current
> > > > kvm_mmu_page node" somehow? Then if root sp has the counter==0, then we can
> > > > skip it. Just a wild idea..
> > > >
> > > > Or maybe it's intended to try split unconditionally for some reason? If so
> > > > it'll be great to mention that either in the commit message or in comments.
> > > Thanks for calling this out. Could the same be said about the existing
> > > code that unconditionally tries to write-protect 2M+ pages?
>
> They're different because wr-protect can be restored (to be not-wr-protected)
> when vcpu threads write to the pages, so they need to be always done.

That's true for 4K pages, but not for write-protecting 2M+ pages (which is
what we're discussing here). Once KVM write-protects a 2M+ page, it should
never need to write-protect it again, but we always try to here. Same goes
with splitting.

>
> For huge page split - when it happened during dirty tracking it'll not be
> recovered anymore, so it's a one-time thing.
>
> > > I aimed to keep parity with the write-protection calls (always try to split
> > > before write-protecting) but I agree there might be opportunities available
> > > to skip altogether.
>
> So IMHO it's not about parity but it could be about how easy can it be
> implemented, and whether it'll be worth it to add that complexity.

Agreed.

>
> Besides the above accounting idea per-sp, we can have other ways to do this
> too, e.g., keeping a bitmap showing which range has been split: that bitmap
> will be 2M in granule for x86 because that'll be enough. We init-all-ones for
> the bitmap when start logging for a memslot.
>
> But again maybe it turns out we don't really want that complexity.
>
> IMHO a good start could be the perf numbers (which I asked in the cover letter)
> comparing the overhead of 2nd+ iterations of CLEAR_LOG with/without eager page
> split.

Ack. I'll be sure to include these in v1!
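
For reference, here is a rough userspace model of the bitmap idea you
describe above (all names and the layout are made up for illustration,
this is not actual KVM code): one bit per 2M chunk of the memslot, all
bits set when dirty logging starts, and a chunk's bit cleared the first
time we attempt to split it so that 2nd+ CLEAR_LOG calls can skip the
split walk entirely.

/*
 * Userspace model only -- illustrative names, not KVM code.
 */
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PAGES_PER_2M    512UL   /* 4K pages per 2M chunk */
#define BITS_PER_LONG   (sizeof(unsigned long) * CHAR_BIT)

struct slot_split_bitmap {
        unsigned long base_gfn;         /* first 4K gfn of the memslot */
        unsigned long npages;           /* memslot size in 4K pages */
        unsigned long *need_split;      /* 1 bit per 2M chunk */
};

/* Called when dirty logging starts: assume every chunk may be huge. */
static void split_bitmap_init(struct slot_split_bitmap *s,
                              unsigned long base_gfn, unsigned long npages)
{
        unsigned long chunks = (npages + PAGES_PER_2M - 1) / PAGES_PER_2M;
        unsigned long words = (chunks + BITS_PER_LONG - 1) / BITS_PER_LONG;

        s->base_gfn = base_gfn;
        s->npages = npages;
        s->need_split = malloc(words * sizeof(unsigned long));
        if (!s->need_split)
                abort();
        memset(s->need_split, 0xff, words * sizeof(unsigned long));
}

/* Returns 1 (and clears the bit) only on the first call for a chunk. */
static int test_and_clear_need_split(struct slot_split_bitmap *s,
                                     unsigned long gfn)
{
        unsigned long chunk = (gfn - s->base_gfn) / PAGES_PER_2M;
        unsigned long *word = &s->need_split[chunk / BITS_PER_LONG];
        unsigned long mask = 1UL << (chunk % BITS_PER_LONG);
        int was_set = !!(*word & mask);

        *word &= ~mask;
        return was_set;
}

int main(void)
{
        struct slot_split_bitmap s;

        split_bitmap_init(&s, 0x100000, 4096);  /* 16M slot at gfn 0x100000 */

        /* 1st CLEAR_LOG on the first 2M chunk: split work is still needed. */
        printf("1st clear: need split? %d\n",
               test_and_clear_need_split(&s, 0x100000));
        /* 2nd CLEAR_LOG on the same chunk: the split walk can be skipped. */
        printf("2nd clear: need split? %d\n",
               test_and_clear_need_split(&s, 0x100000));

        free(s.need_split);
        return 0;
}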
>
> > >
> > > By the way, looking at this code again I think I see some potential bugs:
> > > - I don't think I ever free split_caches in the initially-all-set case.
>
> I saw that it's freed in kvm_mmu_try_split_large_pages(), no?

Ah yes you are right. I misremembered how I implemented it and thought we
kept the split_caches around across calls to CLEAR_LOG. (We probably should
TBH. The current implementation is quite wasteful.)

>
> > > - What happens if splitting fails the CLEAR_LOG but succeeds the
> > > CLEAR_LOG?
> >
> > Gah, meant to say "first CLEAR_LOG" and "second CLEAR_LOG" here.
> >
> > > We would end up propagating the write-protection on the 2M
> > > page down to the 4K page. This might cause issues if using PML.
>
> Hmm looks correct.. I'm wondering what will happen with that.
>
> Firstly this should be rare as the 1st split should in 99% cases succeed.
>
> Then if split failed at the 1st attempt, we wr-protected sptes even during pml
> during the split. When written, we'll go the fast page fault and record the
> writes too, I think, as we'll apply dirty bit to the new spte so I think it'll
> just skip pml. Looks like we'll be using a mixture of pml+wp but all dirty
> will still be captured as expected?..

That's what I was hoping for. I'll double check for v1.

>
> There could be leftover wp when stopping dirty logging, but that seems not
> directly harmful too.

It'll make things a bit messed up, at least.

>
> --
> Peter Xu
>
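
Coming back to the split_caches point above: by "keeping them around" I
mean roughly the following (userspace model only, names are made up, the
real kvm_mmu_memory_cache API differs in the details) -- top the cache up
on each CLEAR_LOG and only free it when dirty logging is disabled, rather
than allocating and freeing it on every call.

/*
 * Illustrative-only sketch, not KVM code.
 */
#include <stdio.h>
#include <stdlib.h>

#define CACHE_CAPACITY 40

struct split_cache {
        int nobjs;                      /* objects currently preallocated */
        void *objs[CACHE_CAPACITY];
};

/* Refill up to capacity; cheap when the cache is already mostly full. */
static int cache_topup(struct split_cache *c)
{
        while (c->nobjs < CACHE_CAPACITY) {
                void *obj = calloc(1, 4096);    /* stand-in for a page-table page */

                if (!obj)
                        return -1;
                c->objs[c->nobjs++] = obj;
        }
        return 0;
}

static void *cache_alloc(struct split_cache *c)
{
        return c->nobjs ? c->objs[--c->nobjs] : NULL;
}

static void cache_free_all(struct split_cache *c)
{
        while (c->nobjs)
                free(c->objs[--c->nobjs]);
}

int main(void)
{
        struct split_cache cache = { 0 };
        int call;

        /* The cache survives across CLEAR_LOG calls; each call only tops it up. */
        for (call = 1; call <= 3; call++) {
                if (cache_topup(&cache))
                        return 1;
                printf("CLEAR_LOG %d: %d objects ready\n", call, cache.nobjs);
                /*
                 * Pretend the eager split consumed two objects (freed here
                 * only to keep this model leak-free).
                 */
                free(cache_alloc(&cache));
                free(cache_alloc(&cache));
        }

        /* Freed once, when dirty logging is disabled for the memslot. */
        cache_free_all(&cache);
        return 0;
}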