* Idea

The present bit of the page fault error code (PFEC.P) indicates whether the
page table is populated on all levels. If this bit is set, we know the page
fault was caused by the page-protection bits (e.g. the W/R bit) or by the
reserved bits.

In KVM, most page faults of this kind (PFEC.P = 1) can be fixed simply: the
faults caused by reserved bits (PFEC.P = 1 && PFEC.RSV = 1) have already been
filtered out in the fast mmio path, so all we need to do for the remaining
faults (PFEC.P = 1 && PFEC.RSV != 1) is to add the corresponding access
permission to the spte.

This patchset introduces a fast path to fix this kind of page fault: it runs
outside mmu-lock and does not need to walk the host page table to get the
mapping from gfn to pfn.

* Advantage

- it is really fast
  it fixes the page fault outside mmu-lock and uses a very lightweight way to
  avoid racing with other paths. Also, it fixes the page fault before
  gfn_to_pfn is called, which means no host page table walk.

- we can get lots of page faults with PFEC.P = 1 in KVM:

  - in the case of ept/npt
    after the shadow page table becomes stable (all gfns are mapped in the
    shadow page table; this is a short stage, since only one shadow page table
    is used and only a few pages are needed), almost all page faults are
    caused by write-protection (frame-buffer under Xwindow, migration); the
    other small part is caused by page merge/COW under KSM/THP.

    We do not expect it to fix the page faults caused by read-only host pages
    of KSM, since after COW all the sptes pointing to the gfn will be
    unmapped.

  - in the case of soft mmu
    - many spurious page faults due to lazily flushed tlbs
    - lots of write-protect page faults (dirty bit tracking for guest ptes,
      write-protected shadow page tables, frame-buffer under Xwindow,
      migration, ...)

* Implementation

We can freely walk the shadow page table between
walk_shadow_page_lockless_begin and walk_shadow_page_lockless_end; this
ensures that all the shadow pages are valid.
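As a rough illustration of the check described under "Idea", the error-code
test that decides whether a fault is a candidate for the fast path could look
like the sketch below. The PFERR_* bit positions are the architectural x86
page fault error code bits; the helper name fast_pf_candidate() is
hypothetical and is not taken from the patchset.

```c
#include <stdbool.h>
#include <stdint.h>

/* x86 page fault error code bits. */
#define PFERR_PRESENT_MASK (1u << 0)  /* P: translation was present */
#define PFERR_WRITE_MASK   (1u << 1)  /* W/R: access was a write */
#define PFERR_USER_MASK    (1u << 2)  /* U/S: access came from user mode */
#define PFERR_RSVD_MASK    (1u << 3)  /* RSVD: a reserved bit was set */

/*
 * Hypothetical helper: returns true when the fault is of the kind
 * "PFEC.P = 1 && PFEC.RSV != 1", i.e. the page table is populated on
 * all levels and the fault is not a reserved-bit (mmio) fault.
 */
bool fast_pf_candidate(uint32_t error_code)
{
    if (!(error_code & PFERR_PRESENT_MASK))
        return false;   /* not-present fault: needs the slow path */
    if (error_code & PFERR_RSVD_MASK)
        return false;   /* reserved-bit fault: handled as mmio */
    return true;        /* only the access bits of the spte are wrong */
}
```

A write fault on a present, write-protected page (P | W/R) passes this test,
while a not-present fault or a reserved-bit fault does not.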
In most cases, cmpxchg is enough to change the access bits of the spte, but
the write-protect path on softmmu/nested mmu is a special case: it is a
read-check-modify path: read the spte, check the W bit, then clear the W bit.
To avoid marking the spte writable during or after page write-protection, we
use the trick below:

      fast page fault path:
            lock RCU
            set identification in the spte
            smp_mb()
            if (!rmap.PTE_LIST_WRITE_PROTECT)
                 cmpxchg + w - vcpu-id
            unlock RCU

      write protect path:
            lock mmu-lock
            set rmap.PTE_LIST_WRITE_PROTECT
            smp_mb()
            if (spte.w || spte has identification)
                 clear w bit and identification
            unlock mmu-lock

Setting the identification in the spte notifies the write-protect path that
it must modify the spte, so that the fast path sees the change in its
cmpxchg.

Setting the identification is itself a trick: it only sets the last bits of
the spte, which neither changes the mapping nor loses the cpu status bits.

The identification must be unique to avoid the race below:

      VCPU 0                  VCPU 1                  VCPU 2
      lock RCU
      spte + identification
      check conditions
                              do write-protect,
                              clear identification
                                                      lock RCU
                                                      set identification
      cmpxchg + w - identification
          OOPS!!!

We choose the vcpu id as the unique value. Currently, 254 vcpus on VMX and
127 vcpus on softmmu can use the fast path. Keep it simple at first. :)

* Performance

This patchset introduces a full memory barrier on the page write-protect
path. I ran kernbench in text mode, which does not generate frame-buffer
write-protect page faults and therefore does not benefit from the
optimization introduced by this patchset; it shows no regression.
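To show why that full barrier is needed, here is a minimal user-space sketch
of the handshake between the two paths, written with C11 atomics instead of
kernel primitives. Each side first publishes its own marker (the
identification in the spte, or the write-protect flag), issues a full
barrier, and only then reads the other side's marker, so at least one of them
is guaranteed to notice the other. Everything here is an assumption made for
illustration: the struct, the function names, the position of the
identification field (ID_SHIFT/ID_MASK), and the requirement that the id be
non-zero. It is not the code of the patchset.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SPTE_W    (1ull << 1)            /* writable bit */
#define ID_SHIFT  52                     /* hypothetical software-available bits */
#define ID_MASK   (0xffull << ID_SHIFT)  /* hypothetical 8-bit identification */

struct gfn_state {
    _Atomic uint64_t spte;
    atomic_bool write_protected;  /* stands in for rmap.PTE_LIST_WRITE_PROTECT */
};

/* Fast path, step 1: publish a non-zero identification in the spte. */
bool fast_pf_tag(struct gfn_state *s, unsigned id, uint64_t *tagged)
{
    uint64_t old = atomic_load(&s->spte);

    *tagged = (old & ~ID_MASK) | ((uint64_t)id << ID_SHIFT);
    return atomic_compare_exchange_strong(&s->spte, &old, *tagged);
}

/* Fast path, step 2: barrier, check the flag, then "cmpxchg + w - id". */
bool fast_pf_commit(struct gfn_state *s, uint64_t tagged)
{
    atomic_thread_fence(memory_order_seq_cst);   /* plays the role of smp_mb() */
    if (atomic_load(&s->write_protected))
        return false;                            /* give up, take the slow path */
    /* Fails if the write-protect path cleared our identification meanwhile. */
    return atomic_compare_exchange_strong(&s->spte, &tagged,
                                          (tagged | SPTE_W) & ~ID_MASK);
}

/* Write-protect path (the real one runs under mmu-lock). */
void write_protect(struct gfn_state *s)
{
    atomic_store(&s->write_protected, true);
    atomic_thread_fence(memory_order_seq_cst);   /* plays the role of smp_mb() */
    uint64_t v = atomic_load(&s->spte);
    if (v & (SPTE_W | ID_MASK))
        atomic_fetch_and(&s->spte, ~(SPTE_W | ID_MASK));
}
```

If write_protect() runs between fast_pf_tag() and fast_pf_commit(), the
commit is refused and the spte stays read-only; without any interference the
commit sets the W bit and removes the identification.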
Here are the results from x11perf and from the autotest migration test:

x11perf (x11perf -repeat 10 -comppixwin500):
(Host: Intel(R) Core(TM) i5-2540M CPU @ 2.60GHz * 4 + 4G
 Guest: 4 vcpus + 1G)

- For ept:
$ x11perfcomp baseline-hard optimaze-hard
1: baseline-hard
2: optimaze-hard

     1         2    Operation
--------  --------  ---------
  7060.0    7150.0  Composite 500x500 from pixmap to window

- For shadow mmu:
$ x11perfcomp baseline-soft optimaze-soft
1: baseline-soft
2: optimaze-soft

     1         2    Operation
--------  --------  ---------
  6980.0    7490.0  Composite 500x500 from pixmap to window

(Interestingly, after this patch x11perf performs better on softmmu than on
hardmmu. I have tested this many times, and it really is true. :) )

autotest migration:
(Host: Intel(R) Xeon(R) CPU X5690 @ 3.47GHz * 12 + 32G)

- For ept:

Before:
                    smp2.Fedora.16.64.migrate
Times   .unix      .with_autotest.dbench.unix     total
 1       102           204                         309
 2       68            203                         275
 3       67            218                         289

After:
                    smp2.Fedora.16.64.migrate
Times   .unix      .with_autotest.dbench.unix     total
 1       103           189                         295
 2       67            188                         259
 3       64            202                         271

- For shadow mmu:

Before:
                    smp2.Fedora.16.64.migrate
Times   .unix      .with_autotest.dbench.unix     total
 1       102           262                         368
 2       68            220                         292
 3       68            234                         307

After:
                    smp2.Fedora.16.64.migrate
Times   .unix      .with_autotest.dbench.unix     total
 1       104           231                         341
 2       68            218                         289
 3       66            205                         275

Any comments are welcome. :)