To keep the shadow pages consistent, we write-protect a guest page if it is used as a page-table page. Unfortunately, even after the guest page table has been torn down and the page is reused for something else, it stays write-protected, so any write to it still causes a write-protection fault. In that case we need to zap the corresponding shadow pages and let the guest page become a normal page as soon as possible; that is exactly what kvm_mmu_pte_write does. However, it does not always work well:

- kvm_mmu_pte_write is unsafe: it needs to allocate pte_list_desc objects when sptes are prefetched, but we cannot know in advance how many sptes will be prefetched on this path, so the free pte_list_desc objects in the cache can run out and BUG_ON() is triggered. Also, some paths do not fill the cache at all, for example an emulated INS instruction that is not triggered by a page fault.

- Repeat string instructions are commonly used to clear a page. For example, memset is called to clear a page table; it uses 'stosb' repeated 1024 times, which means we take the mmu lock and walk the shadow page cache 1024 times. That is terrible.

- Sometimes we modify only the last byte of a pte to update a status bit. For example, the Linux kernel uses clear_bit to clear the R/W bit, which is emitted as an 'andb' instruction. In this case kvm_mmu_pte_write treats the write as a misaligned access and zaps the shadow page table.

- Write-flooding detection does not work well. When we handle a write to a guest page-table page, we treat the page as write-flooded if the last speculative spte is not accessed. However, sptes can be speculated on many paths (pte prefetch, page sync, ...), so the last speculative spte may not point to the written page, and the written page may still be accessed via other sptes. Relying on the Accessed bit of the last speculative spte alone is therefore not enough.

In this patchset we fix/avoid these issues:

- Instead of filling the cache on the page fault path, we fill it in kvm_mmu_pte_write, and we do not prefetch a spte if there is no free pte_list_desc object left in the cache.

- If a repeat string instruction is being emulated and it is not an IO/MMIO access, we zap all the corresponding shadow pages and return to the guest; the mapping can then become writable and the guest can write the page directly.

- Do not zap the shadow page if the write only modifies the last byte of a pte.

- Instead of detecting whether the page was accessed, we detect whether the spte is accessed. If the spte is not accessed but is written frequently, we treat the page as not being a page table, or as not having been used as one for a long time.

Performance test result: kernbench shows an obvious improvement:

Before patchset     After patchset
3m0.094s            2m50.177s
3m1.813s            2m52.774s
3m6.239s            2m51.512s
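
For illustration only (not part of the patchset), here is a minimal guest-side sketch of the two write patterns described above, assuming an x86-64 Linux guest with 4K pages and 8-byte ptes; guest_write_patterns, pt_page and the bit number are hypothetical stand-ins for a real page-table page and the R/W bit:

/* Illustration only: hypothetical guest-side code, not taken from the patchset. */
#include <linux/types.h>
#include <linux/string.h>
#include <linux/bitops.h>

#define GUEST_PTES_PER_PAGE 512	/* 4K page / 8-byte pte */

static void guest_write_patterns(u64 *pt_page)
{
	/*
	 * Pattern 1: clearing a whole page-table page.  memset is
	 * typically compiled to a repeat string instruction, so while
	 * the page is still write-protected by the host, every single
	 * repeated store is trapped and emulated separately.
	 */
	memset(pt_page, 0, GUEST_PTES_PER_PAGE * sizeof(*pt_page));

	/*
	 * Pattern 2: updating one status bit of a single pte, e.g.
	 * clearing the R/W bit (bit 1).  clear_bit on x86 is emitted
	 * as a byte-wide 'and', so only one byte of the pte is
	 * actually written.
	 */
	clear_bit(1, (unsigned long *)&pt_page[0]);
}

With the patchset applied, the first pattern is handled by zapping the shadow pages once the repeat string emulation is detected, and the second no longer zaps the shadow page just because the access is narrower than a pte.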