On Thu, Aug 06, 2020 at 11:19:55AM +0200, Vitaly Kuznetsov wrote:
> "Michael S. Tsirkin" <mst@xxxxxxxxxx> writes:
>
> > On Tue, Jul 28, 2020 at 04:37:38PM +0200, Vitaly Kuznetsov wrote:
> >> This is a continuation of "[PATCH RFC 0/5] KVM: x86: KVM_MEM_ALLONES
> >> memory" work:
> >> https://lore.kernel.org/kvm/20200514180540.52407-1-vkuznets@xxxxxxxxxx/
> >> and pairs with Julia's "x86/PCI: Use MMCONFIG by default for KVM guests":
> >> https://lore.kernel.org/linux-pci/20200722001513.298315-1-jusual@xxxxxxxxxx/
> >>
> >> PCIe config space can (depending on the configuration) be quite big but
> >> is usually sparsely populated. A guest may scan it by accessing each
> >> individual device's page which, when the device is missing, is supposed
> >> to have 'PCI hole' semantics: reads return 0xff and writes get discarded.
> >>
> >> When testing Linux kernel boot with a QEMU q35 VM and direct kernel boot,
> >> I observed 8193 accesses to PCI hole memory. When such an exit is handled
> >> in KVM without exiting to userspace, it takes roughly 0.000001 sec.
> >> Handling the same exit in userspace is six times slower (0.000006 sec),
> >> so the overall difference is 0.04 sec. This may be significant for
> >> 'microvm' ideas.
> >>
> >> Note that the same speed can already be achieved by using KVM_MEM_READONLY,
> >> but doing this would require allocating real memory for all missing
> >> devices, and e.g. 8192 pages gives us 32 MB. This would have to be
> >> allocated for each guest separately, and for 'microvm' use-cases this is
> >> likely a no-go.
> >>
> >> Introduce special KVM_MEM_PCI_HOLE memory: userspace doesn't need to
> >> back it with real memory, all reads from it are handled inside KVM and
> >> return 0xff. Writes still go to userspace, but these should be extremely
> >> rare.
> >>
> >> The original 'KVM_MEM_ALLONES' idea had additional optimizations: KVM
> >> was mapping all 'PCI hole' pages to a single read-only page stuffed with
> >> 0xff. This is omitted in this submission as the benefits are unclear:
> >> KVM would have to allocate SPTEs (either on demand or aggressively), and
> >> this also consumes time/memory.
> >
> > Curious about this: if we do it aggressively on the 1st fault,
> > how long does it take to allocate 256 huge page SPTEs?
> > And the amount of memory seems pretty small then, right?
>
> Right, this could work, but we'll need a 2M region (one per KVM host of
> course) filled with 0xff-s instead of a single 4k page.

Given that it's global, that doesn't sound too bad.

> Generally, I'd like to reach an agreement on whether this feature (and
> Julia's corresponding patch adding the PV feature bit) is worthwhile. In
> case it is (meaning it gets merged in this simplest form), we can
> suggest further improvements. It would also help if firmware (SeaBIOS,
> OVMF) started recognizing the PV feature bit too; that way we'd see an
> even bigger improvement, and this may or may not be a deal-breaker
> when it comes to the 'aggressive PTE mapping' idea.

About the feature bit, I am not sure why it's really needed.
A single MMIO access is cheaper than two IO accesses anyway, right?
So it makes sense for a KVM guest whether or not the host has this feature.
We need to be careful and limit this to a specific QEMU implementation
to avoid tripping up bugs, but it seems more appropriate to check for
that using PCI host IDs.

> --
> Vitaly
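
For reference, the KVM_MEM_READONLY alternative mentioned in the cover letter
boils down to roughly the following on the userspace side: mmap a 0xff-filled
buffer and register it as a read-only slot. This is a minimal sketch, not the
posted patch; the slot number, guest physical address and window size are
illustrative only, and error handling is omitted.

/*
 * Sketch of the KVM_MEM_READONLY fallback: back the whole 'PCI hole'
 * with host memory filled with 0xff and register it read-only, so reads
 * are satisfied without a userspace exit while writes still exit.  The
 * cost is real backing memory per guest (8192 pages == 32 MB in the
 * example above), which the proposed KVM_MEM_PCI_HOLE flag would avoid.
 */
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/kvm.h>

static int back_pci_hole_readonly(int vm_fd)
{
	const __u64 hole_gpa  = 0xb0000000;     /* illustrative MMCONFIG base */
	const __u64 hole_size = 8192ULL * 4096; /* 32 MB of 'missing device' pages */

	void *backing = mmap(NULL, hole_size, PROT_READ | PROT_WRITE,
			     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (backing == MAP_FAILED)
		return -1;

	/* 'PCI hole' semantics: every byte reads back as 0xff. */
	memset(backing, 0xff, hole_size);

	struct kvm_userspace_memory_region region = {
		.slot            = 10,               /* illustrative free slot */
		.flags           = KVM_MEM_READONLY, /* writes exit to userspace */
		.guest_phys_addr = hole_gpa,
		.memory_size     = hole_size,
		.userspace_addr  = (__u64)(unsigned long)backing,
	};

	return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
}

With the proposed flag, the mmap()/memset() and the per-guest backing
allocation would go away and the slot would only carry the new flag.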