On Jul 10, 2022, at 11:31 PM, Ajay Kaher <akaher@xxxxxxxxxx> wrote:

> On 09/07/22, 1:19 AM, "Nadav Amit" <namit@xxxxxxxxxx> wrote:
>
>> On Jul 8, 2022, at 11:43 AM, Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>
>>> I have no misconceptions about whatever you want to call the mechanism
>>> for communicating with the hypervisor at a higher level than "prod this
>>> byte". For example, one of the more intensive things we use config
>>> space for is sizing BARs. If we had a hypercall to size a BAR, that
>>> would eliminate:
>>>
>>> - Read current value from BAR
>>> - Write all-ones to BAR
>>> - Read new value from BAR
>>> - Write original value back to BAR
>>>
>>> Bingo, one hypercall instead of 4 MMIO or 8 PIO accesses.
>
> To improve further we can have following mechanism:
> Map (as read only) the 'virtual device config i.e. 4KB ECAM' to
> VM MMIO. VM will have direct read access using MMIO but
> not using PIO.
>
> Virtual Machine test result with above mechanism:
> 1 hundred thousand read using raw_pci_read() took:
> PIO: 12.809 Sec.
> MMIO: 0.010 Sec.
>
> And while VM booting, PCI scan and initialization time have been
> reduced by ~65%. In our case it reduced to ~18 mSec from ~55 mSec.
>
> Thanks Matthew, for sharing history and your views on this patch.
>
> As you mentioned ordering change may impact some Hardware, so
> it's better to have this change for VMware hypervisor or generic to
> all hypervisor.

I was chatting with Ajay, since I personally did not fully understand his
use-case from the email. Others may have fully understood and can ignore
this email. Here is a short summary of my understanding:

During boot-time there are many PCI reads. Currently, when these reads are
performed by a virtual machine, they all cause a VM-exit, and therefore
each one of them induces a considerable overhead.

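To make the access count concrete, the BAR-sizing sequence that Matthew
lists in the quoted text can be sketched roughly as below. This is only an
illustration against a simulated register, not kernel code: the helpers
cfg_read_bar(), cfg_write_bar() and size_bar(), the 64 KiB BAR size and the
BAR address are all assumptions made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical emulated device: one 32-bit memory BAR decoding 64 KiB. */
#define BAR_SIZE      0x10000u
#define BAR_FLAG_MASK 0xFu   /* low 4 bits of a memory BAR are flag bits */

static uint32_t bar_reg = 0xFEB00000u; /* assumed current BAR value */
static unsigned access_count;          /* config accesses issued so far */

/* One config-space read of the BAR; in a VM this may be one VM-exit. */
static uint32_t cfg_read_bar(void)
{
    access_count++;
    return bar_reg;
}

/* One config-space write of the BAR. Address bits below the BAR size are
 * hard-wired to zero and the flag bits are read-only, which is what makes
 * the all-ones probe reveal the size. */
static void cfg_write_bar(uint32_t val)
{
    access_count++;
    bar_reg = (val & ~(BAR_SIZE - 1u)) | (bar_reg & BAR_FLAG_MASK);
}

/* The four steps from the quoted list: read, write all-ones, read, restore. */
static uint32_t size_bar(void)
{
    uint32_t orig = cfg_read_bar();       /* read current value */
    cfg_write_bar(0xFFFFFFFFu);           /* write all-ones */
    uint32_t probe = cfg_read_bar();      /* read back the size mask */
    cfg_write_bar(orig);                  /* restore original value */

    assert((probe & ~BAR_FLAG_MASK) != 0); /* BAR must be implemented */
    return ~(probe & ~BAR_FLAG_MASK) + 1u; /* lowest writable bit = size */
}
```

Calling size_bar() here returns 0x10000 and leaves access_count at 4. With
the legacy 0xCF8/0xCFC port mechanism each of those config accesses needs
an address write plus a data access, which is where the "8 PIO accesses"
comes from.
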
When using MMIO (but not PIO), it is possible to map the PCI BARs of the
virtual machine to some memory area that holds the values that the
"emulated hardware" is supposed to return. The memory region is mapped as
"read-only" in the NPT/EPT, so reads from these BAR regions would be
treated as regular memory reads. Writes would still be trapped and
emulated by the hypervisor.

I have a vague recollection from a similar project that I worked on 10
years ago that this might not work for certain emulated device registers.
For instance, some hardware registers, specifically those that report
hardware events, are "clear-on-read". Apparently, Ajay took that into
consideration.

The read-only mapping is the reason for this quite amazing difference,
several orders of magnitude, in the per-call overhead of raw_pci_read():
roughly 128us for PIO and 100ns for MMIO. Admittedly, I do not understand
why a PIO access would take that long (I would have expected it to be 10
times faster, at least), but the benefit is quite clear.
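For reference, the per-access figures follow from dividing Ajay's totals
by the 100,000 reads. A small sketch of that arithmetic is below; the
constants are the numbers quoted in this thread, and per_read_ns() is a
hypothetical helper written only for this example.

```c
#include <assert.h>
#include <stdio.h>

/* Totals quoted in the thread for 100,000 raw_pci_read() calls. */
#define NUM_READS      100000.0
#define PIO_TOTAL_SEC  12.809
#define MMIO_TOTAL_SEC 0.010

/* Average cost of one read, in nanoseconds. */
static double per_read_ns(double total_sec, double reads)
{
    assert(reads > 0.0);
    return total_sec / reads * 1e9;
}

/* Print the per-read cost of each access method and their ratio. */
static void print_latency_summary(void)
{
    double pio  = per_read_ns(PIO_TOTAL_SEC, NUM_READS);
    double mmio = per_read_ns(MMIO_TOTAL_SEC, NUM_READS);

    printf("PIO:  %.0f ns per read\n", pio);
    printf("MMIO: %.0f ns per read\n", mmio);
    printf("PIO/MMIO ratio: %.0f\n", pio / mmio);
}
```

This works out to about 128us per PIO read and about 100ns per MMIO read,
a ratio of roughly 1,300, i.e. three orders of magnitude.
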