Hi! On Wed, Jul 24, 2013 at 01:30:12PM +0300, Gleb Natapov wrote: > On Wed, Jul 24, 2013 at 12:25:18PM +0200, Alexander Graf wrote: > > > > On 24.07.2013, at 12:19, Gleb Natapov wrote: > > > > > On Wed, Jul 24, 2013 at 12:09:42PM +0200, Alexander Graf wrote: > > >> > > >> On 24.07.2013, at 12:01, Gleb Natapov wrote: > > >> > > >>> Copying Andrea for him to verify that I am not talking nonsense :) > > >>> > > >>> On Wed, Jul 24, 2013 at 10:25:20AM +0200, Alexander Graf wrote: > > >>>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > > >>>>> index 1580dd4..5e8635b 100644 > > >>>>> --- a/virt/kvm/kvm_main.c > > >>>>> +++ b/virt/kvm/kvm_main.c > > >>>>> @@ -102,6 +102,10 @@ static bool largepages_enabled = true; > > >>>>> > > >>>>> bool kvm_is_mmio_pfn(pfn_t pfn) > > >>>>> { > > >>>>> +#ifdef CONFIG_MEMORY_HOTPLUG > > >>>> > > >>>> I'd feel safer if we narrow this down to e500. > > >>>> > > >>>>> + /* > > >>>>> + * Currently only in memory hot remove case we may still need this. > > >>>>> + */ > > >>>>> if (pfn_valid(pfn)) { > > >>>> > > >>>> We still have to check for pfn_valid, no? So the #ifdef should be down here. > > >>>> > > >>>>> int reserved; > > >>>>> struct page *tail = pfn_to_page(pfn); > > >>>>> @@ -124,6 +128,7 @@ bool kvm_is_mmio_pfn(pfn_t pfn) > > >>>>> } > > >>>>> return PageReserved(tail); > > >>>>> } > > >>>>> +#endif > > >>>>> > > >>>>> return true; > > >>>>> } > > >>>>> > > >>>>> Before apply this change: > > >>>>> > > >>>>> real (1m19.954s + 1m20.918s + 1m22.740s + 1m21.146s + 1m22.120s)/5= 1m21.376s > > >>>>> user (0m23.181s + 0m23.550s + 0m23.506s + 0m23.410s + 0m23.520s)/5= 0m23.433s > > >>>>> sys (0m49.087s + 0m49.563s + 0m51.758s + 0m50.290s + 0m51.047s)/5= 0m50.349s > > >>>>> > > >>>>> After apply this change: > > >>>>> > > >>>>> real (1m19.507s + 1m20.919s + 1m21.436s + 1m21.179s + 1m20.293s)/5= 1m20.667s > > >>>>> user (0m22.595s + 0m22.719s + 0m22.484s + 0m22.811s + 0m22.467s)/5= 0m22.615s > > >>>>> sys (0m48.841s + 0m49.929s + 0m50.310s + 0m49.813s + 0m48.587s)/5= 0m49.496s > > >>>>> > > >>>>> So, > > >>>>> > > >>>>> real (1m20.667s - 1m21.376s)/1m21.376s x 100% = -0.6% > > >>>>> user (0m22.615s - 0m23.433s)/0m23.433s x 100% = -3.5% > > >>>>> sys (0m49.496s - 0m50.349s)/0m50.349s x 100% = -1.7% > > >>>> > > >>>> Very nice, so there is a real world performance benefit to doing this. Then yes, I think it would make sense to change the global helper function to be fast on e500 and use that one from e500_shadow_mas2_attrib() instead. > > >>>> > > >>>> Gleb, Paolo, any hard feelings? > > >>>> > > >>> I do not see how can we break the function in such a way and get > > >>> away with it. Not all valid pfns point to memory. Physical address can > > >>> be sparse (due to PCI hole, framebuffer or just because). > > >> > > >> But we don't check for sparseness today in here either. We merely check for incomplete huge pages. > > >> > > > That's not how I read the code. The code checks for reserved flag set. > > > It should be set on pfns that point to memory holes. As far as I > > > > I couldn't find any traces of code that sets the reserved bits on e500 chips though. I've only seen it getting set for memory hotplug and memory incoherent DMA code which doesn't get used on e500. > > > > But I'd be more than happy to get proven wrong :). > > > Can you write a module that scans all page structures? AFAIK all pages > are marked as reserved and then those that become regular memory are > marked as unreserved. Hope Andrea will chime in here :) So the situation with regard to non-RAM and PageReserved/pfn_valid is quite simple. "struct page" exists for non-RAM too as "struct page" must exist up to at least 2^MAX_ORDER pfn alignment or things breaks, like the first pfn must be 2^MXA_ORDER aligned or again things break in the buddy. We don't make an effort to save a few "struct page" to keep it simpler. But those non-RAM pages (or tiny non-RAM page holes if any) are marked PageReserved. If "struct page" doesn't exist pfn_valid returns false. So you cannot get away skipping pfn_valid and at least one PageReserved. However it gets more complex than just ram vs non-RAM, because there are pages that are real RAM (not left marked PageReserved at boot after checking e820 or equivalent bios data for non-x86 archs) but that are taken over by drivers, than then could use it as mmio regions snooping the writes and mapping them in userland too as hugepages maybe. That is the motivation for the THP related code in kvm_is_mmio_pfn. Those vmas have VM_PFNMAP set so vm_normal_page is zero and the refcounting is skipped like if it's non-RAM and they're mapped with remap_pfn_range (different mechanism for VM_MIXEDMAP that does the refcounting and doesn't require in turn the driver to mark the page PageReserved). The above explains why KVM needs to skip the refcounting on PageReserved == true && pfn_valid() == true, and it must skip the refcounting for pfn_valid == false without trying to call pfn_to_page (or it'll crash). Now the code doing the THP check with smp_rmb is very safe, possibly too safe. Looking at it now, it looks a minor overengineering oversight. The slight oversight is that split_huge_page cannot transfer the PG_reserved bit from head to tail. So there's no real risk that the driver allocates an hugepage, marks the head reserved (the PG_ bits of a THP page are only relevant in the head), maps the page with some new version of remap_pfn_range_huge (not possible right now, PFNMAP|MIXEDMAP only can handle 4k mappings right now) and then split_huge_page runs and we miss the reserved bit on the tail page. Because the reserved bit wouldn't be transferred to the tail page anyway by split_huge_page so we'd miss it anyway if anything like that would happen. Besides split_huge_page couldn't run on a device owned page as it's not anonymous but device-owned and there's no way to map it with a hugepmd too. So in short, it's probably never going to help to have such a check there. We can probably optimize away the THP code in there. No matter how the driver maps this hypotetic new type of reserved hugepage in userland, it should never allow split_huge_page to run on it, and then it should take care of marking all subpages as reserved too. And KVM won't need to worry about a driver setting reserved only on a head page anymore. Untested RFC patch follows. == >From 76927680df7034a575bed5da754f7ebe94481fb3 Mon Sep 17 00:00:00 2001 From: Andrea Arcangeli <aarcange@xxxxxxxxxx> Date: Thu, 25 Jul 2013 02:56:08 +0200 Subject: [PATCH] kvm: optimize away THP checks in kvm_is_mmio_pfn() The checks on PG_reserved in the page structure on head and tail pages aren't necessary because split_huge_page wouldn't transfer the PG_reserved bit from head to tail anyway. This was a forward-thinking check done in the case PageReserved was set by a driver-owned page mapped in userland with something like remap_pfn_range in a VM_PFNMAP region, but using hugepmds (not possible right now). It was meant to be very safe, but it's overkill as it's unlikely split_huge_page could ever run without the driver noticing and tearing down the hugepage itself. And if a driver in the future will really want to map a reserved hugepage in userland using an huge pmd it should simply take care of marking all subpages reserved too to keep KVM safe. This of course would require such a hypothetical driver to tear down the huge pmd itself and splitting the hugepage itself, instead of relaying on split_huge_page, but that sounds very reasonable, especially considering split_huge_page wouldn't currently transfer the reserved bit anyway. Signed-off-by: Andrea Arcangeli <aarcange@xxxxxxxxxx> --- virt/kvm/kvm_main.c | 24 ++---------------------- 1 file changed, 2 insertions(+), 22 deletions(-) diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 1580dd4..fa030fb 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -102,28 +102,8 @@ static bool largepages_enabled = true; bool kvm_is_mmio_pfn(pfn_t pfn) { - if (pfn_valid(pfn)) { - int reserved; - struct page *tail = pfn_to_page(pfn); - struct page *head = compound_trans_head(tail); - reserved = PageReserved(head); - if (head != tail) { - /* - * "head" is not a dangling pointer - * (compound_trans_head takes care of that) - * but the hugepage may have been splitted - * from under us (and we may not hold a - * reference count on the head page so it can - * be reused before we run PageReferenced), so - * we've to check PageTail before returning - * what we just read. - */ - smp_rmb(); - if (PageTail(tail)) - return reserved; - } - return PageReserved(tail); - } + if (pfn_valid(pfn)) + return PageReserved(pfn_to_page(pfn)); return true; } -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html