Ping, anyone, please? Ben needs ack from any of MM people before proceeding with this patch. Thanks! On 07/16/2013 10:53 AM, Alexey Kardashevskiy wrote: > The current VFIO-on-POWER implementation supports only user mode > driven mapping, i.e. QEMU is sending requests to map/unmap pages. > However this approach is really slow, so we want to move that to KVM. > Since H_PUT_TCE can be extremely performance sensitive (especially with > network adapters where each packet needs to be mapped/unmapped) we chose > to implement that as a "fast" hypercall directly in "real > mode" (processor still in the guest context but MMU off). > > To be able to do that, we need to provide some facilities to > access the struct page count within that real mode environment as things > like the sparsemem vmemmap mappings aren't accessible. > > This adds an API to increment/decrement page counter as > get_user_pages API used for user mode mapping does not work > in the real mode. > > CONFIG_SPARSEMEM_VMEMMAP and CONFIG_FLATMEM are supported. > > Cc: linux-mm@xxxxxxxxx > Reviewed-by: Paul Mackerras <paulus@xxxxxxxxx> > Signed-off-by: Paul Mackerras <paulus@xxxxxxxxx> > Signed-off-by: Alexey Kardashevskiy <aik@xxxxxxxxx> > > --- > > Changes: > 2013/07/10: > * adjusted comment (removed sentence about virtual mode) > * get_page_unless_zero replaced with atomic_inc_not_zero to minimize > effect of a possible get_page_unless_zero() rework (if it ever happens). > > 2013/06/27: > * realmode_get_page() fixed to use get_page_unless_zero(). If failed, > the call will be passed from real to virtual mode and safely handled. > * added comment to PageCompound() in include/linux/page-flags.h. > > 2013/05/20: > * PageTail() is replaced by PageCompound() in order to have the same checks > for whether the page is huge in realmode_get_page() and realmode_put_page() > > Signed-off-by: Alexey Kardashevskiy <aik@xxxxxxxxx> > --- > arch/powerpc/include/asm/pgtable-ppc64.h | 4 ++ > arch/powerpc/mm/init_64.c | 76 +++++++++++++++++++++++++++++++- > include/linux/page-flags.h | 4 +- > 3 files changed, 82 insertions(+), 2 deletions(-) > > diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h > index 46db094..aa7b169 100644 > --- a/arch/powerpc/include/asm/pgtable-ppc64.h > +++ b/arch/powerpc/include/asm/pgtable-ppc64.h > @@ -394,6 +394,10 @@ static inline void mark_hpte_slot_valid(unsigned char *hpte_slot_array, > hpte_slot_array[index] = hidx << 4 | 0x1 << 3; > } > > +struct page *realmode_pfn_to_page(unsigned long pfn); > +int realmode_get_page(struct page *page); > +int realmode_put_page(struct page *page); > + > static inline char *get_hpte_slot_array(pmd_t *pmdp) > { > /* > diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c > index d0cd9e4..dcbb806 100644 > --- a/arch/powerpc/mm/init_64.c > +++ b/arch/powerpc/mm/init_64.c > @@ -300,5 +300,79 @@ void vmemmap_free(unsigned long start, unsigned long end) > { > } > > -#endif /* CONFIG_SPARSEMEM_VMEMMAP */ > +/* > + * We do not have access to the sparsemem vmemmap, so we fallback to > + * walking the list of sparsemem blocks which we already maintain for > + * the sake of crashdump. In the long run, we might want to maintain > + * a tree if performance of that linear walk becomes a problem. > + * > + * Any of realmode_XXXX functions can fail due to: > + * 1) As real sparsemem blocks do not lay in RAM continously (they > + * are in virtual address space which is not available in the real mode), > + * the requested page struct can be split between blocks so get_page/put_page > + * may fail. > + * 2) When huge pages are used, the get_page/put_page API will fail > + * in real mode as the linked addresses in the page struct are virtual > + * too. > + */ > +struct page *realmode_pfn_to_page(unsigned long pfn) > +{ > + struct vmemmap_backing *vmem_back; > + struct page *page; > + unsigned long page_size = 1 << mmu_psize_defs[mmu_vmemmap_psize].shift; > + unsigned long pg_va = (unsigned long) pfn_to_page(pfn); > > + for (vmem_back = vmemmap_list; vmem_back; vmem_back = vmem_back->list) { > + if (pg_va < vmem_back->virt_addr) > + continue; > + > + /* Check that page struct is not split between real pages */ > + if ((pg_va + sizeof(struct page)) > > + (vmem_back->virt_addr + page_size)) > + return NULL; > + > + page = (struct page *) (vmem_back->phys + pg_va - > + vmem_back->virt_addr); > + return page; > + } > + > + return NULL; > +} > +EXPORT_SYMBOL_GPL(realmode_pfn_to_page); > + > +#elif defined(CONFIG_FLATMEM) > + > +struct page *realmode_pfn_to_page(unsigned long pfn) > +{ > + struct page *page = pfn_to_page(pfn); > + return page; > +} > +EXPORT_SYMBOL_GPL(realmode_pfn_to_page); > + > +#endif /* CONFIG_SPARSEMEM_VMEMMAP/CONFIG_FLATMEM */ > + > +#if defined(CONFIG_SPARSEMEM_VMEMMAP) || defined(CONFIG_FLATMEM) > +int realmode_get_page(struct page *page) > +{ > + if (PageCompound(page)) > + return -EAGAIN; > + > + if (!atomic_inc_not_zero(&page->_count)) > + return -EAGAIN; > + > + return 0; > +} > +EXPORT_SYMBOL_GPL(realmode_get_page); > + > +int realmode_put_page(struct page *page) > +{ > + if (PageCompound(page)) > + return -EAGAIN; > + > + if (!atomic_add_unless(&page->_count, -1, 1)) > + return -EAGAIN; > + > + return 0; > +} > +EXPORT_SYMBOL_GPL(realmode_put_page); > +#endif > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h > index 6d53675..98ada58 100644 > --- a/include/linux/page-flags.h > +++ b/include/linux/page-flags.h > @@ -329,7 +329,9 @@ static inline void set_page_writeback(struct page *page) > * System with lots of page flags available. This allows separate > * flags for PageHead() and PageTail() checks of compound pages so that bit > * tests can be used in performance sensitive paths. PageCompound is > - * generally not used in hot code paths. > + * generally not used in hot code paths except arch/powerpc/mm/init_64.c > + * and arch/powerpc/kvm/book3s_64_vio_hv.c which use it to detect huge pages > + * and avoid handling those in real mode. > */ > __PAGEFLAG(Head, head) CLEARPAGEFLAG(Head, head) > __PAGEFLAG(Tail, tail) > -- Alexey -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>