On 28.06.2018 14:55, David Hildenbrand wrote:
> On 27.06.2018 15:55, Janosch Frank wrote:
>> From: Janosch Frank <frankja@xxxxxxxxxxxxxxxxxx>
>>
>> Like for ptes, we also need invalidation notification for pmds, to
>> remove the fake page tables when they are split, and for the later
>> addition of shadowed pmds.
>
> I think the subject should rather be
>
> "s390/mm: split huge pages in GMAP when protecting"
>
> It would be helpful to explain why we have to split huge pages when
> protecting (complicated stuff we discussed). The pmdp_notify()
> introduction could be moved to a separate patch (and keep this subject).

I'll revise the commit message and have a look into splitting.

> AFAICS, transparent huge page handling could be fairly easy, no? Do you
> know what exactly we are missing to make it work? (assuming
> CMMA=SKEY=PFMFI=OFF - so PGSTE don't matter)

I have not looked into THP, Martin has had a look. One step at a time :)

>> With PMDs we do not have PGSTEs or some other bits we could use in the
>> host PMD. Instead we pick one of the free bits in the gmap PMD. Every
>> time a host pmd is invalidated, we will check if the respective gmap
>> PMD has the bit set and in that case fire up the notifier.
>>
>> Signed-off-by: Janosch Frank <frankja@xxxxxxxxxxxxxxxxxx>
[...]
>> @@ -63,6 +63,7 @@ static struct gmap *gmap_alloc(unsigned long limit)
>>  	INIT_LIST_HEAD(&gmap->crst_list);
>>  	INIT_LIST_HEAD(&gmap->children);
>>  	INIT_LIST_HEAD(&gmap->pt_list);
>> +	INIT_LIST_HEAD(&gmap->split_list);
>>  	INIT_RADIX_TREE(&gmap->guest_to_host, GFP_KERNEL);
>>  	INIT_RADIX_TREE(&gmap->host_to_guest, GFP_ATOMIC);
>>  	INIT_RADIX_TREE(&gmap->host_to_rmap, GFP_ATOMIC);
>> @@ -194,6 +195,10 @@ static void gmap_free(struct gmap *gmap)
>>  	gmap_radix_tree_free(&gmap->guest_to_host);
>>  	gmap_radix_tree_free(&gmap->host_to_guest);
>>
>> +	/* Free split pmd page tables */
>> +	list_for_each_entry_safe(page, next, &gmap->split_list, lru)
>> +		page_table_free_pgste(page);
>> +
>>  	/* Free additional data for a shadow gmap */
>>  	if (gmap_is_shadow(gmap)) {
>>  		/* Free all page tables. */
>> @@ -599,10 +604,15 @@ int __gmap_link(struct gmap *gmap, unsigned long gaddr, unsigned long vmaddr)
>>  	if (*table == _SEGMENT_ENTRY_EMPTY) {
>>  		rc = radix_tree_insert(&gmap->host_to_guest,
>>  				       vmaddr >> PMD_SHIFT, table);
>> -		if (!rc)
>> -			*table = pmd_val(*pmd);
>> -	} else
>> -		rc = 0;
>> +		if (!rc) {
>> +			if (pmd_large(*pmd)) {
>> +				*table = pmd_val(*pmd) &
>> +					_SEGMENT_ENTRY_HARDWARE_BITS_LARGE;
>> +			} else
>> +				*table = pmd_val(*pmd) &
>> +					_SEGMENT_ENTRY_HARDWARE_BITS;
>> +		}
>> +	}
>
> Does this part really belong in this patch? *confused*

Hmm, I'll move this to the enablement patch where we also remove the
EFAULT on huge pmds.
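For readers following the discussion, the mechanism described in the
commit message above amounts to roughly the following sketch. The bit
name _SEGMENT_ENTRY_GMAP_IN, the simplified locking and the omission
of split-pmd handling are assumptions for illustration, not
necessarily what the patch ends up doing:

void pmdp_notify(struct mm_struct *mm, unsigned long vmaddr)
{
	unsigned long *table, gaddr;
	struct gmap *gmap;

	rcu_read_lock();
	list_for_each_entry_rcu(gmap, &mm->context.gmap_list, list) {
		spin_lock(&gmap->guest_table_lock);
		/* Find the gmap segment table entry backed by this host pmd */
		table = radix_tree_lookup(&gmap->host_to_guest,
					  vmaddr >> PMD_SHIFT);
		if (!table || !(*table & _SEGMENT_ENTRY_GMAP_IN)) {
			spin_unlock(&gmap->guest_table_lock);
			continue;
		}
		/* Clear the notification bit and notify for the whole segment */
		gaddr = __gmap_segment_gaddr(table);
		*table &= ~_SEGMENT_ENTRY_GMAP_IN;
		spin_unlock(&gmap->guest_table_lock);
		gmap_call_notifier(gmap, gaddr, gaddr + HPAGE_SIZE - 1);
	}
	rcu_read_unlock();
}

__gmap_segment_gaddr() and gmap_call_notifier() are existing helpers
in arch/s390/mm/gmap.c; split pmds would need additional handling on
top of this.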
>>  	spin_unlock(&gmap->guest_table_lock);
>>  	spin_unlock(ptl);
>>  	radix_tree_preload_end();
>> @@ -833,7 +843,7 @@ static pte_t *gmap_pte_op_walk(struct gmap *gmap, unsigned long gaddr,
>>  }
>>
>>  /**
>> - * gmap_pte_op_fixup - force a page in and connect the gmap page table
>> + * gmap_fixup - force memory in and connect the gmap table entry
>>   * @gmap: pointer to guest mapping meta data structure
>>   * @gaddr: virtual address in the guest address space
>>   * @vmaddr: address in the host process address space
>> @@ -841,10 +851,10 @@ static pte_t *gmap_pte_op_walk(struct gmap *gmap, unsigned long gaddr,
>>   *
>>   * Returns 0 if the caller can retry __gmap_translate (might fail again),
>>   * -ENOMEM if out of memory and -EFAULT if anything goes wrong while fixing
>> - * up or connecting the gmap page table.
>> + * up or connecting the gmap table entry.
>>   */
>> -static int gmap_pte_op_fixup(struct gmap *gmap, unsigned long gaddr,
>> -			     unsigned long vmaddr, int prot)
>> +static int gmap_fixup(struct gmap *gmap, unsigned long gaddr,
>> +		      unsigned long vmaddr, int prot)
>>  {
>>  	struct mm_struct *mm = gmap->mm;
>>  	unsigned int fault_flags;
>> @@ -892,8 +902,11 @@ static inline pmd_t *gmap_pmd_op_walk(struct gmap *gmap, unsigned long gaddr)
>>  		return NULL;
>>  	}
>>
>> -	/* 4k page table entries are locked via the pte (pte_alloc_map_lock). */
>> -	if (!pmd_large(*pmdp))
>> +	/*
>> +	 * Non-split 4k page table entries are locked via the pte
>> +	 * (pte_alloc_map_lock).
>> +	 */
>> +	if (!gmap_pmd_is_split(pmdp) && !pmd_large(*pmdp))
>>  		spin_unlock(&gmap->guest_table_lock);
>>  	return pmdp;
>>  }
>> @@ -905,10 +918,77 @@ static inline pmd_t *gmap_pmd_op_walk(struct gmap *gmap, unsigned long gaddr)
>>   */
>>  static inline void gmap_pmd_op_end(struct gmap *gmap, pmd_t *pmdp)
>>  {
>> -	if (pmd_large(*pmdp))
>> +	if (pmd_large(*pmdp) || gmap_pmd_is_split(pmdp))
>>  		spin_unlock(&gmap->guest_table_lock);
>>  }
>>
>> +static pte_t *gmap_pte_from_pmd(struct gmap *gmap, pmd_t *pmdp,
>> +				unsigned long addr, spinlock_t **ptl)
>> +{
>> +	if (likely(!gmap_pmd_is_split(pmdp)))
>> +		return pte_alloc_map_lock(gmap->mm, pmdp, addr, ptl);
>> +
>> +	*ptl = NULL;
>> +	return pte_offset_map(pmdp, addr);
>> +}
>> +
>> +/**
>> + * gmap_pmd_split_free - Free a split pmd's page table
>> + * @pmdp The split pmd that we free of its page table
>> + *
>> + * If the userspace pmds are exchanged, we'll remove the gmap pmds as
>> + * well, so we fault on them and link them again. We would leak
>> + * memory, if we didn't free split pmds here.
>> + */
>> +static inline void gmap_pmd_split_free(pmd_t *pmdp)
>> +{
>> +	unsigned long pgt = pmd_val(*pmdp) & _SEGMENT_ENTRY_ORIGIN;
>> +	struct page *page;
>> +
>> +	if (gmap_pmd_is_split(pmdp)) {
>
> can this ever not be the case?

This function is not used in this patch. Look into the next one.

[...]
>>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
>> index 301e466e4263..7e1c17b1a24a 100644
>> --- a/arch/s390/mm/pgtable.c
>> +++ b/arch/s390/mm/pgtable.c
>> @@ -405,6 +405,8 @@ pmd_t pmdp_xchg_direct(struct mm_struct *mm, unsigned long addr,
>>  	pmd_t old;
>>
>>  	preempt_disable();
>> +	if (mm_has_pgste(mm))
>
> I am starting to wonder if mm_has_pgste(mm) is the right thing to check
> for. With huge pages we might even be able to start VMs completely
> without PGSTE. Right now this is an indication that "this is used by KVM".
>
> Would something like "mm_has_gmap()" be more clear?
Yes, after your allocate_pgste patch and considering hlp we should
rename the function and the mm context variable. Still, such a patch
will not be part of this patchset, and I'd appreciate it if we could
schedule it after its integration.

>> +		pmdp_notify(mm, addr);
>>  	old = pmdp_flush_direct(mm, addr, pmdp);
>>  	*pmdp = new;
>>  	preempt_enable();
>> @@ -418,6 +420,8 @@ pmd_t pmdp_xchg_lazy(struct mm_struct *mm, unsigned long addr,
>>  	pmd_t old;
>>
>>  	preempt_disable();
>> +	if (mm_has_pgste(mm))
>> +		pmdp_notify(mm, addr);
>>  	old = pmdp_flush_lazy(mm, addr, pmdp);
>>  	*pmdp = new;
>>  	preempt_enable();
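On the mm_has_gmap() naming question above, a purely illustrative
sketch of what such a helper could look like, keying off the existing
gmap_list in the s390 mm context. Whether an unlocked list_empty()
check is sufficient here is an assumption, not something this series
establishes:

/* Illustrative only: possible replacement for the mm_has_pgste() check */
static inline bool mm_has_gmap(struct mm_struct *mm)
{
	return !list_empty(&mm->context.gmap_list);
}

pmdp_xchg_direct()/pmdp_xchg_lazy() would then do
"if (mm_has_gmap(mm)) pmdp_notify(mm, addr);" instead of testing
mm_has_pgste().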