On 28.06.2018 14:55, David Hildenbrand wrote:
> On 27.06.2018 15:55, Janosch Frank wrote:
>> From: Janosch Frank <frankja@xxxxxxxxxxxxxxxxxx>
>>
>> Like for ptes, we also need invalidation notification for pmds, to
>> remove the fake page tables when they are split, and for the later
>> addition of shadowed pmds.
>
> I think the subject should rather be
>
> "s390/mm: split huge pages in GMAP when protecting"
>
> It would be helpful to explain why we have to split huge pages when
> protecting (complicated stuff we discussed). The pmdp_notify()
> introduction could be moved to a separate patch (and keep this subject).

I'll revise the commit message and have a look into splitting.

> AFAICS, transparent huge page handling could be fairly easy, no? Do you
> know what exactly we are missing to make it work? (assuming
> CMMA=SKEY=PFMFI=OFF - so PGSTE don't matter)

I have not looked into THP, Martin has had a look. One step at a time :)

>> With PMDs we do not have PGSTEs or some other bits we could use in the
>> host PMD. Instead we pick one of the free bits in the gmap PMD. Every
>> time a host pmd is invalidated, we will check if the respective gmap
>> PMD has the bit set and in that case fire up the notifier.
>>
>> Signed-off-by: Janosch Frank <frankja@xxxxxxxxxxxxxxxxxx>
[...]
>> @@ -63,6 +63,7 @@ static struct gmap *gmap_alloc(unsigned long limit)
>>  	INIT_LIST_HEAD(&gmap->crst_list);
>>  	INIT_LIST_HEAD(&gmap->children);
>>  	INIT_LIST_HEAD(&gmap->pt_list);
>> +	INIT_LIST_HEAD(&gmap->split_list);
>>  	INIT_RADIX_TREE(&gmap->guest_to_host, GFP_KERNEL);
>>  	INIT_RADIX_TREE(&gmap->host_to_guest, GFP_ATOMIC);
>>  	INIT_RADIX_TREE(&gmap->host_to_rmap, GFP_ATOMIC);
>> @@ -194,6 +195,10 @@ static void gmap_free(struct gmap *gmap)
>>  	gmap_radix_tree_free(&gmap->guest_to_host);
>>  	gmap_radix_tree_free(&gmap->host_to_guest);
>>
>> +	/* Free split pmd page tables */
>> +	list_for_each_entry_safe(page, next, &gmap->split_list, lru)
>> +		page_table_free_pgste(page);
>> +
>>  	/* Free additional data for a shadow gmap */
>>  	if (gmap_is_shadow(gmap)) {
>>  		/* Free all page tables. */
>> @@ -599,10 +604,15 @@ int __gmap_link(struct gmap *gmap, unsigned long gaddr, unsigned long vmaddr)
>>  	if (*table == _SEGMENT_ENTRY_EMPTY) {
>>  		rc = radix_tree_insert(&gmap->host_to_guest,
>>  				       vmaddr >> PMD_SHIFT, table);
>> -		if (!rc)
>> -			*table = pmd_val(*pmd);
>> -	} else
>> -		rc = 0;
>> +		if (!rc) {
>> +			if (pmd_large(*pmd)) {
>> +				*table = pmd_val(*pmd) &
>> +					_SEGMENT_ENTRY_HARDWARE_BITS_LARGE;
>> +			} else
>> +				*table = pmd_val(*pmd) &
>> +					_SEGMENT_ENTRY_HARDWARE_BITS;
>> +		}
>> +	}
>
> Does this part really belong in this patch? *confused*

Hmm, I'll move this to the enablement patch where we also remove the
EFAULT on huge pmds.
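For readers following the discussion, the mechanism described in the
commit message above amounts to roughly the following sketch. The bit
name _SEGMENT_ENTRY_GMAP_IN, the simplified locking and the omission
of split-pmd handling are assumptions for illustration, not
necessarily what the patch ends up doing:

void pmdp_notify(struct mm_struct *mm, unsigned long vmaddr)
{
	unsigned long *table, gaddr;
	struct gmap *gmap;

	rcu_read_lock();
	list_for_each_entry_rcu(gmap, &mm->context.gmap_list, list) {
		spin_lock(&gmap->guest_table_lock);
		/* Find the gmap segment table entry backed by this host pmd */
		table = radix_tree_lookup(&gmap->host_to_guest,
					  vmaddr >> PMD_SHIFT);
		if (!table || !(*table & _SEGMENT_ENTRY_GMAP_IN)) {
			spin_unlock(&gmap->guest_table_lock);
			continue;
		}
		/* Clear the notification bit and notify for the whole segment */
		gaddr = __gmap_segment_gaddr(table);
		*table &= ~_SEGMENT_ENTRY_GMAP_IN;
		spin_unlock(&gmap->guest_table_lock);
		gmap_call_notifier(gmap, gaddr, gaddr + HPAGE_SIZE - 1);
	}
	rcu_read_unlock();
}

__gmap_segment_gaddr() and gmap_call_notifier() are existing helpers
in arch/s390/mm/gmap.c; split pmds would need additional handling on
top of this.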
>>  	spin_unlock(&gmap->guest_table_lock);
>>  	spin_unlock(ptl);
>>  	radix_tree_preload_end();
>> @@ -833,7 +843,7 @@ static pte_t *gmap_pte_op_walk(struct gmap *gmap, unsigned long gaddr,
>>  }
>>
>>  /**
>> - * gmap_pte_op_fixup - force a page in and connect the gmap page table
>> + * gmap_fixup - force memory in and connect the gmap table entry
>>   * @gmap: pointer to guest mapping meta data structure
>>   * @gaddr: virtual address in the guest address space
>>   * @vmaddr: address in the host process address space
>> @@ -841,10 +851,10 @@ static pte_t *gmap_pte_op_walk(struct gmap *gmap, unsigned long gaddr,
>>   *
>>   * Returns 0 if the caller can retry __gmap_translate (might fail again),
>>   * -ENOMEM if out of memory and -EFAULT if anything goes wrong while fixing
>> - * up or connecting the gmap page table.
>> + * up or connecting the gmap table entry.
>>   */
>> -static int gmap_pte_op_fixup(struct gmap *gmap, unsigned long gaddr,
>> -			     unsigned long vmaddr, int prot)
>> +static int gmap_fixup(struct gmap *gmap, unsigned long gaddr,
>> +		      unsigned long vmaddr, int prot)
>>  {
>>  	struct mm_struct *mm = gmap->mm;
>>  	unsigned int fault_flags;
>> @@ -892,8 +902,11 @@ static inline pmd_t *gmap_pmd_op_walk(struct gmap *gmap, unsigned long gaddr)
>>  		return NULL;
>>  	}
>>
>> -	/* 4k page table entries are locked via the pte (pte_alloc_map_lock). */
>> -	if (!pmd_large(*pmdp))
>> +	/*
>> +	 * Non-split 4k page table entries are locked via the pte
>> +	 * (pte_alloc_map_lock).
>> +	 */
>> +	if (!gmap_pmd_is_split(pmdp) && !pmd_large(*pmdp))
>>  		spin_unlock(&gmap->guest_table_lock);
>>  	return pmdp;
>>  }
>> @@ -905,10 +918,77 @@ static inline pmd_t *gmap_pmd_op_walk(struct gmap *gmap, unsigned long gaddr)
>>   */
>>  static inline void gmap_pmd_op_end(struct gmap *gmap, pmd_t *pmdp)
>>  {
>> -	if (pmd_large(*pmdp))
>> +	if (pmd_large(*pmdp) || gmap_pmd_is_split(pmdp))
>>  		spin_unlock(&gmap->guest_table_lock);
>>  }
>>
>> +static pte_t *gmap_pte_from_pmd(struct gmap *gmap, pmd_t *pmdp,
>> +				unsigned long addr, spinlock_t **ptl)
>> +{
>> +	if (likely(!gmap_pmd_is_split(pmdp)))
>> +		return pte_alloc_map_lock(gmap->mm, pmdp, addr, ptl);
>> +
>> +	*ptl = NULL;
>> +	return pte_offset_map(pmdp, addr);
>> +}
>> +
>> +/**
>> + * gmap_pmd_split_free - Free a split pmd's page table
>> + * @pmdp The split pmd that we free of its page table
>> + *
>> + * If the userspace pmds are exchanged, we'll remove the gmap pmds as
>> + * well, so we fault on them and link them again. We would leak
>> + * memory, if we didn't free split pmds here.
>> + */
>> +static inline void gmap_pmd_split_free(pmd_t *pmdp)
>> +{
>> +	unsigned long pgt = pmd_val(*pmdp) & _SEGMENT_ENTRY_ORIGIN;
>> +	struct page *page;
>> +
>> +	if (gmap_pmd_is_split(pmdp)) {
>
> can this ever not be the case?

This function is not used in this patch. Look into the next one.

[...]
>>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
>> index 301e466e4263..7e1c17b1a24a 100644
>> --- a/arch/s390/mm/pgtable.c
>> +++ b/arch/s390/mm/pgtable.c
>> @@ -405,6 +405,8 @@ pmd_t pmdp_xchg_direct(struct mm_struct *mm, unsigned long addr,
>>  	pmd_t old;
>>
>>  	preempt_disable();
>> +	if (mm_has_pgste(mm))
>
> I am starting to wonder if mm_has_pgste(mm) is the right thing to check
> for. With huge pages we might even be able to start VMs completely
> without PGSTE. Right now this is an indication that "this is used by KVM".
>
> Would something like "mm_has_gmap()" be more clear?
Yes, after your allocate_pgste patch and considering hlp we should
rename the function and the mm context variable. Still, such a patch
will not be part of this patchset, and I'd appreciate it if we could
schedule it after its integration.

>> +		pmdp_notify(mm, addr);
>>  	old = pmdp_flush_direct(mm, addr, pmdp);
>>  	*pmdp = new;
>>  	preempt_enable();
>> @@ -418,6 +420,8 @@ pmd_t pmdp_xchg_lazy(struct mm_struct *mm, unsigned long addr,
>>  	pmd_t old;
>>
>>  	preempt_disable();
>> +	if (mm_has_pgste(mm))
>> +		pmdp_notify(mm, addr);
>>  	old = pmdp_flush_lazy(mm, addr, pmdp);
>>  	*pmdp = new;
>>  	preempt_enable();
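On the mm_has_gmap() naming question above, a purely illustrative
sketch of what such a helper could look like, keying off the existing
gmap_list in the s390 mm context. Whether an unlocked list_empty()
check is sufficient here is an assumption, not something this series
establishes:

/* Illustrative only: possible replacement for the mm_has_pgste() check */
static inline bool mm_has_gmap(struct mm_struct *mm)
{
	return !list_empty(&mm->context.gmap_list);
}

pmdp_xchg_direct()/pmdp_xchg_lazy() would then do
"if (mm_has_gmap(mm)) pmdp_notify(mm, addr);" instead of testing
mm_has_pgste().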