On 09.02.2018 10:34, Janosch Frank wrote:
> For later migration of huge pages we want to write-protect guest
> PMDs. While doing this, we have to make absolutely sure, that the
> guest's lowcore is always accessible when the VCPU is running. With
> PTEs, this is solved by marking the PGSTEs of the lowcore pages with
> the invalidation notification bit and kicking the guest out of the SIE
> via a notifier function if we need to invalidate such a page.
>
> With PMDs we do not have PGSTEs or some other bits we could use in the
> host PMD. Instead we pick one of the free bits in the gmap PMD. Every
> time a host pmd will be invalidated, we will check if the respective
> gmap PMD has the bit set and in that case fire up the notifier.
>
> In the first step we only support setting the invalidation bit, but we
> do not support restricting access of guest pmds. It will follow
> shortly.
>
> Signed-off-by: Janosch Frank <frankja@xxxxxxxxxxxxxxxxxx>

[...]

> +
> +/**
> + * gmap_pmd_split - Split a huge gmap pmd and use a page table instead
> + * @gmap: pointer to guest mapping meta data structure
> + * @gaddr: virtual address in the guest address space
> + * @pmdp: pointer to the pmd that will be split
> + *
> + * When splitting gmap pmds, we have to make the resulting page table
> + * look like it's a normal one to be able to use the common pte
> + * handling functions. Also we need to track these new tables as they
> + * aren't tracked anywhere else.
> + */
> +static int gmap_pmd_split(struct gmap *gmap, unsigned long gaddr, pmd_t *pmdp)
> +{
> +        unsigned long *table;
> +        struct page *page;
> +        pmd_t new;
> +        int i;
> +

That's interesting, because the SIE can now suddenly work on these
PGSTEs, e.g. not leading to intercepts on certain events (like setting
storage keys).

How is that intended to be handled? I assume we would somehow have to
forbid the SIE from making use of the PGSTE. But that involves clearing
certain interception controls, which might be problematic.
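Just to illustrate the direction I am thinking of (a rough, untested
sketch; the helper name is made up and I have not checked which
controls are actually involved): as long as such a split page table can
sit below a guest segment, we would probably have to keep the storage
key instructions intercepted instead of letting the SIE resolve them
via these PGSTEs, e.g. something like

        /*
         * Untested sketch: force intercepts for ISKE/SSKE/RRBE so the
         * SIE never consults the PGSTEs of a split page table.
         */
        static void force_skey_intercepts(struct kvm_vcpu *vcpu)
        {
                vcpu->arch.sie_block->ictl |= ICTL_ISKE | ICTL_SSKE | ICTL_RRBE;
        }

plus something similar for the other facilities that are interpreted
based on PGSTE content. Whether we can afford that is exactly the part
that might be problematic.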
> +        page = page_table_alloc_pgste(gmap->mm);
> +        if (!page)
> +                return -ENOMEM;
> +        table = (unsigned long *) page_to_phys(page);
> +        for (i = 0; i < 256; i++) {
> +                table[i] = (pmd_val(*pmdp) & HPAGE_MASK) + i * PAGE_SIZE;
> +                /* pmd_large() implies pmd/pte_present() */
> +                table[i] |= _PAGE_PRESENT | _PAGE_READ | _PAGE_WRITE;
> +                /* ptes are directly marked as dirty */
> +                table[i + PTRS_PER_PTE] |= PGSTE_UC_BIT;
> +        }
> +
> +        pmd_val(new) = ((unsigned long)table | _SEGMENT_ENTRY |
> +                        (_SEGMENT_ENTRY_GMAP_SPLIT));
> +        list_add(&page->lru, &gmap->split_list);
> +        gmap_pmdp_xchg(gmap, pmdp, new, gaddr);
> +        return 0;
> +}
> +
>  /*
>   * gmap_protect_pte - remove access rights to memory and set pgste bits
>   * @gmap: pointer to guest mapping meta data structure
> @@ -941,7 +1021,7 @@ static int gmap_protect_pte(struct gmap *gmap, unsigned long gaddr,
>          spinlock_t *ptl = NULL;
>          unsigned long pbits = 0;
>
> -        ptep = pte_alloc_map_lock(gmap->mm, pmdp, gaddr, &ptl);
> +        ptep = gmap_pte_from_pmd(gmap, pmdp, gaddr, &ptl);
>          if (!ptep)
>                  return -ENOMEM;
>
> @@ -979,15 +1059,21 @@ static int gmap_protect_range(struct gmap *gmap, unsigned long gaddr,
>                  rc = -EAGAIN;
>                  pmdp = gmap_pmd_op_walk(gmap, gaddr);
>                  if (pmdp && !(pmd_val(*pmdp) & _SEGMENT_ENTRY_INVALID)) {
> -                        rc = gmap_protect_pte(gmap, gaddr, pmdp, prot,
> -                                              bits);
> -                        if (!rc) {
> -                                len -= PAGE_SIZE;
> -                                gaddr += PAGE_SIZE;
> +                        if (!pmd_large(*pmdp)) {
> +                                rc = gmap_protect_pte(gmap, gaddr, pmdp, prot,
> +                                                      bits);
> +                                if (!rc) {
> +                                        len -= PAGE_SIZE;
> +                                        gaddr += PAGE_SIZE;
> +                                }
> +                        } else {
> +                                rc = gmap_pmd_split(gmap, gaddr, pmdp);
> +                                if (!rc)
> +                                        rc = -EFAULT;

Not sure if -EFAULT is the right thing to do here. Actually this would
be a much better use case for -EAGAIN. (Or simply avoid this by doing a
gmap_pmd_op_end() + continue; see the rough sketch further down in this
mail.)

>                          }
>                          gmap_pmd_op_end(gmap, pmdp);
>                  }
> -                if (rc) {
> +                if (rc && rc != -EFAULT) {
>                          vmaddr = __gmap_translate(gmap, gaddr);
>                          if (IS_ERR_VALUE(vmaddr))
>                                  return vmaddr;
> @@ -2133,6 +2219,39 @@ static void gmap_shadow_notify(struct gmap *sg, unsigned long vmaddr,
>          spin_unlock(&sg->guest_table_lock);
>  }
>
> +/*
> + * ptep_notify_gmap - call all invalidation callbacks for a specific pte of a gmap
> + * @mm: pointer to the process mm_struct
> + * @addr: virtual address in the process address space
> + * @pte: pointer to the page table entry
> + * @bits: bits from the pgste that caused the notify call
> + *
> + * This function is assumed to be called with the guest_table_lock held.
> + */
> +void ptep_notify_gmap(struct mm_struct *mm, unsigned long vmaddr,
> +                      pte_t *pte, unsigned long bits)
> +{
> +        unsigned long offset, gaddr = 0;
> +        unsigned long *table;
> +        struct gmap *gmap;
> +
> +        offset = ((unsigned long) pte) & (255 * sizeof(pte_t));
> +        offset = offset * (4096 / sizeof(pte_t));
> +        rcu_read_lock();
> +        list_for_each_entry_rcu(gmap, &mm->context.gmap_list, list) {
> +                table = radix_tree_lookup(&gmap->host_to_guest,
> +                                          vmaddr >> PMD_SHIFT);
> +                if (table)
> +                        gaddr = __gmap_segment_gaddr(table) + offset;
> +                else
> +                        continue;
> +
> +                if (bits & PGSTE_IN_BIT)
> +                        gmap_call_notifier(gmap, gaddr, gaddr + PAGE_SIZE - 1);
> +        }
> +        rcu_read_unlock();
> +}
> +
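The offset calculation above took me a moment. If I read it correctly,
it boils down to "index of the pte within the page table, times
PAGE_SIZE", so (just a suggestion, equivalent as far as I can tell)
something like

        /* offset of the 4k page within the 1 MB segment */
        offset = (((unsigned long) pte) / sizeof(pte_t)) & (PTRS_PER_PTE - 1);
        offset <<= PAGE_SHIFT;

might be easier to read than going via 255 * sizeof(pte_t) and
4096 / sizeof(pte_t).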
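And to spell out the gmap_pmd_op_end() + continue variant I mentioned
above for gmap_protect_range() (untested, just to illustrate the idea):

        } else {
                /* split the huge pmd and retry with the new page table */
                rc = gmap_pmd_split(gmap, gaddr, pmdp);
                gmap_pmd_op_end(gmap, pmdp);
                if (rc)
                        return rc;
                continue;
        }

That way the next loop iteration simply walks the freshly split pmd and
goes through the ordinary gmap_protect_pte() path, and the special
"rc != -EFAULT" check below can go away again.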
>  /**
>   * ptep_notify - call all invalidation callbacks for a specific pte.
>   * @mm: pointer to the process mm_struct
> @@ -2177,6 +2296,23 @@ void ptep_notify(struct mm_struct *mm, unsigned long vmaddr,
>  }
>  EXPORT_SYMBOL_GPL(ptep_notify);
>
> +static void pmdp_notify_split(struct mm_struct *mm, unsigned long vmaddr,
> +                              unsigned long *table)
> +{
> +        int i = 0;
> +        unsigned long bits;
> +        unsigned long *ptep = (unsigned long *)(*table & PAGE_MASK);
> +        unsigned long *pgste = ptep + PTRS_PER_PTE;
> +
> +        for (; i < 256; i++, vmaddr += PAGE_SIZE, ptep++, pgste++) {
> +                bits = *pgste & (PGSTE_IN_BIT | PGSTE_VSIE_BIT);
> +                if (bits) {
> +                        *pgste ^= bits;
> +                        ptep_notify_gmap(mm, vmaddr, (pte_t *)ptep, bits);

All existing callers of ptep_notify_gmap() call it with the pgste lock
held. I guess we don't need that here, as we always take the
guest_table_lock when working with split PMDs.

Complicated stuff :)

> +                }
> +        }
> +}


-- 

Thanks,

David / dhildenb