Re: [PATCH 06/36] HMM: add HMM page table v2.

Mark Hairgrove <mhairgrove@xxxxxxxxxx> · Thu, 18 Jun 2015 19:06:08 -0700

On Thu, 21 May 2015, j.glisse@xxxxxxxxx wrote:

> From: Jérôme Glisse <jglisse@xxxxxxxxxx>
> 
> Heterogeneous memory management main purpose is to mirror a process address.
> To do so it must maintain a secondary page table that is use by the device
> driver to program the device or build a device specific page table.
> 
> Radix tree can not be use to create this secondary page table because HMM
> needs more flags than RADIX_TREE_MAX_TAGS (while this can be increase we
> believe HMM will require so much flags that cost will becomes prohibitive
> to others users of radix tree).
> 
> Moreover radix tree is built around long but for HMM we need to store dma
> address and on some platform sizeof(dma_addr_t) > sizeof(long). Thus radix
> tree is unsuitable to fulfill HMM requirement hence why we introduce this
> code which allows to create page table that can grow and shrink dynamicly.
> 
> The design is very clause to CPU page table as it reuse some of the feature

s/clause/close

> such as spinlock embedded in struct page.
> 
> Changed since v1:
>   - Use PAGE_SHIFT as shift value to reserve low bit for private device
>     specific flags. This is to allow device driver to use and some of the
>     lower bits for their own device specific purpose.
>   - Add set of helper for atomically clear, setting and testing bit on
>     dma_addr_t pointer. Atomicity being usefull only for dirty bit.
>   - Differentiate btw DMA mapped entry and non mapped entry (pfn).
>   - Split page directory entry and page table entry helpers.
> 
> Signed-off-by: Jérôme Glisse <jglisse@xxxxxxxxxx>
> Signed-off-by: Sherry Cheung <SCheung@xxxxxxxxxx>
> Signed-off-by: Subhash Gutti <sgutti@xxxxxxxxxx>
> Signed-off-by: Mark Hairgrove <mhairgrove@xxxxxxxxxx>
> Signed-off-by: John Hubbard <jhubbard@xxxxxxxxxx>
> Signed-off-by: Jatin Kumar <jakumar@xxxxxxxxxx>
> ---
>  MAINTAINERS            |   2 +
>  include/linux/hmm_pt.h | 380 +++++++++++++++++++++++++++++++++++++++++++
>  mm/Makefile            |   2 +-
>  mm/hmm_pt.c            | 425 +++++++++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 808 insertions(+), 1 deletion(-)
>  create mode 100644 include/linux/hmm_pt.h
>  create mode 100644 mm/hmm_pt.c
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 2f2a2be..8cd0aa7 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -4736,6 +4736,8 @@ L:	linux-mm@xxxxxxxxx
>  S:	Maintained
>  F:	mm/hmm.c
>  F:	include/linux/hmm.h
> +F:	mm/hmm_pt.c
> +F:	include/linux/hmm_pt.h
>  
>  HOST AP DRIVER
>  M:	Jouni Malinen <j@xxxxx>
> diff --git a/include/linux/hmm_pt.h b/include/linux/hmm_pt.h
> new file mode 100644
> index 0000000..330edb2
> --- /dev/null
> +++ b/include/linux/hmm_pt.h
> @@ -0,0 +1,380 @@
> +/*
> [...]
> +
> +static inline dma_addr_t hmm_pde_from_pfn(dma_addr_t pfn)
> +{
> +	return (pfn << PAGE_SHIFT) | HMM_PDE_VALID;
> +}
> +
> +static inline unsigned long hmm_pde_pfn(dma_addr_t pde)
> +{
> +	return (pde & HMM_PDE_VALID) ? pde >> PAGE_SHIFT : 0;
> +}
> +

Does hmm_pde_pfn return a dma_addr_t pfn or a system memory pfn?

The types between these two functions don't match. According to 
hmm_pde_from_pfn, both the pde and the pfn are supposed to be dma_addr_t. 
But hmm_pde_pfn returns an unsigned long as a pfn instead of a dma_addr_t. 
If hmm_pde_pfn sometimes used to get a dma_addr_t pfn then shouldn't it 
also return a dma_addr_t, since as you pointed out in the commit message, 
dma_addr_t might be bigger than an unsigned long?

> [...]
> +
> +static inline dma_addr_t hmm_pte_from_pfn(dma_addr_t pfn)
> +{
> +	return (pfn << PAGE_SHIFT) | (1 << HMM_PTE_VALID_PFN_BIT);
> +}
> +
> +static inline unsigned long hmm_pte_pfn(dma_addr_t pte)
> +{
> +	return hmm_pte_test_valid_pfn(&pte) ? pte >> PAGE_SHIFT : 0;
> +}

Same question as hmm_pde_pfn above.

> [...]
> +/* struct hmm_pt_iter - page table iterator states.
> + *
> + * @ptd: Array of directory struct page pointer for each levels.
> + * @ptdp: Array of pointer to mapped directory levels.
> + * @dead_directories: List of directories that died while walking page table.
> + * @cur: Current address.
> + */
> +struct hmm_pt_iter {
> +	struct page		*ptd[HMM_PT_MAX_LEVEL - 1];
> +	dma_addr_t		*ptdp[HMM_PT_MAX_LEVEL - 1];

These are sized to be HMM_PT_MAX_LEVEL - 1 rather than HMM_PT_MAX_LEVEL 
because the iterator doesn't store the top level, correct? This results in 
a lot of "level - 1" and "level - 2" logic when dealing with the iterator. 
Have you considered keeping the levels consistent to get rid of all the 
extra offset-by-1 logic?

> +	struct list_head	dead_directories;
> +	unsigned long		cur;
> +};
> [...]
> +
> +/* hmm_pt_protect_directory_unref() - reference a directory.

s/unref/ref

> + *
> + * @iter: Iterator states that currently protect the directory.
> + * @level: Level of the directory to reference.
> + *
> + * This function will reference a directory but it is illegal for refcount to
> + * be 0 as this helper should only be call when iterator is protecting the
> + * directory (ie iterator hold a reference for the directory).
> + *
> + * HMM user will call this with level = pt.llevel any other value is supicious
> + * outside of hmm_pt code.
> + */
> +static inline void hmm_pt_iter_directory_ref(struct hmm_pt_iter *iter,
> +					     char level)
> +{
> +	/* Nothing to do for root level. */
> +	if (!level)
> +		return;
> +
> +	if (!atomic_inc_not_zero(&iter->ptd[level - 1]->_mapcount))
> +		/* Illegal this should not happen. */
> +		BUG();
> +}
> +
> [...]
> +
> +int hmm_pt_init(struct hmm_pt *pt)
> +{
> +	unsigned directory_shift, i = 0, npgd;
> +
> +	pt->last &= PAGE_MASK;
> +	spin_lock_init(&pt->lock);
> +	/* Directory shift is the number of bits that a single directory level
> +	 * represent. For instance if PAGE_SIZE is 4096 and each entry takes 8
> +	 * bytes (sizeof(dma_addr_t) == 8) then directory_shift = 9.
> +	 */
> +	directory_shift = PAGE_SHIFT - ilog2(sizeof(dma_addr_t));
> +	/* Level 0 is the root level of the page table. It might use less
> +	 * bits than directory_shift but all sub-directory level will use all
> +	 * directory_shift bits.
> +	 *
> +	 * For instance if hmm_pt.last == (1 << 48), PAGE_SHIFT == 12 and

This example should say that hmm_pt.last == (1 << 48) - 1, since last is 
inclusive. Otherwise llevel will be 4.

> +	 * sizeof(dma_addr_t) == 8 then :
> +	 *   directory_shift = 9
> +	 *   shift[0] = 39
> +	 *   shift[1] = 30
> +	 *   shift[2] = 21
> +	 *   shift[3] = 12
> +	 *   llevel = 3
> +	 *
> +	 * Note that shift[llevel] == PAGE_SHIFT because the last level
> +	 * correspond to the page table entry level (ignoring the case of huge
> +	 * page).
> +	 */
> +	pt->shift[0] = ((__fls(pt->last >> PAGE_SHIFT) / directory_shift) *
> +			directory_shift) + PAGE_SHIFT;
> +	while (pt->shift[i++] > PAGE_SHIFT)
> +		pt->shift[i] = pt->shift[i - 1] - directory_shift;
> +	pt->llevel = i - 1;
> +	pt->directory_mask = (1 << directory_shift) - 1;
> +
> +	for (i = 0; i <= pt->llevel; ++i)
> +		pt->mask[i] = ~((1UL << pt->shift[i]) - 1);
> +
> +	npgd = (pt->last >> pt->shift[0]) + 1;
> +	pt->pgd = kzalloc(npgd * sizeof(dma_addr_t), GFP_KERNEL);
> +	if (!pt->pgd)
> +		return -ENOMEM;
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL(hmm_pt_init);

Does this need to be EXPORT_SYMBOL? It seems like a driver would never 
need to call this, only core hmm. Same question for hmm_pt_fini.

> [...]
> +
> +/* hmm_pt_init() - initialize iterator states.

This should say hmm_pt_iter_init.

> + *
> + * @iter: Iterator states.
> + *
> + * This function will initialize iterator states. It must always be pair with a
> + * call to hmm_pt_iter_fini().
> + */
> +void hmm_pt_iter_init(struct hmm_pt_iter *iter)
> +{
> +	memset(iter->ptd, 0, sizeof(void *) * (HMM_PT_MAX_LEVEL - 1));
> +	memset(iter->ptdp, 0, sizeof(void *) * (HMM_PT_MAX_LEVEL - 1));

The memset sizes can simply be sizeof(iter->ptd) and sizeof(iter->ptdp).

> +	INIT_LIST_HEAD(&iter->dead_directories);
> +}
> +EXPORT_SYMBOL(hmm_pt_iter_init);
> +
> +/* hmm_pt_iter_directory_unref_safe() - unref a directory that is safe to free.
> + *
> + * @iter: Iterator states.
> + * @pt: HMM page table.
> + * @level: Level of the directory to unref.
> + *
> + * This function will unreference a directory and add it to dead list if
> + * directory no longer have any reference. It will also clear the entry to
> + * that directory into the upper level directory as well as dropping ref
> + * on the upper directory.
> + */
> +static void hmm_pt_iter_directory_unref_safe(struct hmm_pt_iter *iter,
> +					     struct hmm_pt *pt,
> +					     unsigned level)
> +{
> +	struct page *upper_ptd;
> +	dma_addr_t *upper_ptdp;
> +
> +	/* Nothing to do for root level. */
> +	if (!level)
> +		return;
> +
> +	if (!atomic_dec_and_test(&iter->ptd[level - 1]->_mapcount))
> +		return;
> +
> +	upper_ptd = level > 1 ? iter->ptd[level - 2] : NULL;
> +	upper_ptdp = level > 1 ? iter->ptdp[level - 2] : pt->pgd;
> +	upper_ptdp = &upper_ptdp[hmm_pt_index(pt, iter->cur, level - 1)];
> +	hmm_pt_directory_lock(pt, upper_ptd, level - 1);
> +	/*
> +	 * There might be race btw decrementing reference count on a directory
> +	 * and another thread trying to fault in a new directory. To avoid
> +	 * erasing the new directory entry we need to check that the entry
> +	 * still correspond to the directory we are removing.
> +	 */
> +	if (hmm_pde_pfn(*upper_ptdp) == page_to_pfn(iter->ptd[level - 1]))
> +		*upper_ptdp = 0;
> +	hmm_pt_directory_unlock(pt, upper_ptd, level - 1);
> +
> +	/* Add it to delayed free list. */
> +	list_add_tail(&iter->ptd[level - 1]->lru, &iter->dead_directories);
> +
> +	/*
> +	 * The upper directory is not safe to unref as we have an extra ref and

This should be "IS safe to unref", correct?

> +	 * thus refcount should not reach 0.
> +	 */
> +	hmm_pt_iter_directory_unref(iter, level - 1);
> +}
> [...]
> +/* hmm_pt_iter_fini() - finalize iterator.
> + *
> + * @iter: Iterator states.
> + * @pt: HMM page table.
> + *
> + * This function will cleanup iterator by unmapping and unreferencing any
> + * directory still mapped and referenced. It will also free any dead directory.
> + */
> +void hmm_pt_iter_fini(struct hmm_pt_iter *iter, struct hmm_pt *pt)
> +{
> +	struct page *ptd, *tmp;
> +	unsigned i;
> +
> +	for (i = pt->llevel; i >= 1; --i) {
> +		if (!iter->ptd[i - 1])
> +			continue;
> +		hmm_pt_iter_unprotect_directory(iter, pt, i);
> +	}
> +
> +	/* Avoid useless synchronize_rcu() if there is no directory to free. */
> +	if (list_empty(&iter->dead_directories))
> +		return;
> +
> +	/*
> +	 * Some iterator may have dereferenced a dead directory entry and looked
> +	 * up the struct page but haven't check yet the reference count. As all
> +	 * the above happen in rcu read critical section we know that we need
> +	 * to wait for grace period before being able to free any of the dead
> +	 * directory page.
> +	 */
> +	synchronize_rcu();
> +	list_for_each_entry_safe(ptd, tmp, &iter->dead_directories, lru) {
> +		list_del(&ptd->lru);
> +		atomic_set(&ptd->_mapcount, -1);
> +		__free_page(ptd);
> +	}
> +}

If I'm following this correctly, a migrate to the device will allocate HMM 
page tables and the subsequent migrate from the device will free them. 
Assuming that's the case, might thrashing of page allocations be a 
problem? What about keeping the HMM page tables around until the actual 
munmap() of the corresponding VA range?