Re: [PATCH] x86 get_unmapped_area: Add PMD alignment for DAX PMD mmap

Toshi Kani <toshi.kani@xxxxxxx> · Wed, 06 Apr 2016 11:44:32 -0600

On Wed, 2016-04-06 at 12:50 -0400, Matthew Wilcox wrote:
> On Wed, Apr 06, 2016 at 07:58:09AM -0600, Toshi Kani wrote:
> > 
> > When CONFIG_FS_DAX_PMD is set, DAX supports mmap() using PMD page
> > size.  This feature relies on both mmap virtual address and FS
> > block data (i.e. physical address) to be aligned by the PMD page
> > size.  Users can use mkfs options to specify FS to align block
> > allocations.  However, aligning mmap() address requires application
> > changes to mmap() calls, such as:
> > 
> >  -  /* let the kernel to assign a mmap addr */
> >  -  mptr = mmap(NULL, fsize, PROT_READ|PROT_WRITE, FLAGS, fd, 0);
> > 
> >  +  /* 1. obtain a PMD-aligned virtual address */
> >  +  ret = posix_memalign(&mptr, PMD_SIZE, fsize);
> >  +  if (!ret)
> >  +    free(mptr);  /* 2. release the virt addr */
> >  +
> >  +  /* 3. then pass the PMD-aligned virt addr to mmap() */
> >  +  mptr = mmap(mptr, fsize, PROT_READ|PROT_WRITE, FLAGS, fd, 0);
> > 
> > These changes add unnecessary dependency to DAX and PMD page size
> > into application code.  The kernel should assign a mmap address
> > appropriate for the operation.
>
> I question the need for this patch.  Choosing an appropriate base address
> is the least of the changes needed for an application to take advantage
> of DAX.  

An application also needs to make sure that the given range
[base, base+size) is free, i.e. not covered by any existing VMA.  The above
example uses posix_memalign() to find such a range, which in turn calls
mmap() with a size of (fsize + PMD_SIZE) in this case.
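
For illustration, here is a rough userspace sketch of what an application
has to do today to get a PMD-aligned mapping without the free()/mmap()
window of the posix_memalign() example.  It is not taken from NVML or from
the patch.  The names SKETCH_PMD_SIZE, align_up() and mmap_pmd_aligned()
are made up for this mail, and the 2 MiB PMD size is an assumption that
holds for x86-64 with 4 KiB base pages.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE		/* for MAP_ANONYMOUS on glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Assumption: 2 MiB PMD pages (x86-64, 4 KiB base pages). */
#define SKETCH_PMD_SIZE ((uintptr_t)2 << 20)

/* Round v up to a multiple of align; align must be a power of two. */
static uintptr_t align_up(uintptr_t v, uintptr_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (v + align - 1) & ~(align - 1);
}

/* Map fsize bytes of fd (offset 0) at a PMD-aligned virtual address. */
static void *mmap_pmd_aligned(size_t fsize, int prot, int flags, int fd)
{
	size_t len = align_up(fsize, (uintptr_t)sysconf(_SC_PAGESIZE));
	size_t span = len + SKETCH_PMD_SIZE;

	/* 1. Reserve an oversized, inaccessible range. */
	char *reserve = mmap(NULL, span, PROT_NONE,
			     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (reserve == MAP_FAILED)
		return MAP_FAILED;

	/* 2. Map over the aligned part; MAP_FIXED replaces the reservation. */
	char *aligned = (char *)align_up((uintptr_t)reserve, SKETCH_PMD_SIZE);
	void *p = mmap(aligned, fsize, prot, flags | MAP_FIXED, fd, 0);
	if (p == MAP_FAILED) {
		munmap(reserve, span);
		return MAP_FAILED;
	}

	/* 3. Give back the unused head and tail of the reservation. */
	size_t head = (size_t)(aligned - reserve);
	if (head)
		munmap(reserve, head);
	size_t tail = span - head - len;
	if (tail)
		munmap(aligned + len, tail);
	return p;
}
```

A caller would use mmap_pmd_aligned(fsize, PROT_READ|PROT_WRITE, FLAGS, fd)
in place of the plain mmap() call.  Either way, the application ends up
carrying PMD-size knowledge that it should not need.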

> The NVML chooses appropriate addresses and gets a properly aligned
> address without any kernel code.

An application like NVML can continue to pass a specific address to
mmap().  Most existing applications, however, do not pass an address to
mmap().  With this patch, specifying an address will remain optional.

> > Change arch_get_unmapped_area() and arch_get_unmapped_area_topdown()
> > to request PMD_SIZE alignment when the request is for a DAX file and
> > its mapping range is large enough for using a PMD page.
>
> I think this is the wrong place for it, if we decide that this is the
> right thing to do.  The filesystem has a get_unmapped_area() which
> should be used instead.

Yes, I considered adding a filesystem entry point, but decided to go this
way because:
 - arch_get_unmapped_area() and arch_get_unmapped_area_topdown() are
arch-specific code.  Therefore, this filesystem entry point would need an
arch-specific implementation.
 - There is nothing filesystem specific about requesting PMD alignment.
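
To illustrate the second point, below is a small userspace model of the
alignment arithmetic that an align_mask / align_offset pair expresses.  It
is only a model and not the patch code (the rest of the hunk is not quoted
in this mail).  The struct model_info, the MODEL_* constants and both
function names are invented here, and the 4 KiB page / 2 MiB PMD sizes are
assumptions for x86-64.  Nothing in it depends on which filesystem backs
the file.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12			/* assumption: 4 KiB pages */
#define MODEL_PMD_SIZE   ((uint64_t)1 << 21)	/* assumption: 2 MiB PMD */

/* Hypothetical stand-in for the alignment fields of the search request. */
struct model_info {
	uint64_t length;
	uint64_t align_mask;
	uint64_t align_offset;
};

/* Request PMD alignment only for a DAX file with a large enough range. */
static void model_request_pmd_align(struct model_info *info, int is_dax,
				    uint64_t pgoff)
{
	if (is_dax && info->length >= MODEL_PMD_SIZE) {
		info->align_mask = MODEL_PMD_SIZE - 1;
		info->align_offset = pgoff << MODEL_PAGE_SHIFT;
	}
}

/*
 * Lowest address >= gap_start that is congruent to align_offset modulo
 * (align_mask + 1), so the virtual address and the file offset share the
 * same position within a PMD.
 */
static uint64_t model_align_gap(uint64_t gap_start,
				const struct model_info *info)
{
	return gap_start + ((info->align_offset - gap_start) & info->align_mask);
}
```

With a zero file offset the chosen address is simply rounded up to the next
2 MiB boundary.  With a non-zero offset the address keeps the same low bits
as the offset, which is what a PMD mapping needs.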

> > 
> > @@ -157,6 +157,13 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
> >  		info.align_mask = get_align_mask();
> >  		info.align_offset += get_align_bits();
> >  	}
> > +	if (filp && IS_ENABLED(CONFIG_FS_DAX_PMD) && IS_DAX(file_inode(filp))) {
>
> And there's never a need for the IS_ENABLED.  IS_DAX() compiles to '0' if
> CONFIG_FS_DAX is disabled.

CONFIG_FS_DAX_PMD can be disabled while CONFIG_FS_DAX is enabled.

> And where would this end?  Would you also change this code to look for
> 1GB entries if CONFIG_FS_DAX_PUD is enabled?  Far better to have this
> in the individual filesystem (probably calling a common helper in the DAX
> code).

Yes, it can be easily extended to support PUD.  This avoids another round
of application changes to align with the PUD size.

If the PUD support turns out to be filesystem specific, we may need a
capability bit in addition to CONFIG_FS_DAX_PUD.
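
As a rough sketch of how such an extension could look, the helper below
picks the largest alignment that the mapping length allows.  This is a
userspace illustration only.  CONFIG_FS_DAX_PUD is hypothetical in this
thread, the pmd_enabled / pud_enabled arguments stand in for the config
options (or a capability bit), and the 2 MiB / 1 GiB sizes are assumptions
for x86-64.

```c
#include <assert.h>
#include <stdint.h>

#define PICK_PMD_SIZE ((uint64_t)1 << 21)	/* assumption: 2 MiB */
#define PICK_PUD_SIZE ((uint64_t)1 << 30)	/* assumption: 1 GiB */

/* Return the alignment mask to request, or 0 for no extra alignment. */
static uint64_t pick_align_mask(int is_dax, uint64_t len,
				int pmd_enabled, int pud_enabled)
{
	if (!is_dax)
		return 0;
	/* Prefer the largest page size that the range can actually hold. */
	if (pud_enabled && len >= PICK_PUD_SIZE)
		return PICK_PUD_SIZE - 1;
	if (pmd_enabled && len >= PICK_PMD_SIZE)
		return PICK_PMD_SIZE - 1;
	return 0;
}
```

The two switches are independent, so PMD alignment can still be requested
when PUD support is off, and applications do not need to change in either
case.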

Thanks,
-Toshi
