Re: [PATCH v3 1/3] mm: introduce fincore()

Naoya Horiguchi <n-horiguchi@xxxxxxxxxxxxx> · Tue, 8 Jul 2014 16:41:32 -0400

On Tue, Jul 08, 2014 at 12:42:58PM -0700, Dave Hansen wrote:
> On 07/08/2014 12:03 PM, Naoya Horiguchi wrote:
> >> > The biggest question for me, though, is whether we want to start
> >> > designing these per-page interfaces to consider different page sizes, or
> >> > whether we're going to just continue to pretend that the entire world is
> >> > 4k pages.  Using FINCORE_BMAP on 1GB hugetlbfs files would be a bit
> >> > silly, for instance.
> > I didn't answer this question, sorry.
> > 
> > In my opinion, hugetlbfs pages should be handled as one hugepage (not as
> > many 4kB pages) to avoid lots of meaningless data transfer, as you pointed
> > out. And the current patch already works like that.
> 
> Just reading the code, I don't see any way that pc_shift gets passed
> down in to the do_fincore() loop.

There is no need to pass it down, because operations on the page cache tree
use the page index internally to identify the in-file position and don't
care about the page size. In a 2MB hugetlbfs file, for example, index 1 means
byte offset 2MB (not offset 4kB). So radix_tree_for_each_slot() advances
iter.index like 0 -> 1 -> 2 ... (instead of 0 -> 512 -> 1024 ...).
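
To make the unit argument concrete, here is a minimal userspace sketch.
index_to_offset() is a made-up helper for illustration, not something
from the patch; page_shift is assumed to be log2 of the page size
backing the file (12 for 4kB pages, 21 for 2MB hugetlbfs pages).

```c
#include <stdio.h>

/*
 * Hypothetical helper (not part of the patch): byte offset in the file
 * that a page cache index refers to, for a given page shift.
 */
static unsigned long long index_to_offset(unsigned long index,
					  unsigned int page_shift)
{
	return (unsigned long long)index << page_shift;
}

/* Print what consecutive indices mean for both page sizes. */
static void show_offsets(unsigned long nr_indices)
{
	for (unsigned long index = 0; index < nr_indices; index++)
		printf("index %lu: 4kB file offset %llu, 2MB file offset %llu\n",
		       index, index_to_offset(index, 12),
		       index_to_offset(index, 21));
}
```

The index itself is the same in both cases; only the byte offset it
stands for depends on the page size.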

>  I don't see it getting reflected in
> to 'nr' or 'nr_pages' in there, and I can't see how:
> 
> 	jump = iter.index - fc->pgstart - nr;
> 
> can possibly be right since iter.index is being kept against the offset
> in the userspace buffer (4k pages) and 'nr' and fc->pgstart are
> essentially done in the huge page size.

... so iter.index, fc->pgstart, and nr are all in the same unit, the page
index (in hugepage size).
This is a pure index calculation, and do_fincore() works exactly the same
for 4kB pages and hugetlbfs pages.
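
As a rough illustration of why the calculation is unit-free, below is a
userspace model of that loop. It is only a sketch under my own
assumptions, not the code in the patch: fill_bmap() and the present[]
array (sorted indices found in the tree) are hypothetical, and the
buffer is assumed to hold one byte per page index.

```c
#include <stddef.h>
#include <string.h>

/*
 * Userspace model, not the kernel code: write one byte per page index
 * in [pgstart, pgstart + nr_pages). Returns the number of resident
 * pages recorded. Nothing here depends on the page size.
 */
static size_t fill_bmap(unsigned char *buf, unsigned long pgstart,
			size_t nr_pages, const unsigned long *present,
			size_t nr_present)
{
	size_t nr = 0;		/* buffer entries written so far */
	size_t resident = 0;

	for (size_t i = 0; i < nr_present && nr < nr_pages; i++) {
		if (present[i] < pgstart + nr)
			continue;	/* before the window or already seen */

		/* number of absent indices skipped since the last entry */
		size_t jump = present[i] - pgstart - nr;

		if (jump >= nr_pages - nr)
			break;		/* beyond the requested range */

		memset(buf + nr, 0, jump);
		buf[nr + jump] = 1;
		nr += jump + 1;
		resident++;
	}
	memset(buf + nr, 0, nr_pages - nr);	/* trailing holes */
	return resident;
}
```

Whether an index stands for 4kB or for 1GB never enters the arithmetic,
which is the point above.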

> If you had a 2-page 1GB-hpage_size() hugetlbfs file, you would only have
> two pages in the radix tree, and only two iterations of
> radix_tree_for_each_slot().

Correct.

>  It would only set the first two bytes of a
> 256k BMAP buffer since only two pages were encountered in the radix tree.

Hmm, this example shows me a problem, thanks.

If the user knows that the fd refers to a 1GB hugetlbfs file, they just
prepare a 2-byte buffer, so there is no problem.
But if the user doesn't know whether the fd comes from a hugetlbfs file,
they must prepare a large buffer, even though only the first few bytes
are used. More problematic is that the user could interpret the data in
the buffer in two different ways:
  1. only the first two 4kB pages in the 2GB range are loaded,
  2. two 1GB pages are loaded.
So for such callers, fincore() must report the relevant page size
in some way on return.
Returning it via fincore_extra is my first thought, but I'm not sure
whether it's elegant enough.
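
The ambiguity can be sketched from the caller's side as follows. Both
helpers are hypothetical userspace code, not a proposed interface; they
assume one byte per page in the buffer and that the caller somehow
learns page_shift (log2 of the page size), which is exactly the piece
the current patch does not report.

```c
#include <stddef.h>

/*
 * Hypothetical caller-side helper: how many bytes of the file are
 * resident, given a one-byte-per-page map and the page shift.
 */
static unsigned long long resident_bytes(const unsigned char *bmap,
					 size_t nr_entries,
					 unsigned int page_shift)
{
	unsigned long long bytes = 0;

	for (size_t i = 0; i < nr_entries; i++)
		if (bmap[i])
			bytes += 1ULL << page_shift;
	return bytes;
}

/* Number of map entries needed to cover range_bytes (rounded up). */
static size_t bmap_entries(unsigned long long range_bytes,
			   unsigned int page_shift)
{
	unsigned long long page_size = 1ULL << page_shift;

	return (size_t)((range_bytes + page_size - 1) >> page_shift);
}
```

With the same two set bytes, resident_bytes() gives 8kB for
page_shift 12 and 2GB for page_shift 30, so the buffer alone cannot
tell the two cases apart.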

Thanks,
Naoya Horiguchi
