On Mon, Nov 22, 2021 at 04:00:19PM -0800, Mina Almasry wrote:
> On Wed, Nov 17, 2021 at 4:34 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
> >
> > Hi, Mina,
> >
> > On Wed, Nov 17, 2021 at 11:48:54AM -0800, Mina Almasry wrote:
> > > Add PM_THP_MAPPED to allow userspace to detect whether a given virt
> > > address is currently mapped by a transparent huge page or not. An
> > > example use case is a process requesting THPs from the kernel (via a
> > > huge tmpfs mount, for example) for a performance-critical region of
> > > memory. The userspace may want to query whether the kernel is actually
> > > backing this memory with hugepages or not.
> > >
> > > The PM_THP_MAPPED bit is set if the virt address is mapped at the PMD
> > > level and the underlying page is a transparent huge page.
> > >
> > > A few options were considered:
> > > 1. Add /proc/pid/pageflags that exports the same info as
> > >    /proc/kpageflags. This is not appropriate because many kpageflags
> > >    are inappropriate to expose to userspace processes.
> > > 2. Simply get this info from the existing /proc/pid/smaps interface.
> > >    There are a couple of issues with that:
> > >    1. /proc/pid/smaps output is human readable and unfriendly to
> > >       parse programmatically.
> > >    2. /proc/pid/smaps is slow. The cost of reading /proc/pid/smaps
> > >       into userspace buffers is about ~800us per call, and this
> > >       doesn't include parsing the output to get the information you
> > >       need. The cost of querying 1 virt address in /proc/pid/pagemap,
> > >       however, is around 5-7us.
> >
> > This does not seem to be fair... Shouldn't the "800us" value be relative
> > to the size of the process memory being mapped? smaps requires walking
> > the whole memory range, and it provides all the stat info including the
> > THP accountings.
> >
> > Querying 1 virt address, meanwhile, can account for at most 1 single THP.
> >
> > It means if we want to do all the THP accounting for the whole range
> > from pagemap we need multiple read()s, right? A fair comparison should
> > compare the sum of all the read()s over the virt addrs we care about
> > against a single smaps call.
> >
> > And it's hard to be fair too, IMHO, because all of these depend on the
> > size of the mm.
> >
> > smaps is, logically, faster because of two things:
> >
> >   - smaps needs only one syscall for whatever memory range (so one
> >     user->kernel->user switch).
> >
> >     Compared to your pagemap use case, you'll need to read 1 virt
> >     address for each PMD, so when the PMD-mapped size is huge, it could
> >     turn out that smaps is faster even counting in the parsing time of
> >     the smaps output.
> >
> >   - smaps knows how to handle things at the PMD level without looping
> >     into PTEs: see smaps_pmd_entry().
> >
> >     smaps will not duplicate the PMD entry into 512 PTE entries, because
> >     smaps is doing statistics (and that's exactly what your use case
> >     wants!), while pagemap is defined at 4K page granularity even for
> >     huge mappings, because the entry layout is defined upon the offset
> >     into the pagemap fd; that's why it needs to duplicate the 2M entry
> >     into 512 identical ones, even though they're not really that
> >     meaningful.
> >
> > That's also why I tried to propose an smaps interface that allows
> > querying part of the memory range: IMHO it solves the exact problem
> > you're describing, and in the most efficient way; meanwhile it exposes
> > all the rest of the smaps data besides the THP accountings, so it could
> > help with more than just THP accounting.
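For context, the per-PMD pagemap query being compared above boils down to
one pread() per 2M region. A minimal userspace sketch of such a query (the
PM_THP_MAPPED bit position below is only a placeholder; substitute whatever
bit the patch ends up defining):

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Placeholder bit position; use the value the final patch defines. */
#define PM_THP_MAPPED_BIT	60

/* Returns 1 if vaddr is PMD-mapped by a THP, 0 if not, -1 on error. */
static int vaddr_is_thp_mapped(const void *vaddr)
{
	uint64_t entry;
	off_t off = (uintptr_t)vaddr / sysconf(_SC_PAGESIZE) * sizeof(entry);
	int fd = open("/proc/self/pagemap", O_RDONLY);
	int ret = -1;

	if (fd < 0)
		return -1;
	/* One 8-byte pagemap entry per 4K page; read the one for vaddr. */
	if (pread(fd, &entry, sizeof(entry), off) == sizeof(entry))
		ret = !!(entry & (1ULL << PM_THP_MAPPED_BIT));
	close(fd);
	return ret;
}

Covering, say, a 1G PMD-mapped range this way takes 512 such reads (one per
2M PMD); at the 5-7us per query quoted above, that's roughly 2.5-3.6ms,
i.e. already slower than a single ~800us smaps pass, which is the fairness
point above.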
> >
> > So again, no objection on this simple and working patch, but perhaps
> > rephrase the 2nd bullet as: "smaps is slow because it must read the
> > whole memory range rather than the small range we care about"?
>
> Sure thing, added in v7.
>
> If I'm coming across as resisting a range-based smaps, I don't mean
> to. I think it addresses my use case. I'm just warning/pointing out
> that:
>
> 1. It'll be a large(r than 2-line) patch and probably an added kernel
>    interface, and I'm not sure my use case is an acceptable justification
>    for that, given the problem can be addressed equally well, and very
>    simply, like I'm doing here. But if it is an acceptable justification,
>    then I'm fine with a range-based smaps.
>
> 2. I'm not 100% sure what the performance would be like, but I can
>    prototype it and let you know if I hit any performance issues. I just
>    need to know what interface you're envisioning for this.

Not sure whether an ioctl on the smaps procfs file could work.

> I'll upload a v7 with the commit message change, and let me know what
> you think!

Yeah, that's the only thing I'm asking for now, and sorry to be picky.

Thanks.

--
Peter Xu