On Mon, Nov 22, 2021 at 04:00:19PM -0800, Mina Almasry wrote:
> On Wed, Nov 17, 2021 at 4:34 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
> >
> > Hi, Mina,
> >
> > On Wed, Nov 17, 2021 at 11:48:54AM -0800, Mina Almasry wrote:
> > > Add PM_THP_MAPPED to allow userspace to detect whether a given virt
> > > address is currently mapped by a transparent huge page or not. An
> > > example use case is a process requesting THPs from the kernel (via a
> > > huge tmpfs mount, for example) for a performance-critical region of
> > > memory. The userspace may want to query whether the kernel is actually
> > > backing this memory with hugepages or not.
> > >
> > > The PM_THP_MAPPED bit is set if the virt address is mapped at the PMD
> > > level and the underlying page is a transparent huge page.
> > >
> > > A few options were considered:
> > > 1. Add /proc/pid/pageflags that exports the same info as
> > >    /proc/kpageflags. This is not appropriate because many kpageflags
> > >    are inappropriate to expose to userspace processes.
> > > 2. Simply get this info from the existing /proc/pid/smaps interface.
> > >    There are a couple of issues with that:
> > >    1. /proc/pid/smaps output is human readable and unfriendly to
> > >       parse programmatically.
> > >    2. /proc/pid/smaps is slow. The cost of reading /proc/pid/smaps
> > >       into userspace buffers is about ~800us per call, and this
> > >       doesn't include parsing the output to get the information you
> > >       need. The cost of querying 1 virt address in /proc/pid/pagemap,
> > >       however, is around 5-7us.
> >
> > This does not seem to be fair... Shouldn't the "800us" value be relative
> > to the size of the process memory being mapped? smaps requires walking
> > the whole memory range, and it provides all the stat info including the
> > THP accountings.
> >
> > Querying 1 virt address, meanwhile, can account for at most 1 single THP.
> >
> > It means if we want to do all the THP accounting for the whole range
> > from pagemap we need multiple read()s, right? A fair comparison should
> > compare the sum of all the read()s over the virt addrs we care about
> > against a single smaps call.
> >
> > And it's hard to be fair too, IMHO, because all of these depend on the
> > size of the mm.
> >
> > smaps is, logically, faster because of two things:
> >
> >   - smaps needs only one syscall for whatever memory range (so one
> >     user->kernel->user switch).
> >
> >     Compared to your pagemap use case, you'll need to read 1 virt
> >     address for each PMD, so when the PMD-mapped size is huge, it could
> >     turn out that smaps is faster even counting in the parsing time of
> >     the smaps output.
> >
> >   - smaps knows how to handle things at the PMD level without looping
> >     into PTEs: see smaps_pmd_entry().
> >
> >     smaps will not duplicate the PMD entry into 512 PTE entries, because
> >     smaps is doing statistics (and that's exactly what your use case
> >     wants!), while pagemap is defined at 4K page granularity even for
> >     huge mappings, because the entry layout is defined upon the offset
> >     into the pagemap fd; that's why it needs to duplicate the 2M entry
> >     into 512 identical ones, even though they're not really that
> >     meaningful.
> >
> > That's also why I tried to propose an smaps interface that allows
> > querying part of the memory range: IMHO it solves the exact problem
> > you're describing, and in the most efficient way; meanwhile it exposes
> > all the rest of the smaps data besides the THP accountings, so it could
> > help with more than just THP accounting.
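For context, the per-PMD pagemap query being compared above boils down to
one pread() per 2M region. A minimal userspace sketch of such a query (the
PM_THP_MAPPED bit position below is only a placeholder; substitute whatever
bit the patch ends up defining):

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Placeholder bit position; use the value the final patch defines. */
#define PM_THP_MAPPED_BIT	60

/* Returns 1 if vaddr is PMD-mapped by a THP, 0 if not, -1 on error. */
static int vaddr_is_thp_mapped(const void *vaddr)
{
	uint64_t entry;
	off_t off = (uintptr_t)vaddr / sysconf(_SC_PAGESIZE) * sizeof(entry);
	int fd = open("/proc/self/pagemap", O_RDONLY);
	int ret = -1;

	if (fd < 0)
		return -1;
	/* One 8-byte pagemap entry per 4K page; read the one for vaddr. */
	if (pread(fd, &entry, sizeof(entry), off) == sizeof(entry))
		ret = !!(entry & (1ULL << PM_THP_MAPPED_BIT));
	close(fd);
	return ret;
}

Covering, say, a 1G PMD-mapped range this way takes 512 such reads (one per
2M PMD); at the 5-7us per query quoted above, that's roughly 2.5-3.6ms,
i.e. already slower than a single ~800us smaps pass, which is the fairness
point above.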
> >
> > So again, no objection on this simple and working patch, but perhaps
> > rephrase the 2nd bullet as: "smaps is slow because it must read the
> > whole memory range rather than the small range we care about"?
>
> Sure thing, added in v7.
>
> If I'm coming across as resisting a range-based smaps, I don't mean
> to. I think it addresses my use case. I'm just warning/pointing out
> that:
>
> 1. It'll be a large(r than 2-line) patch and probably an added kernel
>    interface, and I'm not sure my use case is an acceptable justification
>    for that, given the problem can be addressed equally well, and very
>    simply, like I'm doing here. But if it is an acceptable justification,
>    then I'm fine with a range-based smaps.
>
> 2. I'm not 100% sure what the performance would be like, but I can
>    prototype it and let you know if I hit any performance issues. I just
>    need to know what interface you're envisioning for this.

Not sure whether an ioctl on the smaps procfs file could work.

> I'll upload a v7 with the commit message change, and let me know what
> you think!

Yeah, that's the only thing I'm asking for now, and sorry to be picky.

Thanks.

--
Peter Xu