Re: [PATCH V2 0/6] VA to numa node information

"prakash.sangappa" <prakash.sangappa@xxxxxxxxxx> · Tue, 18 Dec 2018 15:46:45 -0800

    On 11/26/2018 11:20 AM, Steven Sistare wrote:

      On 11/9/2018 11:48 PM, Prakash Sangappa wrote:

        Here is some data from pmap using the move_pages() API with optimization.
        The following table compares the time pmap takes to print the address
        mappings of a large process with NUMA node information, using the
        move_pages() API vs the /proc numa_vamaps file.

        The pmap command was run on a process with 1.3 TB of address space, with
        sparse mappings.

                                ~1.3 TB sparse    250G dense segment with hugepages
        move_pages              8.33s             3.14s
        optimized move_pages    6.29s             0.92s
        /proc numa_vamaps       0.08s             0.04s

        The second column is the pmap time on a 250G address range of this
        process, which maps hugepages (THP and hugetlb).

      The data look compelling to me.  numa_vamaps provides a much smoother user
      experience for the analyst who is casting a wide net looking for the root of
      a performance issue.  Almost no waiting to see the data.

      - Steve

    What do others think? How to proceed on this?

    Summarizing the discussion so far:

    The use case for getting VA (Virtual Address) to NUMA node information is
    performance analysis. Investigating performance issues involves looking at
    where a process's memory is allocated from (which NUMA node). For the user
    analyzing the issue, an efficient way to get this information is useful
    when looking at application processes with a large address space.

    The patch proposes adding a /proc/<pid>/numa_vamaps file that provides the
    VA to NUMA node id mapping of a process. The file reports address range to
    NUMA node id information. An address range with no pages mapped is
    indicated with '-' for the NUMA node id. Sample file content:

    00400000-00410000 N1
    00410000-0047f000 N0
    00480000-00481000 -
    00481000-004a0000 N0
    ..
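
    As an illustration of how a tool might consume this format, here is a
    minimal Python sketch that parses the sample lines above and totals the
    bytes per node. The names (VaRange, parse_numa_vamaps, bytes_per_node) are
    hypothetical, and the line grammar is assumed only from the sample shown;
    the actual file may carry more fields.

```python
import re
from typing import Dict, Iterable, List, NamedTuple, Optional


class VaRange(NamedTuple):
    start: int            # inclusive start address
    end: int              # exclusive end address
    node: Optional[int]   # NUMA node id, or None when shown as '-'


# Assumed grammar, taken from the sample only: "<hex>-<hex> N<id>" or "<hex>-<hex> -"
_LINE = re.compile(r"^([0-9a-fA-F]+)-([0-9a-fA-F]+)\s+(?:N(\d+)|-)$")


def parse_numa_vamaps(lines: Iterable[str]) -> List[VaRange]:
    """Parse numa_vamaps-style lines, skipping anything that does not match."""
    ranges = []
    for raw in lines:
        m = _LINE.match(raw.strip())
        if m is None:
            continue  # e.g. the ".." elision in the sample
        node = int(m.group(3)) if m.group(3) is not None else None
        ranges.append(VaRange(int(m.group(1), 16), int(m.group(2), 16), node))
    return ranges


def bytes_per_node(ranges: Iterable[VaRange]) -> Dict[Optional[int], int]:
    """Sum the size of the address ranges reported for each node."""
    totals: Dict[Optional[int], int] = {}
    for r in ranges:
        totals[r.node] = totals.get(r.node, 0) + (r.end - r.start)
    return totals


SAMPLE = """\
00400000-00410000 N1
00410000-0047f000 N0
00480000-00481000 -
00481000-004a0000 N0
..
"""

if __name__ == "__main__":
    for node, size in bytes_per_node(parse_numa_vamaps(SAMPLE.splitlines())).items():
        label = "unmapped" if node is None else f"N{node}"
        print(f"{label}: {size} bytes")
```

    The None key collects ranges reported with '-', so unmapped space is
    counted separately from any node.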

    Dave Hansen asked how it would scale with respect to reading this file from
    a large process. The answer is that the file contents are generated using a
    page table walk and copied to the user buffer. The mmap_sem lock is dropped
    and re-acquired in the process of walking the page table and copying the
    file content. The kernel buffer size used determines how long the lock is
    held. This can be further improved to drop the lock and re-acquire it after
    a fixed number (512) of pages are walked.
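
    The batching idea can be sketched in user space as follows. This is only
    an analogy, not the kernel code: a threading.Lock stands in for mmap_sem,
    a plain list stands in for the page table, and walk_in_batches is a
    hypothetical name. It shows the lock being held for at most 512 pages at
    a time.

```python
import threading
from typing import List, Sequence, Tuple

BATCH_PAGES = 512  # fixed number of pages walked per lock hold, as proposed


def walk_in_batches(page_nodes: Sequence[int],
                    lock: threading.Lock,
                    batch_pages: int = BATCH_PAGES) -> Tuple[List[int], int]:
    """Copy per-page node ids, releasing the lock between batches.

    Returns the collected node ids and the number of times the lock was taken.
    """
    results: List[int] = []
    acquisitions = 0
    for first in range(0, len(page_nodes), batch_pages):
        with lock:  # stand-in for taking mmap_sem
            acquisitions += 1
            results.extend(page_nodes[first:first + batch_pages])
        # The lock is released here, so a waiting writer can make progress
        # before the next batch starts.
    return results, acquisitions


if __name__ == "__main__":
    pages = [0] * 1000 + [1] * 300   # 1300 pages spread over two nodes
    nodes, taken = walk_in_batches(pages, threading.Lock())
    print(len(nodes), "pages walked with", taken, "lock acquisitions")
```

    Presumably the real walk would also have to revalidate its position after
    re-acquiring the lock, since the mappings can change while it is dropped;
    the sketch does not model that.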

    Also, with support for seeking to a specific VA of the process from where
    the VA to NUMA node information will be provided, the file offset is not
    taken into consideration. This behavior is unlike reading a normal file.
    Other /proc files (e.g. /proc/<pid>/pagemap) also have certain differences
    compared to reading a normal file.
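
    For comparison, here is how the existing /proc/<pid>/pagemap file ties its
    file offset to a virtual address: it holds one 64-bit entry per virtual
    page, so the offset is derived from the VA, not from a position in some
    text. A minimal sketch follows; the helper names are hypothetical, and
    reading the file needs a Linux system with suitable permissions.

```python
import os
import struct

PAGEMAP_ENTRY_BYTES = 8  # pagemap stores one 64-bit entry per virtual page


def pagemap_offset(vaddr: int, page_size: int) -> int:
    """File offset of the pagemap entry that covers virtual address vaddr."""
    return (vaddr // page_size) * PAGEMAP_ENTRY_BYTES


def read_pagemap_entry(pid: int, vaddr: int) -> int:
    """Seek to the entry for vaddr in /proc/<pid>/pagemap and return it raw."""
    page_size = os.sysconf("SC_PAGE_SIZE")
    with open(f"/proc/{pid}/pagemap", "rb") as f:
        f.seek(pagemap_offset(vaddr, page_size))
        data = f.read(PAGEMAP_ENTRY_BYTES)
    if len(data) != PAGEMAP_ENTRY_BYTES:
        raise ValueError("no pagemap entry at this address")
    return struct.unpack("=Q", data)[0]  # native byte order, unsigned 64-bit


def is_present(entry: int) -> bool:
    """Bit 63 of a pagemap entry reports whether the page is present in RAM."""
    return bool((entry >> 63) & 1)
```

    A caller would use read_pagemap_entry(os.getpid(), some_address) and test
    the result with is_present(); no output is shown here because it depends
    on the running system.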

    Michal Hocko suggested that the currently available move_pages() API could
    be used to collect the VA to NUMA node id information. However, use of the
    numa_vamaps /proc file will be more efficient than move_pages().

    Steven Sistare suggested optimizing move_pages() for the case when
    consecutive 4k page addresses are passed in. I tried out this optimization,
    and the table above shows the performance comparison of the move_pages()
    API vs the numa_vamaps /proc file. Specifically, in the case of sparse
    mappings the optimization to move_pages() does not help much (8.33s vs
    6.29s). The performance benefits seen with the /proc file will make a
    difference from a usability point of view.
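
    To put the table in relative terms, the small sketch below computes how
    many times slower each move_pages() variant is than the /proc file. The
    numbers are copied from the table; it assumes the second column is also in
    seconds, and the function name is hypothetical.

```python
from typing import Dict, Tuple

# (sparse ~1.3 TB, dense 250G hugepage segment), pmap times from the table.
# Assumption: the second column is in seconds like the first.
TIMINGS: Dict[str, Tuple[float, float]] = {
    "move_pages": (8.33, 3.14),
    "optimized move_pages": (6.29, 0.92),
    "/proc numa_vamaps": (0.08, 0.04),
}


def slowdown_vs(timings: Dict[str, Tuple[float, float]],
                baseline: str = "/proc numa_vamaps") -> Dict[str, Tuple[float, ...]]:
    """Ratio of each method's time to the baseline's time, per column."""
    base = timings[baseline]
    return {
        name: tuple(round(t / b, 1) for t, b in zip(times, base))
        for name, times in timings.items()
        if name != baseline
    }


if __name__ == "__main__":
    for name, (sparse, dense) in slowdown_vs(TIMINGS).items():
        print(f"{name}: {sparse}x slower (sparse), {dense}x slower (dense)")
```

    On these numbers the /proc file is roughly 104x faster than plain
    move_pages() and about 79x faster than the optimized variant in the sparse
    case, and about 23x faster than the optimized variant in the dense case.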

    Andrew Morton had asked about the performance difference between the
    move_pages() API and the numa_vamaps /proc file, and also about the use
    case for getting VA to NUMA node id information. I hope the above
    description answers those questions.