On 11/9/2018 11:48 PM, Prakash Sangappa wrote:
> On 9/24/18 10:14 AM, Michal Hocko wrote:
>> On Fri 14-09-18 12:01:18, Steven Sistare wrote:
>>> On 9/14/2018 1:56 AM, Michal Hocko wrote:
>> [...]
>>>> Why does this matter for something that is for analysis purposes?
>>>> Reading the file for the whole address space is far from a free
>>>> operation. Is the page walk optimization really essential for
>>>> usability? Moreover, what prevents the move_pages implementation
>>>> from being clever about the page walk itself? In other words, why
>>>> would we want to add a new API rather than make the existing one
>>>> faster for everybody?
>>> One could optimize move_pages. If the caller passes a consecutive
>>> range of small pages, and the page walk sees that a VA is mapped by
>>> a huge page, then it can return the same NUMA node for each of the
>>> following VAs that fall into the huge page range. It would be faster
>>> than 55 nsec per small page, but it is hard to say how much faster,
>>> and the cost is still driven by the number of small pages.
>> This is exactly what I was arguing for. There is some room for
>> improvement in the existing interface. I have yet to hear an explicit
>> use case that would require even better performance than can be
>> achieved by the existing API.
>
> The above-mentioned optimization to the move_pages() API helps when
> scanning mapped huge pages, but it does not help with large sparse
> mappings in which only a few pages are mapped. Alternatively, consider
> adding page-walk support to the move_pages() implementation and
> enhancing the API (a new flag?) to return address-range-to-NUMA-node
> information. The page-walk optimization would certainly make a
> difference for usability.
>
> Applications (like Oracle DB) can have processes with large sparse
> mappings (in TBs) in which only some areas of the mapped address range
> are accessed, i.e., large portions have no page tables backing them.
> This will become more prevalent on newer systems with multiple TBs of
> memory.
>
> Here is some data from pmap using the move_pages() API, with and
> without the optimization. The following table compares the time pmap
> takes to print the address mappings of a large process with NUMA node
> information, using the move_pages() API vs. the /proc numa_vamaps
> file.
>
> Running the pmap command on a process with 1.3 TB of address space,
> with sparse mappings:
>
>                         ~1.3 TB sparse   250 GB dense (hugepages)
> move_pages                   8.33s               3.14s
> optimized move_pages         6.29s               0.92s
> /proc numa_vamaps            0.08s               0.04s
>
> The second column is pmap time on a 250 GB address range of this
> process, which maps hugepages (THP & hugetlb).

The data look compelling to me. numa_vamaps provides a much smoother
user experience for the analyst who is casting a wide net looking for
the root of a performance issue. There is almost no waiting to see the
data.

- Steve
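
For context on the query mechanics discussed above, below is a minimal
sketch, in C, of how a pmap-like tool can use move_pages(2) to read
per-page NUMA placement. It is illustrative only (not pmap's actual
code) and assumes the libnuma development headers are installed.
Passing nodes = NULL turns the call into a pure query: the kernel fills
status[] with the node of each resident page, or a negative errno
(e.g. -ENOENT) for an unmapped address. For scale, status[] is filled
at base-page granularity, so the 250 GB segment above is ~65.5 million
4 KB entries; at the ~55 nsec/page cost quoted above that is roughly
3.6 s, in line with the measured 3.14 s.

/*
 * Minimal sketch: query (not migrate) per-page NUMA placement via
 * move_pages(2).  Requires the libnuma development headers for
 * <numaif.h>; build with:  gcc -O2 numa_query.c -lnuma
 * (Illustrative only -- not the actual pmap implementation.)
 */
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	long psz = sysconf(_SC_PAGESIZE);
	enum { NPAGES = 16 };
	void *pages[NPAGES];
	int status[NPAGES];

	char *buf = aligned_alloc(psz, (size_t)NPAGES * psz);
	if (!buf) {
		perror("aligned_alloc");
		return 1;
	}
	memset(buf, 0, (size_t)NPAGES * psz);	/* fault the pages in */

	for (int i = 0; i < NPAGES; i++)
		pages[i] = buf + (long)i * psz;

	/*
	 * nodes == NULL makes move_pages() a pure query: status[i] is
	 * set to the NUMA node of the page backing pages[i], or to
	 * -errno (e.g. -ENOENT) if nothing is mapped there.  A pid of
	 * 0 means the calling process.
	 */
	if (move_pages(0, NPAGES, pages, NULL, status, 0)) {
		perror("move_pages");
		return 1;
	}

	for (int i = 0; i < NPAGES; i++) {
		if (status[i] >= 0)
			printf("%p -> node %d\n", pages[i], status[i]);
		else
			printf("%p -> not mapped (%d)\n", pages[i], status[i]);
	}
	free(buf);
	return 0;
}

Because the caller supplies one pointer and receives one status slot
per base page, even when the range is backed by huge pages, the cost
scales with the number of small pages; that is the limitation a
per-address-range interface like /proc numa_vamaps avoids.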