On 24.07.2019 17:10, Joel Fernandes wrote:> On Wed, Jul 24, 2019 at 01:28:42PM +0900, Minchan Kim wrote: >> On Tue, Jul 23, 2019 at 10:20:49AM -0400, Joel Fernandes wrote: >>> On Tue, Jul 23, 2019 at 03:13:58PM +0900, Minchan Kim wrote: >>>> Hi Joel, >>>> >>>> On Mon, Jul 22, 2019 at 05:32:04PM -0400, Joel Fernandes (Google) wrote: >>>>> The page_idle tracking feature currently requires looking up the pagemap >>>>> for a process followed by interacting with /sys/kernel/mm/page_idle. >>>>> This is quite cumbersome and can be error-prone too. If between >>>> >>>> cumbersome: That's the fair tradeoff between idle page tracking and >>>> clear_refs because idle page tracking could check even though the page >>>> is not mapped. >>> >>> It is fair tradeoff, but could be made simpler. The userspace code got >>> reduced by a good amount as well. >>> >>>> error-prone: What's the error? >>> >>> We see in normal Android usage, that some of the times pages appear not to be >>> idle even when they really are idle. Reproducing this is a bit unpredictable >>> and happens at random occasions. With this new interface, we are seeing this >>> happen much much lesser. >> >> I don't know how you did test. Maybe that could be contributed by >> swapping out or shared pages touched by other processes or some kernel >> behavior not to keep access bit of their operation. > > It could be something along these lines is my thinking as well. So we know > its already has issues due to what you mentioned, I am not sure what else > needs investigation? > >> Please investigate more what's the root cause. That would be important >> point to justify for the patch motivation. > > The motivation is security. I am dropping the 'accuracy' factor I mentioned > from the patch description since it created a lot of confusion. If you are tracking idle working set for one process you could use degrading 'accuracy' for good - just don't walk page rmap and play only with access bits in one process. Foreign access could be detected with arbitrary delay, but this does not important if main goal is heap profiling. > >>>>> More over looking up PFN from pagemap in Android devices is not >>>>> supported by unprivileged process and requires SYS_ADMIN and gives 0 for >>>>> the PFN. >>>>> >>>>> This patch adds support to directly interact with page_idle tracking at >>>>> the PID level by introducing a /proc/<pid>/page_idle file. This >>>>> eliminates the need for userspace to calculate the mapping of the page. >>>>> It follows the exact same semantics as the global >>>>> /sys/kernel/mm/page_idle, however it is easier to use for some usecases >>>>> where looking up PFN is not needed and also does not require SYS_ADMIN. >>>> >>>> Ah, so the primary goal is to provide convinience interface and it would >>>> help accurary, too. IOW, accuracy is not your main goal? >>> >>> There are a couple of primary goals: Security, conveience and also solving >>> the accuracy/reliability problem we are seeing. Do keep in mind looking up >>> PFN has security implications. The PFN field in pagemap is zeroed if the user >>> does not have CAP_SYS_ADMIN. >> >> Myaybe you don't need PFN. is it? > > With the traditional idle tracking, PFN is needed which has the mentioned > security issues. This patch solves it. And the interface is identical and > familiar to the existing page_idle bitmap interface. > >>>>> In Android, we are using this for the heap profiler (heapprofd) which >>>>> profiles and pin points code paths which allocates and leaves memory >>>>> idle for long periods of time. >>>> >>>> So the goal is to detect idle pages with idle memory tracking? >>> >>> Isn't that what idle memory tracking does? >> >> To me, it's rather misleading. Please read motivation section in document. >> The feature would be good to detect workingset pages, not idle pages >> because workingset pages are never freed, swapped out and even we could >> count on newly allocated pages. >> >> Motivation >> ========== >> >> The idle page tracking feature allows to track which memory pages are being >> accessed by a workload and which are idle. This information can be useful for >> estimating the workload's working set size, which, in turn, can be taken into >> account when configuring the workload parameters, setting memory cgroup limits, >> or deciding where to place the workload within a compute cluster. > > As we discussed by chat, we could collect additional metadata to check if > pages were swapped or freed ever since the time we marked them as idle. > However this can be incremental improvement. > >>>> It couldn't work well because such idle pages could finally swap out and >>>> lose every flags of the page descriptor which is working mechanism of >>>> idle page tracking. It should have named "workingset page tracking", >>>> not "idle page tracking". >>> >>> The heap profiler that uses page-idle tracking is not to measure working set, >>> but to look for pages that are idle for long periods of time. >> >> It's important part. Please include it in the description so that people >> understands what's the usecase. As I said above, if it aims for finding >> idle pages durting the period, current idle page tracking feature is not >> good ironically. > > Ok, I will mention. > >>> Thanks for bringing up the swapping corner case.. Perhaps we can improve >>> the heap profiler to detect this by looking at bits 0-4 in pagemap. While it >> >> Yeb, that could work but it could add overhead again what you want to remove? >> Even, userspace should keep metadata to identify that page was already swapped >> in last period or newly swapped in new period. > > Yep. Between samples page could be read from swap and swapped out back multiple times. For tracking this swap ptes could be marked with idle bit too. I believe it's not so hard to find free bit for this. Refault\swapout will automatically clear this bit in pte even if page goes nowhere stays if swap-cache.