On Sun, 19 Jul 2015 15:31:09 +0300 Vladimir Davydov <vdavydov@xxxxxxxxxxxxx> wrote:

> Hi,
>
> This patch set introduces a new user API for tracking user memory pages
> that have not been used for a given period of time. The purpose of this
> is to provide the userspace with the means of tracking a workload's
> working set, i.e. the set of pages that are actively used by the
> workload. Knowing the working set size can be useful for partitioning
> the system more efficiently, e.g. by tuning memory cgroup limits
> appropriately, or for job placement within a compute cluster.
>
> It is based on top of v4.2-rc2-mmotm-2015-07-15-16-46
> It applies without conflicts to v4.2-rc2-mmotm-2015-07-17-16-04 as well
>
> ---- USE CASES ----
>
> The unified cgroup hierarchy has memory.low and memory.high knobs, which
> are defined as the low and high boundaries for the workload working set
> size. However, the working set size of a workload may be unknown or
> change in time. With this patch set, one can periodically estimate the
> amount of memory unused by each cgroup and tune their memory.low and
> memory.high parameters accordingly, therefore optimizing the overall
> memory utilization.
>
> Another use case is balancing workloads within a compute cluster.
> Knowing how much memory is not really used by a workload unit may help
> take a more optimal decision when considering migrating the unit to
> another node within the cluster.
>
> Also, as noted by Minchan, this would be useful for per-process reclaim
> (https://lwn.net/Articles/545668/). With idle tracking, we could reclaim idle
> pages only by smart user memory manager.
>
> ---- USER API ----
>
> The user API consists of two new proc files:
>
>  * /proc/kpageidle. This file implements a bitmap where each bit corresponds
>    to a page, indexed by PFN.

What are the bit mappings?  If I read the first byte of /proc/kpageidle
I get PFN #0 in bit zero of that byte?
And the second byte of /proc/kpageidle contains PFN #8 in its LSB,
etc?  Maybe this is covered in the documentation file.

>    When the bit is set, the corresponding page is
>    idle. A page is considered idle if it has not been accessed since it was
>    marked idle.

Perhaps we can spell out in some detail what "accessed" means?  I see
you've hooked into mark_page_accessed(), so a read from disk is an
access.  What about a write to disk?  And what about a page being
accessed from some random device (could hook into get_user_pages()?)
Is getting written to swap an access?  When a dirty pagecache page is
written out by kswapd or direct reclaim?

This also should be in the permanent documentation.

>    To mark a page idle one should set the bit corresponding to the
>    page by writing to the file. A value written to the file is OR-ed with the
>    current bitmap value. Only user memory pages can be marked idle, for other
>    page types input is silently ignored. Writing to this file beyond max PFN
>    results in the ENXIO error. Only available when CONFIG_IDLE_PAGE_TRACKING is
>    set.
>
>    This file can be used to estimate the amount of pages that are not
>    used by a particular workload as follows:
>
>    1. mark all pages of interest idle by setting corresponding bits in the
>       /proc/kpageidle bitmap
>    2. wait until the workload accesses its working set
>    3. read /proc/kpageidle and count the number of bits set

Security implications.  This interface could be used to learn about a
sensitive application by poking data at it and then observing its
memory access patterns.  Perhaps this is why the proc files are
root-only (which I assume is sufficient).  Some words here about the
security side of things and the reasoning behind the chosen permissions
would be good to have.

>  * /proc/kpagecgroup. This file contains a 64-bit inode number of the
>    memory cgroup each page is charged to, indexed by PFN.

Actually "closest online ancestor".  This also should be in the
interface documentation.
>    Only available when CONFIG_MEMCG is set.

CONFIG_MEMCG and CONFIG_IDLE_PAGE_TRACKING I assume?

>
>    This file can be used to find all pages (including unmapped file
>    pages) accounted to a particular cgroup. Using /proc/kpageidle, one
>    can then estimate the cgroup working set size.
>
> For an example of using these files for estimating the amount of unused
> memory pages per each memory cgroup, please see the script attached
> below.

Why were these put in /proc anyway?  Rather than under /sys/fs/cgroup
somewhere?  Presumably because /proc/kpageidle is useful in non-memcg
setups.

> ---- PERFORMANCE EVALUATION ----

"^___" means "end of changelog".  Perhaps that should have been
"^---\n" - unclear.

>  Documentation/vm/pagemap.txt           |  22 ++-

I think we'll need quite a lot more than this to fully describe the
interface?
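
To make the layout questions above concrete, here is a minimal sketch of
the mark / wait / count procedure run against in-memory buffers rather
than the real proc files.  It is not taken from the patch set.  It
assumes the bit layout guessed at above (PFN #0 in bit zero of byte
zero, PFN #8 in the LSB of byte one), which the cover letter does not
confirm, and it assumes /proc/kpagecgroup entries are 64-bit values in
native byte order.  All function names are hypothetical.

```python
import struct


def pfn_to_bit(pfn):
    """Map a PFN to (byte offset, bit index).

    ASSUMPTION: LSB-first packing, 8 PFNs per byte. This is the layout
    asked about in the review, not one the patch is known to guarantee.
    """
    return pfn // 8, pfn % 8


def mark_idle(bitmap, pfns):
    """Set the bits for the given PFNs in a bytearray.

    Mimics the described write semantics: written bits are OR-ed into
    the current bitmap, so setting an already-set bit is a no-op.
    """
    for pfn in pfns:
        byte_off, bit = pfn_to_bit(pfn)
        bitmap[byte_off] |= 1 << bit


def count_idle(bitmap, pfns):
    """Count how many of the given PFNs still have their idle bit set."""
    count = 0
    for pfn in pfns:
        byte_off, bit = pfn_to_bit(pfn)
        if bitmap[byte_off] >> bit & 1:
            count += 1
    return count


def cgroup_inode(kpagecgroup, pfn):
    """Return the 64-bit cgroup inode number recorded for a PFN.

    ASSUMPTION: one native-endian u64 per PFN, at offset pfn * 8.
    """
    return struct.unpack_from("=Q", kpagecgroup, pfn * 8)[0]


def idle_pages_of_cgroup(bitmap, kpagecgroup, inode, max_pfn):
    """Count idle pages among PFNs [0, max_pfn) charged to one cgroup."""
    pfns = [p for p in range(max_pfn) if cgroup_inode(kpagecgroup, p) == inode]
    return count_idle(bitmap, pfns)
```

Pointing this at the real files would additionally need root, and would
need to follow whatever access granularity and definition of "accessed"
the final documentation specifies.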