Matthew Wilcox <willy@xxxxxxxxxxxxx> writes:

> On Fri, Feb 17, 2023 at 05:28:09PM +0530, Aneesh Kumar K V wrote:
>> PowerPC architecture (POWER10) supports a Hot/Cold page tracking
>> facility that provides access counter and access affinity details at
>> configurable page size granularity [1]. I have been looking at using
>
> Does that advert contain any more information about this feature than:
>
> Hot/Cold page tracking | Recording for memory management

I will work with the hardware team to see if I can get a writeup done
for us before the conference.

But I am also interested in discussing things like who bears the cost
of acting on hotness. Since a facility like this operates on physical
address ranges, we will mostly be doing this outside the process
context. For example, I could see the possibility of a kpromoted
daemon that looks at the youngest generation in MGLRU and, based on
relative hotness, moves hot pages to the NUMA node from which they are
accessed most frequently. Should kpromoted do the migration itself? Or
should it mark the pages migration-ready (something like prot numa) so
that the task migrates the page on its next access?

One of the other challenges I ran into is determining relative
hotness. In most cases what we need is the relative hotness of a page,
not its absolute access count. I also noticed that with the MongoDB
test, performance varies a lot depending on how we determine relative
hotness.

> because I'd like to understand what its limitations are -- can
> it be a per-VMA option, for example? Or is it set at bootup like
> CONFIG_PAGE_SIZE?

The hardware counters supported on POWER10 are based on physical
addresses. The facility counts accesses across a physical address
range; for each page there is a counter that records the access count
along with which node accessed the page. The page size is
configurable, and in the POC I used CONFIG_PAGE_SIZE.
There is overhead in enabling/disabling the facility, and I haven't
looked at doing that at something like context-switch granularity.
Also, it monitors a physical address range, and I am not sure how we
can make that work for a VMA range or a task's address space.

> For file-backed memory, the page cache will use variable sized
> folios, depending on what it determines to be a useful granularity.
> I'm _expecting_ something of the same sort for anonymous memory, although
> maybe we'll make that determination on a per-VMA basis and make all
> folios within a VMA the same size.

-aneesh