Matthew Wilcox <willy@xxxxxxxxxxxxx> writes:

> On Fri, Feb 17, 2023 at 05:28:09PM +0530, Aneesh Kumar K V wrote:
>> PowerPC architecture (POWER10) supports a Hot/Cold page tracking
>> facility that provides access counter and access affinity details at
>> configurable page size granularity [1]. I have been looking at using
>
> Does that advert contain any more information about this feature than:
>
> Hot/Cold page tracking | Recording for memory management

I will work with the hardware team to see if I can get a writeup done
for us before the conference.

But I am also interested in discussing things like who bears the cost
of acting on hotness. Since a facility like this operates on physical
address ranges, we will mostly be doing this outside the process
context. For example, I could see the possibility of a kpromoted
daemon that looks at the youngest generation in MGLRU and, based on
relative hotness, moves hot pages to the NUMA node from which they are
accessed most frequently. Should kpromoted do the migration itself? Or
should it mark the pages migration-ready (something like prot numa) so
that the task migrates the page on its next access?

One of the other challenges I ran into is determining relative
hotness. In most cases what we need is the relative hotness of a page,
not its absolute access count. I also noticed that with the MongoDB
test, performance varies a lot depending on how we determine relative
hotness.

> because I'd like to understand what its limitations are -- can
> it be a per-VMA option, for example? Or is it set at bootup like
> CONFIG_PAGE_SIZE?

The hardware counters supported on POWER10 are based on physical
addresses. The facility counts accesses across a physical address
range; for each page there is a counter that records the access count
along with which node accessed the page. The page size is
configurable, and in the POC I used CONFIG_PAGE_SIZE.
There is overhead in enabling/disabling the facility, and I haven't
looked at doing that at something like context-switch granularity.
Also, it monitors a physical address range, and I am not sure how we
can make that work for a VMA range or a task's address space.

> For file-backed memory, the page cache will use variable sized
> folios, depending on what it determines to be a useful granularity.
> I'm _expecting_ something of the same sort for anonymous memory, although
> maybe we'll make that determination on a per-VMA basis and make all
> folios within a VMA the same size.

-aneesh