Re: [RFC v1] mm: add page preemption

Hillf Danton <hdanton@xxxxxxxx> · Tue, 22 Oct 2019 20:14:39 +0800

On Mon, 21 Oct 2019 14:27:28 +0200 Michal Hocko wrote:
> 
> On Sun 20-10-19 21:43:04, Hillf Danton wrote:
> > 
> > Unlike cpu preemption, page preemption would have been a two-edge
> > option for quite a while. It is added by preventing tasks from
> > reclaiming as many lru pages as possible from other tasks of
> > higher priorities.
> 
> This really begs for more explanation.
> I would start with what page preemption actually is.

Page preemption is a system-provided facility that lets a task
preempt lower-priority tasks for pages, in the same manner as it
does for the cpu.

> Why do we care and which workloads would benefit and how much.

Page preemption, disabled by default, should be turned on by those
who want the performance of their workloads to survive memory
pressure to a certain extent.

The number of pp users is expected to be close to the number of
people who change the nice value of their apps to -1 or higher at
least once a week, fewer than the vi users among the UK's
undergraduates.

> And last but not least why the existing infrastructure doesn't help
> (e.g. if you have clearly defined workloads with different
> memory consumption requirements then why don't you use memory cgroups to
> reflect the priority).

Good question:)

Though pp is implemented by preventing any task from reclaiming as many
pages as possible from other tasks of higher priority, what it is
trying to do is introduce prio into page reclaiming, i.e. to add a
feature.

Page and memcg are different objects after all; pp is being added at
the page granularity. It should be an option available in environments
without memcg enabled.

What differs greatly from the protections offered by memory cgroup
is that pages protected by memcg:min/low can't be reclaimed regardless
of memory pressure. Such a guarantee is not available under pp, as it
only adds an extra factor to consider when deactivating lru pages.

Adding prio to the memory controller is another good topic, already
queued after pp and memcg lru on the todo list.

> > Who need pp?
> > Users who want to manage/control jitters in lru pages under memory
> > pressure. Way in parallel to scheduling with task's memory footprint
> > taken into account, pp makes task prio a part of page reclaiming.
> 
> How do you assign priority to generally shared pages?

It is solved by setting a page's prio only when the page is added to
the lru. The prio never changes thereafter.
There is a helper, copy_page_prio(new_page, old_page), for scenarios
like page migration.
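
To make the set-once rule concrete, here is a minimal userspace sketch,
not the code in the patch. Only copy_page_prio() is named above;
struct page here is a toy stand-in, and set_page_prio(),
page_prio_higher() and the prio_set flag are hypothetical names made up
for illustration. It assumes the task->prio convention that a
numerically smaller value means a higher priority.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct page; only what the discussion needs. */
struct page {
	int prio;      /* prio of the owner task; smaller value = higher prio */
	bool prio_set; /* hypothetical flag: prio was recorded at lru-add time */
};

/* Record the owner's prio when the page is added to the lru.
 * Later calls are no-ops, so a shared page keeps its first prio. */
static void set_page_prio(struct page *page, int task_prio)
{
	assert(page != NULL);
	if (page->prio_set)
		return;
	page->prio = task_prio;
	page->prio_set = true;
}

/* Carry the prio over to a replacement page, e.g. on migration. */
static void copy_page_prio(struct page *new_page, const struct page *old_page)
{
	assert(new_page != NULL && old_page != NULL);
	new_page->prio = old_page->prio;
	new_page->prio_set = old_page->prio_set;
}

/* True if the page's owner outranks the task doing the reclaiming. */
static bool page_prio_higher(const struct page *page, int reclaimer_prio)
{
	assert(page != NULL);
	return page->prio < reclaimer_prio;
}
```

In this model a page first added to the lru by a prio-120 task keeps
120 even if a prio-100 task touches it later, and a reclaimer running
at prio 130 would see page_prio_higher() return true for it.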

> > [Off topic: prio can also be defined and added in memory controller
> > and then plays a role in memcg reclaiming, for example check prio
> > when selecting victim memcg.]
> > 
> > First on the page side, page->prio that is used to mirror the prio
> > of page owner tasks is added, and a couple of helpers for setting,
> > copying and comparing page->prio to help to add pages to lru.
> > 
> > Then on the reclaimer side, pgdat->kswapd_prio is added to mirror
> > the prio of tasks that wake up the kthread, and it is updated
> > every time kswapd raises its reclaiming priority.
> 
> This sounds like a very bad design to me. You essentially hand over
> to a completely detached context while you want to handle priority
> inversion problems (or at least that is what I think you want).

What was added on the reclaimer side is:

1, kswapd sets pgdat->kswapd_prio, the switch between page reclaimer
   and allocator in terms of prio, to the lowest value before taking
   a nap.

2, any allocator is able to wake up the reclaimer because the switch
   holds the lowest prio, and kswapd starts reclaiming pages using the
   waker's prio.

3, when an allocator comes while kswapd is active, its prio is checked:
   it is a no-op if kswapd is higher on prio; otherwise the switch is
   updated with the higher prio.

4, every time kswapd raises sc.priority, which starts at DEF_PRIORITY,
   it checks whether there is a pending update of the switch; kswapd's
   prio steps up if there is one, thus its prio never steps down, and
   there is no prio inversion.

5, goto 1 when kswapd finishes its work.
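
The five steps can be modelled in userspace roughly as below. This is
a sketch under assumptions, not the patch itself: struct pgdat_model,
PP_PRIO_LOWEST, reclaim_prio, the active flag and all three function
names are hypothetical. Only kswapd_prio, sc.priority and DEF_PRIORITY
come from the description. A numerically smaller prio is taken to mean
a higher priority, and raising sc.priority is modelled as decrementing
it, as vmscan does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PP_PRIO_LOWEST 140 /* hypothetical sentinel, lower than any task */
#define DEF_PRIORITY   12  /* initial scan priority */

struct pgdat_model {
	int kswapd_prio;  /* the switch that allocators write to */
	int reclaim_prio; /* prio kswapd is currently reclaiming with */
	bool active;      /* kswapd is awake */
};

/* Steps 1 and 5: reset the switch to the lowest prio before napping. */
static void kswapd_sleep(struct pgdat_model *p)
{
	assert(p != NULL);
	p->kswapd_prio = PP_PRIO_LOWEST;
	p->reclaim_prio = PP_PRIO_LOWEST;
	p->active = false;
}

/* Steps 2 and 3: allocator side. Returns true if the switch changed. */
static bool wakeup_kswapd_model(struct pgdat_model *p, int waker_prio)
{
	assert(p != NULL);
	if (waker_prio >= p->kswapd_prio)
		return false; /* kswapd is already at an equal or higher prio */
	p->kswapd_prio = waker_prio; /* leave a pending update */
	if (!p->active) {
		p->active = true;
		p->reclaim_prio = waker_prio; /* start with the waker's prio */
	}
	return true;
}

/* Step 4: called each time kswapd raises sc.priority. */
static void kswapd_raise_priority(struct pgdat_model *p, int *sc_priority)
{
	assert(p != NULL && sc_priority != NULL);
	if (*sc_priority > 0)
		(*sc_priority)--; /* smaller value = more aggressive scan */
	if (p->kswapd_prio < p->reclaim_prio)
		p->reclaim_prio = p->kswapd_prio; /* step up, never down */
}
```

Because the switch only ever moves toward a higher prio while kswapd
is awake, and reclaim_prio only ever follows it, the prio kswapd
reclaims with cannot drop until the reset in kswapd_sleep().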

Thanks
Hillf