Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Swap Abstraction "the pony"

Chris Li <chrisl@xxxxxxxxxx> · Fri, 31 May 2024 17:43:00 -0700

On Thu, May 30, 2024 at 8:12 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>
> On Thu, May 30, 2024 at 03:53:49PM -0700, Chris Li wrote:
> > On Wed, May 29, 2024 at 5:33 AM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> > > > Where the anonymous memory case, the dirty page does not have to write
> > > > to swap. It is optional, so which page you choose to swap out is
> > > > critical, you want to swap out the coldest page, the page that is
> > > > least likely to get swapin. Therefore, the LRU makes sense.
> > >
> > > Disagree.  There are two things you want and the LRU serves neither
> > > particularly well.  One is that when you want to reclaim memory, you
> > > want to find some memory that is likely to not be accessed in the next
> > > few seconds/minutes/hours.  It doesn't need to be the coldest, just in
> > > (say) the coldest 10% or so of memory.  And it needs to already be clean,
> > > otherwise you have to wait for it to writeback, and you can't afford that.
> >
> > Do you disagree that LRU is necessary or the way we use the LRU?
>
> I think we should switch to a scheme where we just don't use an LRU at
> all.

I would love to hear more details on how to achieve that. Can you elaborate?

>
> > In order to get the coldest 10% or so pages, assume you still need to
> > maintain an LRU, no?
>
> I don't think that's true.  If you reframe the problem as "we need to
> find some of the coldest pages in the system", then you can use a
> different scheme.

If there is a way to do reclaim without using an LRU at all, that
would be something that could replace both the traditional LRU and MGLRU.
But "we need to find some of the coldest pages in the system" is not
enough for anonymous memory.

You want to find and reclaim from the coldest memory first; if that is
not enough, you need to reclaim from the second-coldest memory. The
threshold is a moving target that depends on the memory pressure.
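
To illustrate what I mean by a moving target, here is a minimal sketch
in Python. It is not kernel code. The page names, the idle times and the
helper reclaim_coldest() are all made up for illustration; the only
point is that the idle-time cutoff falls out of how much we need to
reclaim, so it cannot be fixed in advance.

```python
import heapq


def reclaim_coldest(idle_seconds, nr_to_reclaim):
    """Pick the nr_to_reclaim longest-idle pages (hypothetical helper).

    Returns (victims, threshold), where threshold is the smallest idle
    time among the victims, i.e. how "warm" we had to go to meet demand.
    """
    victims = heapq.nlargest(nr_to_reclaim, idle_seconds, key=idle_seconds.get)
    threshold = min(idle_seconds[p] for p in victims) if victims else None
    return victims, threshold


# Hypothetical pages and idle times in seconds (not measured data).
idle = {"p0": 900, "p1": 5, "p2": 300, "p3": 60, "p4": 1}

light = reclaim_coldest(idle, 1)  # light memory pressure
heavy = reclaim_coldest(idle, 3)  # heavier memory pressure
print(light, heavy)
```

Under light pressure only the coldest page is taken; under heavier
pressure the cutoff drops and warmer pages become victims as well.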

>
> > > The second thing you need to be able to do is find pages which are
> > > already dirty, and not likely to be written to soon, and write those
> > > back so they join the pool of clean pages which are eligible for reclaim.
> > > Again, the LRU isn't really the best tool for the job.
> >
> > It seems you need to LRU to find which pages qualify for write back.
> > It should be both dirty and cold.
> >
> > The question is, can you do the reclaim write back without LRU for
> > anonymous pages?
> > If LRU is unavoidable, then it is necessarily evil.
>
> The point I was trying to make is that a simple physical scan is 40x
> faster.  So if you just scan N pages, starting from wherever you left
> off the scan last time, and even 1/10 of them are eligible for
> reclaiming (not referenced since last time the clock hand swept past it,
> perhaps), you're still reclaiming 4x as many pages as doing an LRU scan.

I feel that I am missing something. In your 40x faster scan, do you
still scan the page table entries (PTEs) for the accessed bit or not?
If not, I fail to see how you can get the dirty information in the
first place. Unmapping a page can get that information, but at a very
high price.
If yes, then your scan order is not physical anyway: you need to find
the PTE locations and scan those, and they are not going to be in pfn
order.

Also, when reclaiming for a cgroup, you want to scan for memory that
belongs to this cgroup. The pages used by this cgroup will be all
over the place, so you wouldn't be able to do a linear pfn scan
unless you are willing to scan a lot of pages that do not belong to
this cgroup. The CPU prefetching and caching that contribute to that
40x speedup would go out of the window.
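
A back-of-the-envelope version of that arithmetic, as a Python sketch.
The 40x and 1/10 figures are the ones quoted above. The cgroup share of
5% is a number I made up for illustration, and the function name is
hypothetical. It also generously assumes the 40x still holds for a
cgroup scan and that every page the LRU scan visits is usable.

```python
def physical_scan_advantage(speedup, eligible_fraction, cgroup_fraction=1.0):
    """Pages reclaimed by a physical scan relative to an LRU scan.

    speedup:           how much faster the physical scan visits pages
    eligible_fraction: share of scanned pages that can be reclaimed
    cgroup_fraction:   share of scanned pages owned by the target cgroup
    """
    return speedup * eligible_fraction * cgroup_fraction


# System-wide reclaim: 40x faster, 1/10 of the pages eligible.
system_wide = physical_scan_advantage(40, 0.1)

# Reclaim for one cgroup owning a hypothetical 5% of physical memory.
one_cgroup = physical_scan_advantage(40, 0.1, cgroup_fraction=0.05)
print(system_wide, one_cgroup)
```

With these numbers the system-wide case comes out around 4x, matching
the estimate above, but the single-cgroup case drops below 1x, so the
physical scan would reclaim fewer pages than the LRU scan.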

>
> > > > In VMA swap out, the question is, which VMA you choose from first? To
> > > > make things more complicated, the same page can map into different
> > > > processes in more than one VMA as well.
> > >
> > > This is why we have the anon_vma, to handle the same pages mapped from
> > > multiple VMAs.
> >
> > Can you clarify when you use anon_vma to organize the swap out and
> > swap in, do you want to write a range of pages rather than just one
> > page at a time? Will write back a sub list of the LRU work for you?
> > Ideally we shouldn't write back pages that are hot. anon_vma alone
> > does not give us that information.
>
> So filesystems do write back all pages in an inode that are dirty,
> regardless of whether they're hot.  But, as noted, we do like to
> get the pagecache written back periodically even if the pages are
> going to be redirtied soon.  And this is somewhere that I think there's

Yes, I think there is a critical difference between file systems and
anonymous memory in this regard. In a file system, writing out all
dirty pages is more or less OK. It needs to happen eventually anyway.
In anonymous memory, however, writing out dirty memory has a cost
associated with it. It needs to allocate a swap entry, put the page in
the swap cache, etc. We want to minimize swapping out pages that are hot.

> a difference between anon & file pages.  So maybe the algorithm looks
> something like this:
>
> A: write page fault causes page to be created

You are talking about a swap-in page fault, right? Are you only going to
write out pages that have recently been swapped in?

> B: scan unmaps page, marks it dirty, does not start writeout

Sorry for all the questions, I just want to make sure I understand what
you are saying correctly.
1) Scan in what order? In pfn order, or following the anon_vma and
scanning all pages in that anon_vma?
2) Which pages does the scan unmap? All pages in the anon_vma, or only
the pages that recently had a swap-in page fault in step A?

> C: scan finds dirty, unmapped anon page, starts writeout

Can you clarify "scan finds dirty": where does the "dirty" come from?
Does it only use step B above, or does it also involve scanning the PTE
dirty/accessed bits by LRU/MGLRU?
I think you mean the dirty comes from step B, I just want to make sure.

> D: scan finds clean unmapped anon page, frees it

It seems you are unmapping pages and using the resulting page faults to
detect whether a page is needed, which is much more expensive than
scanning the PTE dirty/accessed bits.

>
> so it will actually take three trips around the whole of memory for
> the physical scan to evict an anon page.  That should be adequate
> time for a workload to fault back in a page that's actually hot.
> (if a page fault finds a page in state B, it transitions back to state
> A and gets three more trips around the clock).

That seems limited to reclaiming pages that were already swapped out and
then recently swapped in.

How does it reclaim the first page when no page has been swapped out
previously? It seems that would require step B to unmap all scanned
pages, not just the swapped-in ones. That would be a big performance
hit. I still feel that I am missing something in your steps A -> D.
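
To check my reading of steps A -> D, here is a toy model in Python. It
is only my interpretation, not your design. It assumes step B unmaps
every page the scan passes over, and that the writeout started in step
C has completed by the next pass. The names State, scan_pass and fault
are made up for this sketch.

```python
from enum import Enum


class State(Enum):
    MAPPED = "A"          # page created by a write page fault
    UNMAPPED_DIRTY = "B"  # scan unmapped it, marked it dirty, no writeout
    UNMAPPED_CLEAN = "C"  # scan started writeout; assumed done by next pass
    FREED = "D"           # scan found it clean and unmapped, and freed it


NEXT = {
    State.MAPPED: State.UNMAPPED_DIRTY,
    State.UNMAPPED_DIRTY: State.UNMAPPED_CLEAN,
    State.UNMAPPED_CLEAN: State.FREED,
    State.FREED: State.FREED,
}


def scan_pass(pages):
    """One trip of the clock hand: advance every page by one state."""
    for pfn, state in enumerate(pages):
        pages[pfn] = NEXT[state]


def fault(pages, pfn):
    """A fault on a page in state B puts it back in state A."""
    if pages[pfn] is State.UNMAPPED_DIRTY:
        pages[pfn] = State.MAPPED
        return True
    return False


pages = [State.MAPPED] * 4  # pfn 0 is hot, pfns 1..3 are never touched again
hot_faults = 0
for _ in range(3):
    scan_pass(pages)
    hot_faults += fault(pages, 0)  # the hot page is touched after each pass
print([s.value for s in pages], hot_faults)
```

In this model the cold pages are freed after three passes, as you
describe, but the hot page takes one page fault per pass just to stay
resident. That per-pass fault is the cost I am worried about.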

Chris