On Thu, May 30, 2024 at 03:53:49PM -0700, Chris Li wrote:
> On Wed, May 29, 2024 at 5:33 AM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> > > Where the anonymous memory case, the dirty page does not have to write
> > > to swap. It is optional, so which page you choose to swap out is
> > > critical, you want to swap out the coldest page, the page that is
> > > least likely to get swapin. Therefore, the LRU makes sense.
> >
> > Disagree. There are two things you want and the LRU serves neither
> > particularly well. One is that when you want to reclaim memory, you
> > want to find some memory that is likely to not be accessed in the next
> > few seconds/minutes/hours. It doesn't need to be the coldest, just in
> > (say) the coldest 10% or so of memory. And it needs to already be clean,
> > otherwise you have to wait for it to writeback, and you can't afford that.
>
> Do you disagree that LRU is necessary or the way we use the LRU?

I think we should switch to a scheme where we just don't use an LRU at all.

> In order to get the coldest 10% or so pages, assume you still need to
> maintain an LRU, no?

I don't think that's true. If you reframe the problem as "we need to
find some of the coldest pages in the system", then you can use a
different scheme.

> > The second thing you need to be able to do is find pages which are
> > already dirty, and not likely to be written to soon, and write those
> > back so they join the pool of clean pages which are eligible for reclaim.
> > Again, the LRU isn't really the best tool for the job.
>
> It seems you need to LRU to find which pages qualify for write back.
> It should be both dirty and cold.
>
> The question is, can you do the reclaim write back without LRU for
> anonymous pages?
> If LRU is unavoidable, then it is necessarily evil.

The point I was trying to make is that a simple physical scan is 40x faster.
So if you just scan N pages, starting from wherever you left off the
scan last time, and even 1/10 of them are eligible for reclaiming
(not referenced since last time the clock hand swept past it, perhaps),
you're still reclaiming 4x as many pages as doing an LRU scan.

> > > In VMA swap out, the question is, which VMA you choose from first? To
> > > make things more complicated, the same page can map into different
> > > processes in more than one VMA as well.
> >
> > This is why we have the anon_vma, to handle the same pages mapped from
> > multiple VMAs.
>
> Can you clarify when you use anon_vma to organize the swap out and
> swap in, do you want to write a range of pages rather than just one
> page at a time? Will write back a sub list of the LRU work for you?
> Ideally we shouldn't write back pages that are hot. anon_vma alone
> does not give us that information.

So filesystems do write back all pages in an inode that are dirty,
regardless of whether they're hot. But, as noted, we do like to get
the pagecache written back periodically even if the pages are going to
be redirtied soon.

And this is somewhere that I think there's a difference between anon &
file pages. So maybe the algorithm looks something like this:

A: write page fault causes page to be created
B: scan unmaps page, marks it dirty, does not start writeout
C: scan finds dirty, unmapped anon page, starts writeout
D: scan finds clean unmapped anon page, frees it

so it will actually take three trips around the whole of memory for
the physical scan to evict an anon page. That should be adequate time
for a workload to fault back in a page that's actually hot.

(if a page fault finds a page in state B, it transitions back to state
A and gets three more trips around the clock).
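
To make the A-D sequence above concrete, here is a toy userspace model of
it. This is only a sketch and not kernel code: the State enum, sweep(),
fault() and trips_until_freed() are hypothetical names invented for the
illustration, and it assumes that writeout started on one trip has
completed by the time the clock hand comes around again.

```python
from enum import Enum


class State(Enum):
    MAPPED = "A"          # created by a write fault, still mapped
    UNMAPPED_DIRTY = "B"  # scan unmapped it and marked it dirty
    CLEAN = "C"           # writeout started (assumed done by the next sweep)
    FREED = "D"           # scan found it clean and unmapped, and freed it


# What one visit of the clock hand does to a page in each state.
SWEEP = {
    State.MAPPED: State.UNMAPPED_DIRTY,
    State.UNMAPPED_DIRTY: State.CLEAN,
    State.CLEAN: State.FREED,
    State.FREED: State.FREED,
}


def sweep(pages):
    """One full trip of the clock hand over every page, in physical order."""
    return [SWEEP[state] for state in pages]


def fault(pages, index):
    """Page fault on pages[index]: a page in state B goes back to state A."""
    if pages[index] is State.UNMAPPED_DIRTY:
        pages[index] = State.MAPPED
    # Faults in the other states are not described above, so this sketch
    # leaves such pages untouched.


def trips_until_freed(pages, index, fault_after=None):
    """Count sweeps until pages[index] is freed.

    If fault_after is given, the page is faulted once, right after that
    many sweeps have completed.
    """
    trips = 0
    while pages[index] is not State.FREED:
        pages = sweep(pages)
        trips += 1
        if trips == fault_after:
            fault(pages, index)
    return trips


cold = trips_until_freed([State.MAPPED], 0)
hot = trips_until_freed([State.MAPPED], 0, fault_after=1)
print(cold, hot)
```

In this model an untouched page is freed on the third trip, while a page
that is faulted back in while in state B restarts at A and needs three
more trips from that point.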