Re: [PATCH RFC] mm: map zero-filled pages to zero_pfn while doing swap-in

Johannes Weiner <hannes@xxxxxxxxxxx> · Thu, 12 Dec 2024 08:25:08 -0800

On Thu, Dec 12, 2024 at 10:16:22PM +1300, Barry Song wrote:
> On Thu, Dec 12, 2024 at 9:51 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
> >
> > On 12.12.24 09:46, Barry Song wrote:
> > > On Thu, Dec 12, 2024 at 9:29 PM Christoph Hellwig <hch@xxxxxxxxxxxxx> wrote:
> > >>
> > >> On Thu, Dec 12, 2024 at 08:37:11PM +1300, Barry Song wrote:
> > >>> From: Barry Song <v-songbaohua@xxxxxxxx>
> > >>>
> > >>> While developing the zeromap series, Usama observed that certain
> > >>> workloads may contain over 10% zero-filled pages. This may present
> > >>> an opportunity to save memory by mapping zero-filled pages to zero_pfn
> > >>> in do_swap_page(). If a write occurs later, do_wp_page() can
> > >>> allocate a new page using the Copy-on-Write mechanism.
> > >>
> > >> Shouldn't this be done during, or rather instead of swap out instead?
> > >> Swapping all zero pages out just to optimize the in-memory
> > >> representation on seems rather backwards.
> > >
> > > I’m having trouble understanding your point—it seems like you might
> > > not have fully read the code. :-)
> > >
> > > The situation is as follows: for a zero-filled page, we are currently
> > > allocating a new
> > > page unconditionally. By mapping this zero-filled page to zero_pfn, we could
> > > save the memory used by this page.
> > >
> > > We don't need to allocate the memory until the page is written(which may never
> > > happen).
> >
> > I think what Christoph means is that you would determine that at PTE
> > unmap time, and directly place the zero page in there. So there would be
> > no need to have the page fault at all.
> >
> > I suspect at PTE unmap time might be problematic, because we might still
> > have other (i.e., GUP) references modifying that page, and we can only
> > rely on the page content being stable after we flushed the TLB as well.
> > (I recall some deferred flushing optimizations)
> 
> Yes, we need to follow a strict sequence:
> 
> 1. try_to_unmap - unmap PTEs in all processes;
> 2. try_to_unmap_flush_dirty - flush deferred TLB shootdown;
> 3. pageout - zeromap will set 1 in bitmap if page is zero-filled
> 
> At the moment of pageout(), we can be confident that the page is zero-filled.
> 
> mapping to zeropage during unmap seems quite risky.

You have to unmap and flush to stop modifications, but I think not in
all processes before it's safe to decide. Shared anon pages have COW
semantics; when you enter try_to_unmap() with a page and rmap gives
you a pte, it's one of these:

  a) never forked, no sibling ptes
  b) cow broken into private copy, no sibling ptes
  c) cow/WP; any writes to this or another pte will go to a new page.

In cases a and b you need to unmap and flush the current pte, but then
it's safe to check contents and set the zero pte right away, even
before finishing the rmap walk.

In case c, modifications to the page are impossible due to WP, so you
don't even need to unmap and flush before checking the contents. The
pte lock holds up COW breaking to a new page until you're done.

It's definitely more complicated than the current implementation, but
if it can be made to work, we could get rid of the bitmap.

You might also reduce faults, but I'm a bit skeptical. Presumably
zerofilled regions are mostly considered invalid by the application,
not useful data, so a populating write that will cowbreak seems more
likely to happen next than a faultless read from the zeropage.