Re: Inaccessible pages & folios

On Mon, Apr 12, 2021 at 03:37:18PM +0200, Claudio Imbrenda wrote:
> On Mon, 12 Apr 2021 13:43:41 +0100
> Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> 
> [...]
> 
> > > For the rename case, how would you handle gup.c?  
> > 
> > At first, I'd turn it into
> > arch_make_folio_accessible(page_folio(page));
> 
> That's tricky. What happens if some pages in the folio cannot be made
> accessible?

I was only thinking about the page cache case ...

        access_ret = arch_make_page_accessible(page);
        /*
         * If writeback has been triggered on a page that cannot be made
         * accessible, it is too late to recover here.
         */
        VM_BUG_ON_PAGE(access_ret != 0, page);

... where it seems all pages _can_ be made accessible.
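A folio-level version of that check could stay just as strict.  A minimal
sketch, assuming the arch_make_folio_accessible() name from above and a
VM_BUG_ON_FOLIO() counterpart (neither exists in the tree today):

        access_ret = arch_make_folio_accessible(folio);
        /*
         * If writeback has been triggered on a folio that cannot be made
         * accessible, it is too late to recover here.
         */
        VM_BUG_ON_FOLIO(access_ret != 0, folio);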

> Also, I assume you keep the semantic difference between get_page and
> pin_page? That's also very important for us.

I haven't changed anything in gup.c yet.  Just trying to get the page
cache to suck less right now.

> > So what you're saying is that the host might allocate, e.g., a 1GB folio
> > for a guest, then the guest splits that up into smaller chunks (e.g.
> > 1MB), and would only want one of those small chunks accessible to the
> > hypervisor?
> 
> QEMU will allocate a big chunk of memory, and I/O would happen only on
> small chunks (depending on what the guest does). I don't know how swap
> and the page cache would behave in the folio scenario.
> 
> Also consider that currently we need 4k hardware pages for protected
> guests (so folios would be ok, as long as they are backed by small
> pages)
> 
> How and when are folios actually created?
> 
> Is there a way to prevent the creation of multi-page folios?

Today there's no way to create multi-page folios because I haven't submitted
the patch to add alloc_folio() and friends:

https://git.infradead.org/users/willy/pagecache.git/commitdiff/4fe26f7a28ffdc850cd016cdaaa74974c59c5f53

We do have a way to allocate compound pages and add them to the page cache,
but that's only in use by tmpfs/shmem.

What will happen is that, for filesystems which support multi-page folios,
they'll be allocated by the page cache.  I expect other places will start
to use folios after that (e.g. anonymous memory), but I don't know where
all those places will be.  I hope not to be involved in that!

The general principle, though, is that the overhead of tracking memory in
page-sized units is too high, and we need to use larger units by default.
There are occasions when we need to do things to memory in smaller units,
and for those we can either handle them at sub-folio granularity or
split the folio apart into smaller folios.

> > > a possible approach maybe would be to keep the _page variant, and
> > > add a _folio wrapper around it  
> > 
> > Yes, we can do that.  It's what I'm currently doing for
> > flush_dcache_folio().
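
For the accessibility hook, such a wrapper could simply walk the sub-pages.
A rough sketch, assuming the folio_nr_pages()/folio_page() helpers from the
folio series; the arch_make_folio_accessible() name is only for illustration:

        static inline int arch_make_folio_accessible(struct folio *folio)
        {
                long i, nr = folio_nr_pages(folio);
                int ret;

                /* Make every sub-page accessible; stop at the first failure. */
                for (i = 0; i < nr; i++) {
                        ret = arch_make_page_accessible(folio_page(folio, i));
                        if (ret)
                                return ret;
                }
                return 0;
        }

With something like that, one failing sub-page fails the whole call, which
is one possible answer to your earlier question about folios that cannot be
made fully accessible.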
> 
> Where would the page flags be stored? As I said, we really depend on
> that bit being set correctly to prevent potentially disruptive I/O
> errors. It's OK if the bit over-indicates protection (non-protected
> pages can be marked as protected), but protected pages must at all
> times have the bit set.
> 
> The reason why this hook exists at all is to prevent secure pages from
> being accidentally (or maliciously) fed into I/O.

You can still use PG_arch_1 on the sub-pages of a folio.  It's one of the
things you'll have to decide, actually.  Does setting PG_arch_1 on
the head page of the folio indicate that the entire folio is accessible,
or just that the head page is accessible?  Different page flags have
made different decisions here.
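
To illustrate the two conventions with the raw bit operations s390 already
uses for PG_arch_1 (the folio accessors are only for the sketch):

        /* Convention A: each sub-page's flag covers only that sub-page. */
        accessible = test_bit(PG_arch_1, &folio_page(folio, i)->flags);

        /* Convention B: the head page's flag covers the whole folio. */
        accessible = test_bit(PG_arch_1, &folio_page(folio, 0)->flags);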



