On Mon, 12 Apr 2021 14:55:14 +0100
Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:

[...]

> I was only thinking about the page cache case ...
>
> 		access_ret = arch_make_page_accessible(page);
> 		/*
> 		 * If writeback has been triggered on a page that cannot be
> 		 * made accessible, it is too late to recover here.
> 		 */
> 		VM_BUG_ON_PAGE(access_ret != 0, page);
>
> ... where it seems all pages _can_ be made accessible.

yes, for that case it is straightforward

> > also, I assume you keep the semantic difference between get_page and
> > pin_page? that's also very important for us
>
> I haven't changed anything in gup.c yet.  Just trying to get the page
> cache to suck less right now.

fair enough :)

> > > So what you're saying is that the host might allocate, eg a 1GB
> > > folio for a guest, then the guest splits that up into smaller
> > > chunks (eg 1MB), and would only want one of those small chunks
> > > accessible to the hypervisor?
> >
> > qemu will allocate a big chunk of memory, and I/O would happen only
> > on small chunks (depending on what the guest does). I don't know
> > how swap and pagecache would behave in the folio scenario.
> >
> > Also consider that currently we need 4k hardware pages for protected
> > guests (so folios would be ok, as long as they are backed by small
> > pages)
> >
> > How and when are folios actually created?
> >
> > is there a way to prevent creation of multi-page folios?
>
> Today there's no way to create multi-page folios because I haven't
> submitted the patch to add alloc_folio() and friends:
>
> https://git.infradead.org/users/willy/pagecache.git/commitdiff/4fe26f7a28ffdc850cd016cdaaa74974c59c5f53
>
> We do have a way to allocate compound pages and add them to the page
> cache, but that's only in use by tmpfs/shmem.
>
> What will happen is that (for filesystems which support multi-page
> folios), they'll be allocated by the page cache.  I expect other
> places will start to use folios after that (eg anonymous memory), but
> I don't know where all those places will be.  I hope not to be
> involved in that!
>
> The general principle, though, is that the overhead of tracking
> memory in page-sized units is too high, and we need to use larger
> units by default.  There are occasions when we need to do things to
> memory in smaller units, and for those, we can choose to either
> handle sub-folio things, or we can split a folio apart into smaller
> folios.
>
> > > > a possible approach maybe would be to keep the _page variant,
> > > > and add a _folio wrapper around it
> > >
> > > Yes, we can do that.  It's what I'm currently doing for
> > > flush_dcache_folio().
> >
> > where would the page flags be stored? as I said, we really depend on
> > that bit being set correctly to prevent potentially disruptive I/O
> > errors. It's ok if the bit overindicates protection (non-protected
> > pages can be marked as protected), but protected pages must at all
> > times have the bit set.
> >
> > the reason why this hook exists at all is to prevent secure pages
> > from being accidentally (or maliciously) fed into I/O
>
> You can still use PG_arch_1 on the sub-pages of a folio.  It's one of
> the things you'll have to decide, actually.  Does setting PG_arch_1 on
> the head page of the folio indicate that the entire folio is
> accessible, or just that the head page is accessible?  Different page
> flags have made different decisions here.

ok then, I think the simplest and safest thing to do right now is to
keep the flag on each page
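something along these lines, just as a rough sketch —
arch_make_folio_accessible() is a name I'm making up here, and I'm
assuming the folio_nr_pages()/folio_page() helpers from your series
(or equivalents) for walking the sub-pages:

	/*
	 * Hypothetical _folio wrapper around the existing per-page
	 * hook.  arch_make_page_accessible() keeps working on 4k
	 * pages, so PG_arch_1 stays per-page, as we need on s390.
	 */
	static inline int arch_make_folio_accessible(struct folio *folio)
	{
		long i, nr = folio_nr_pages(folio);
		int ret;

		for (i = 0; i < nr; i++) {
			ret = arch_make_page_accessible(folio_page(folio, i));
			if (ret)
				return ret;
		}

		return 0;
	}

that way the flag stays on each 4k page, and a failure stops the loop
before any inaccessible page can be fed into I/O.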
in short:

* pagecache -> you can put a loop or introduce a _folio wrapper for
  arch_make_page_accessible, as in the sketch above
* gup.c -> won't be touched for now, but when the time comes, the
  PG_arch_1 bit should be set for each page
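on the pagecache side, the writeback path would then keep its current
semantics, just applied to every sub-page of the folio — again only a
sketch, assuming a VM_BUG_ON_FOLIO() counterpart of VM_BUG_ON_PAGE()
exists by then:

		access_ret = arch_make_folio_accessible(folio);
		/*
		 * If writeback has been triggered on a folio that cannot be
		 * made accessible, it is too late to recover here.
		 */
		VM_BUG_ON_FOLIO(access_ret != 0, folio);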