On 13.11.24 05:57, Matthew Wilcox wrote:
On Tue, Nov 12, 2024 at 03:22:46PM +0100, David Hildenbrand wrote:
On 12.11.24 14:53, Jason Gunthorpe wrote:
On Tue, Nov 12, 2024 at 10:10:06AM +0100, David Hildenbrand wrote:
On 12.11.24 06:26, Matthew Wilcox wrote:
I don't want you to respin. I think this is a bad idea.
I'm hoping you'll find some more time to explain what exactly you don't
like, because this series only refactors what we already have.
I enjoy seeing the special casing (especially hugetlb) gone from mm/swap.c.
I don't. The list of 'if's is better than the indirect function call.
That's terribly expensive, and the way we reuse the lru.next field
is fragile. Not to mention that it introduces a new thing for the
hardening people to fret over.
Right, indirect calls are nasty and probably more fragile/insecure, but this is
really what ZONE_DEVICE is already using internally ... but I agree that is
less desirable to abstract that.
[...]
We'll need some reliable way to identify these folios that need care.
guest_memfd will be using folio->mapcount for now, so for now we couldn't
set a page type like hugetlb does.
If hugetlb can set lru.next at a certain point, then guestmemfd could
set a page type at a similar point, no?
The main problem is that we will have to set it on small folios that can be
mapped to user space. As long as mapcount overlays page_type, that's
... not going to work.
We won't be truncating+freeing the folio as long as it is mapped to user space,
though. So one workaround would be to set the page type only as long as it
isn't mapped to user space.
Won't win a beauty price, but could be one workaround until we decoupled
the type from the mapcount.
[...]
guest_memfd and hugetlb would be operating on folios (at least for now),
which contain the refcount,lru,private, ... so nothing special there.
Once we actually decoupled "struct folio" from "struct page", we *might*
have to play less tricks, because we could just have a callback pointer
there. But well, maybe we also want to save some space in there.
Do we want dedicated memdescs for hugetlb/guest_memfd that extend folios in
the future? I don't know, maybe.
I've certainly considered going so far as a per-fs folio. So we'd
have an ext4_folio, an btrfs_folio, an iomap_folio, etc. That'd let us
get rid of folio->private, but I'm not sure that C's type system can
really handle this nicely. Maybe in a Rust world ;-)
:)
What I'm thinking about is that I'd really like to be able to declare
that all the functions in ext4_aops only accept pointers to ext4_folio,
so ext4_dirty_folio() can't be called with pointers to _any_ folio,
but specifically folios which were previously allocated for ext4.
Yes, that sounds reasonable. hugetlb definetly might be another such candidate.
I don't know if Rust lets you do something like that.
I'm currently wondering if we can use folio->private for the time being.
Either
(a) If folio->private is still set once the refcount drops to 0, it
indicates that there is a freeing callback/owner_ops. We'll have to make
hugetlb not use folio->private and convert others to clear folio->private
before freeing.
(b) Use bitX of folio->private to indicate that this has "owner_ops"
meaning. We'll have to make hugetlb not use folio->private and make others
not use bitX. Might be harder and overkill, because right now we only really
need the callback when refcount==0.
(c) Use some other indication that folio->private contains folio_ops.
I really don't want to use folio_ops / folio_owner_ops.
Yes, and I understand your reasoning. It was one approach to work around
the page_type vs. mapcount issue and avoiding more checks on the freeing path.
If we manage to make the page type fly, the following could work and leave
the ordinary folio freeing path as fast as before (and allow me to add the
PGTY_offline handling :) ):
From 5a55e4bcf4d6cfa64d3383a7cf6649042cedcecb Mon Sep 17 00:00:00 2001
From: David Hildenbrand <david@xxxxxxxxxx>
Date: Wed, 13 Nov 2024 12:20:58 +0100
Subject: [PATCH] tmp
Signed-off-by: David Hildenbrand <david@xxxxxxxxxx>
---
include/linux/page-flags.h | 11 +++++++++++
mm/swap.c | 27 ++++++++++++++++++++++-----
2 files changed, 33 insertions(+), 5 deletions(-)
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index e80665bc51fac..ebf89075eeb5f 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -950,6 +950,7 @@ enum pagetype {
PGTY_slab = 0xf5,
PGTY_zsmalloc = 0xf6,
PGTY_unaccepted = 0xf7,
+ PGTY_guestmem = 0xf8,
PGTY_mapcount_underflow = 0xff
};
@@ -970,6 +971,16 @@ static inline bool page_has_type(const struct page *page)
return page_mapcount_is_type(data_race(page->page_type));
}
+static inline bool folio_has_type(const struct folio *folio)
+{
+ return page_has_type(&folio->page);
+}
+
+static inline int folio_get_type(const struct folio *folio)
+{
+ return folio->page.page_type >> 24;
+}
+
#define FOLIO_TYPE_OPS(lname, fname) \
static __always_inline bool folio_test_##fname(const struct folio *folio) \
{ \
diff --git a/mm/swap.c b/mm/swap.c
index 10decd9dffa17..bf4efc7bba18a 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -94,6 +94,22 @@ static void page_cache_release(struct folio *folio)
unlock_page_lruvec_irqrestore(lruvec, flags);
}
+static void free_typed_folio(struct folio *folio)
+{
+ switch (folio_get_type(folio)) {
+ case PGTY_hugetlb:
+ free_huge_folio(folio);
+ return;
+ case PGTY_offline:
+ /* Nothing to do, it's offline. */
+ return;
+ case PGTY_guestmem:
+ // free_guestmem_folio(folio);
+ default:
+ WARN_ON_ONCE(1);
+ }
+}
+
void __folio_put(struct folio *folio)
{
if (unlikely(folio_is_zone_device(folio))) {
@@ -101,8 +117,8 @@ void __folio_put(struct folio *folio)
return;
}
- if (folio_test_hugetlb(folio)) {
- free_huge_folio(folio);
+ if (unlikely(folio_has_type(folio))) {
+ free_typed_folio(folio);
return;
}
@@ -934,15 +950,16 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
if (!folio_ref_sub_and_test(folio, nr_refs))
continue;
- /* hugetlb has its own memcg */
- if (folio_test_hugetlb(folio)) {
+ if (unlikely(folio_has_type(folio))) {
+ /* typed folios have their own memcg, if any */
if (lruvec) {
unlock_page_lruvec_irqrestore(lruvec, flags);
lruvec = NULL;
}
- free_huge_folio(folio);
+ free_typed_folio(folio);
continue;
}
+
folio_unqueue_deferred_split(folio);
__page_cache_release(folio, &lruvec, &flags);
--
2.47.0
I read
https://lore.kernel.org/all/CAGtprH_JP2w-4rq02h_Ugvq5KuHX7TUvegOS7xUs_iy5hriE7g@xxxxxxxxxxxxxx/
and I still don't understand what you're trying to do.
It's a bunch of related issues (e.g., one idea is also having another "global" pool
of large pages that are not owned by hugetlb), but the biggest thing we want to sort
out is the following:
We allocated a large folio from somewhere (hugetlb, another global pool). We might
have to split this folio internally in some cases (for example, because we need
precise per-page refcounts/mapcounts) and we:
(a) Really have to return the whole re-collapsed folio to the original allocator.
(b) Might have to clean up some stuff (e.g., restore original page type) before handing
the folio back to the original allocator.
(c) Want to re-collapse (parts of?) the folio when we don't need the split anymore.
Re-collapsing is one problem because of possible GUP pins (for parts that were
exposed to user space) and speculative references.
As an example during (a): we might have truncated the whole split thing, but some
reference on a split folio prevents us from re-collapsing the whole thing
synchronously during truncation. As soon as that last reference is put,
we can collapse the thing and return it to the original allocator.
Similar during (c), just that we want to "freeze" all refcounts as the last
references are put, so we can just collapse it once we are notified that
it is now possible -- and that no other speculative references can show up as
the refcount is 0.
Would it work to use aops->free_folio() to notify you when the folio is
being removed from the address space?
We really have to know when the last reference is gone, and intercept that
to cleanup and prepare the large folio to go back to its original allocator.
As always, thanks for taking your time and the detailed response Willy!
--
Cheers,
David / dhildenb