On Tue, Nov 12, 2024 at 03:22:46PM +0100, David Hildenbrand wrote: > On 12.11.24 14:53, Jason Gunthorpe wrote: > > On Tue, Nov 12, 2024 at 10:10:06AM +0100, David Hildenbrand wrote: > > > On 12.11.24 06:26, Matthew Wilcox wrote: > > > > I don't want you to respin. I think this is a bad idea. > > > > > > I'm hoping you'll find some more time to explain what exactly you don't > > > like, because this series only refactors what we already have. > > > > > > I enjoy seeing the special casing (especially hugetlb) gone from mm/swap.c. I don't. The list of 'if's is better than the indirect function call. That's terribly expensive, and the way we reuse the lru.next field is fragile. Not to mention that it introduces a new thing for the hardening people to fret over. > > And, IMHO, seems like overkill. We have only a handful of cases - > > maybe we shouldn't be trying to get to full generality but just handle > > a couple of cases directly? I don't really think it is such a bad > > thing to have an if ladder on the free path if we have only a couple > > things. Certainly it looks good instead of doing overlaying tricks. > > I'd really like to abstract hugetlb handling if possible. The way it stands > it's just very odd. There might be ways to make that better. I haven't really been looking too hard at making that special handling go away. > We'll need some reliable way to identify these folios that need care. > guest_memfd will be using folio->mapcount for now, so for now we couldn't > set a page type like hugetlb does. If hugetlb can set lru.next at a certain point, then guestmemfd could set a page type at a similar point, no? > > Also how does this translate to Matthew's memdesc world? In a memdesc world, pages no longer have a refcount. We might still have put_page() which will now be a very complicated (and out-of-line) function that looks up what kind of memdesc it is and operates on the memdesc's refcount ... if it has one. I don't know if it'll be exported to modules; I can see uses in the mm code, but I'm not sure if modules will have a need. Each memdesc type will have its own function to call to free the memdesc. So we'll still have folio_put(). But slab does not have, need nor want a refcount, so it'll just slab_free(). I expect us to keep around a list of recently-freed memdescs of a particular type with their pages still attached so that we can allocate them again quickly (or reclaim them under memory pressure). Once that freelist overflows, we'll free a batch of them to the buddy allocator (for the pages) and the slab allocator (for the memdesc itself). > guest_memfd and hugetlb would be operating on folios (at least for now), > which contain the refcount,lru,private, ... so nothing special there. > > Once we actually decoupled "struct folio" from "struct page", we *might* > have to play less tricks, because we could just have a callback pointer > there. But well, maybe we also want to save some space in there. > > Do we want dedicated memdescs for hugetlb/guest_memfd that extend folios in > the future? I don't know, maybe. I've certainly considered going so far as a per-fs folio. So we'd have an ext4_folio, an btrfs_folio, an iomap_folio, etc. That'd let us get rid of folio->private, but I'm not sure that C's type system can really handle this nicely. Maybe in a Rust world ;-) What I'm thinking about is that I'd really like to be able to declare that all the functions in ext4_aops only accept pointers to ext4_folio, so ext4_dirty_folio() can't be called with pointers to _any_ folio, but specifically folios which were previously allocated for ext4. I don't know if Rust lets you do something like that. > I'm currently wondering if we can use folio->private for the time being. > Either > > (a) If folio->private is still set once the refcount drops to 0, it > indicates that there is a freeing callback/owner_ops. We'll have to make > hugetlb not use folio->private and convert others to clear folio->private > before freeing. > > (b) Use bitX of folio->private to indicate that this has "owner_ops" > meaning. We'll have to make hugetlb not use folio->private and make others > not use bitX. Might be harder and overkill, because right now we only really > need the callback when refcount==0. > > (c) Use some other indication that folio->private contains folio_ops. I really don't want to use folio_ops / folio_owner_ops. I read https://lore.kernel.org/all/CAGtprH_JP2w-4rq02h_Ugvq5KuHX7TUvegOS7xUs_iy5hriE7g@xxxxxxxxxxxxxx/ and I still don't understand what you're trying to do. Would it work to use aops->free_folio() to notify you when the folio is being removed from the address space?