Re: pageless memory & zsmalloc

On Tue, Oct 05, 2021 at 06:51:32PM +0100, Matthew Wilcox wrote:
> We're trying to tidy up the mess in struct page, and as part of removing
> slab from struct page, zsmalloc came on my radar because it's using some
> of slab's fields.  The eventual endgame is to get struct page down to a
> single word which points to the "memory descriptor" (ie the current
> zspage).
> 
> zsmalloc, like vmalloc, allocates order-0 pages.  Unlike vmalloc,
> zsmalloc allows compaction.  Currently (from the file):
> 
>  * Usage of struct page fields:
>  *      page->private: points to zspage
>  *      page->freelist(index): links together all component pages of a zspage
>  *              For the huge page, this is always 0, so we use this field
>  *              to store handle.
>  *      page->units: first object offset in a subpage of zspage
>  *
>  * Usage of struct page flags:
>  *      PG_private: identifies the first component page
>  *      PG_owner_priv_1: identifies the huge component page
> 
> This isn't quite everything.  For compaction, zsmalloc also uses
> page->mapping (set in __SetPageMovable()), PG_lock (to sync with
> compaction) and page->_refcount (compaction gets a refcount on the page).
> 
> Since zsmalloc is so well-contained, I propose we completely stop
> using struct page in it, as we intend to do for the rest of the users
> of struct page.  That is, the _only_ element of struct page we use is
> compound_head and it points to struct zspage.
> 
> That means every single page allocated by zsmalloc is PageTail().  Also it
> means that when isolate_movable_page() calls trylock_page(), it redirects
> to the zspage.  That means struct zspage must now have page flags as its
> first element.  Also, zspage->_refcount, and zspage->mapping must match
> their locations in struct page.  That's something that we'll get cleaned
> up eventually, but for now, we're relying on offsetof() assertions.
> 
> The good news is that trylock_zspage() no longer needs to walk the
> list of pages, calling trylock_page() on each of them.
> 
> Anyway, is there a good test suite for zsmalloc()?  Particularly something
> that would exercise its interactions with compaction / migration?
> I don't have any code written yet.

This is some deep hackery.

So to restate - making sure I understand correctly - the reason for doing it
this way is that the compaction code calls lock_page(): by storing the pointer
to the zspage in compound_head (instead of page->private), compound_head()
returns the pointer to the zspage, and since lock_page() calls compound_head(),
the compaction code, when it calls lock_page(), will actually be taking the
lock in struct zspage.

So on the one hand, getting struct page down to two words means that we're going
to be moving the page lock bit into those external structs (maybe we _should_ be
doing some kind of inheritance thing for the allocatee interface?) - so it is
cool that that all lines up.

But long term, we're going to need two words in struct page, not just one: we
need to store the order of the allocation as well as a type-tagged pointer, and
I don't think we can realistically cram compound_order into the same word as the
type-tagged pointer. Plus, transparently using slab for larger allocations -
which today use the buddy allocator interface - means slab can't be using the
allocatee pointer field.

And in my mind, compound_head is allocator state, not allocatee state, and it's
always been used for getting to the head page - which this is not - so using
it this way, as slick as it is... eh, not sure that's quite what we want to do.

Q: should lock_page() be calling compound_head() at all? If the goal of the type
system stuff that we're doing is to move the compound_head() calls up to where
we're dealing with tail pages, and never again after that, then that call in
lock_page() should go - and we'll need to figure out something different for the
migration code to take the lock in the allocatee-private structure.

I've been kind of feeling that since folios are furthest along, separately
allocating them and figuring out how to make that all work might be the logical
next step.

