Re: State of the Page (August 2022)

On Fri, Aug 12, 2022 at 01:16:39PM +0300, Kirill A. Shutemov wrote:
> On Thu, Aug 11, 2022 at 10:31:21PM +0100, Matthew Wilcox wrote:
> > ==============================
> > State Of The Page, August 2022
> > ==============================
> > 
> > I thought I'd write down where we are with struct page and where
> > we're going, just to make sure we're all (still?) pulling in a similar
> > direction.
> > 
> > Destination
> > ===========
> > 
> > For some users, the size of struct page is simply too large.  At 64
> > bytes per 4KiB page, memmap occupies 1.6% of memory.  If we can get
> > struct page down to an 8 byte tagged pointer, it will be 0.2% of memory,
> > which is an acceptable overhead.
> 
> Right. This is attractive. But it brings the cost of an extra indirection.

It does, but it also crams 8 pages into a single cacheline instead of
occupying one cacheline per page.
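
To make the destination concrete, here is a minimal sketch of what an
8-byte struct page could look like.  The names are hypothetical, not a
settled API; the only assumption is the one made later in the original
mail, namely that memdescs are 16-byte aligned, leaving the low 4 bits
of the pointer free for a type tag:

enum memdesc_type {
	MEMDESC_FOLIO	= 1,
	MEMDESC_SLAB	= 2,
	/* ... up to 16 types fit in the 4 tag bits */
};

struct page {
	unsigned long memdesc;	/* 16-byte-aligned pointer | 4-bit tag */
};

#define MEMDESC_TAG_MASK	0xfUL

static inline enum memdesc_type page_memdesc_type(const struct page *page)
{
	return page->memdesc & MEMDESC_TAG_MASK;
}

static inline void *page_memdesc(const struct page *page)
{
	return (void *)(page->memdesc & ~MEMDESC_TAG_MASK);
}

A physical scan could classify pages from the tag alone, without
chasing the pointer; that is relevant to the scanning cost discussed
below.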

> It can be especially painful for physical memory scanning. I guess we can
> derive some info from the memdesc type itself, like whether it is movable.
> But it still looks like an expensive change.

I just don't think of physical memory scanning as something we do
often, or in a performance-sensitive path.  I'm OK with slowing down
kcompactd if it makes walking the LRU list faster.

> Do you have any estimate of how much CPU time we will pay to reduce
> memory (and cache) overhead? RAM sizes tend to grow faster than IPC.
> We need to make sure this is the right direction.

I don't.  I've heard colourful metaphors from the hyperscale crowd about
how many more VMs they could sell, usually in terms of putting pallets
of money in the parking lot and setting them on fire.  But IPC isn't the
right metric either; CPU performance is all about cache misses these days.

> > That implies 4 bits needed for the tag, so all memdesc allocations
> > must be 16-byte aligned.  That is not an undue burden.  Memdescs
> > must also be TYPESAFE_BY_RCU if they are mappable to userspace or
> > can be stored in a file's address_space.
> > 
> > It may be worth distinguishing between vmalloc-mappable and
> > vmalloc-unmappable to prevent some things being mapped to userspace
> > inadvertently.
> 
> Given that memdesc represents Slab too, how do we allocate them?

First, we separate out allocating pages from allocating their memdesc.  ie:

struct folio *folio_alloc(u8 order, gfp_t gfp)
{
	/* Allocate the memdesc first, from its own slab cache ... */
	struct folio *folio = kmem_cache_alloc(folio_cache, gfp);

	if (!folio)
		return NULL;
	/* ... then allocate the pages and point them at the memdesc. */
	if (page_alloc_desc(order, folio, gfp))
		return folio;
	kmem_cache_free(folio_cache, folio);
	return NULL;
}

That can't work for slab because we might recurse for ever.  So we
have to do it backwards:

struct slab *slab_alloc(size_t size, u8 order, gfp_t gfp)
{
	struct slab *slab;
	/* Allocate the pages first; the memdesc comes afterwards. */
	struct page *page = page_alloc(order, gfp);

	if (!page)
		return NULL;
	if (sizeof(*slab) == size) {
		/*
		 * We're allocating slab_cache's own slab: embed the
		 * struct slab as the first object of its own pages,
		 * which is what stops the recursion.
		 */
		slab = page_address(page);
		slab_init(slab, 1);	/* one object (itself) in use */
	} else {
		slab = kmem_cache_alloc(slab_cache, gfp);
		if (!slab) {
			page_free(page, order);
			return NULL;
		}
	}
	page_set_memdesc(page, order, slab);
	return slab;
}

So there is mutual recursion between kmem_cache_alloc() and
slab_alloc(), but it stops after one round.  (obviously this is
just a sketch of a solution)

folio_alloc()
  kmem_cache_alloc(folio)
    page_alloc(folio)
      kmem_cache_alloc(slab)
        page_alloc(slab)
  page_alloc_desc() 

Slab then has to be taught that a slab with a single object allocated
(ie itself) is actually free and can be released back to the pool,
but that seems like a SMOP.
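
As a rough sketch of that SMOP (the helper names here are hypothetical;
the only assumed state is an object-in-use count along the lines of
today's slab->inuse):

/*
 * On free, a self-hosted slab (one whose struct slab is embedded as
 * the first object of its own pages, as in slab_alloc() above) is
 * effectively empty once its own memdesc is the only object in use.
 */
void slab_free(struct slab *slab, void *object)
{
	__slab_return_object(slab, object);	/* hypothetical freelist op */

	if (slab->inuse == 1 && slab_is_self_hosted(slab))
		page_free(slab_to_page(slab), slab_order(slab));
}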

> > Mappable
> > --------
> > 
> > All pages mapped to userspace must have:
> > 
> >  - A refcount
> >  - A mapcount
> > 
> > Preferably in the same place in the memdesc so we can handle them without
> > having separate cases for each type of memdesc.  It would be nice to have
> > a pincount as well, but that's already an optional feature.
> > 
> > I propose:
> > 
> >    struct mappable {
> >        unsigned long flags;	/* contains dirty flag */
> >        atomic_t _refcount;
> >        atomic_t _mapcount;
> >    };
> > 
> >    struct folio {
> >       union {
> >          unsigned long flags;
> >          struct mappable m;
> >       };
> >       ...
> >    };
> 
> Hm. How would a lockless page cache lookup look in this case?
> 
> Currently it relies on get_page_unless_zero(), and to keep that working
> there has to be a guarantee that nothing else is allocated where the
> mappable memdesc was before. Would it require some RCU tricks on memdesc
> free?

An earlier paragraph has:

> > That implies 4 bits needed for the tag, so all memdesc allocations
> > must be 16-byte aligned.  That is not an undue burden.  Memdescs
> > must also be TYPESAFE_BY_RCU if they are mappable to userspace or
> > can be stored in a file's address_space.

so yes, I agree, we need this RCU trick to make sure the memdesc remains a
memdesc of the right type, even if it's no longer attached to the right
chunk of memory.
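
For reference, the lookup side would keep roughly the shape it has in
today's filemap code.  A sketch, assuming folio memdescs come from a
SLAB_TYPESAFE_BY_RCU cache and ignoring shadow/value entries:

struct folio *lookup(struct address_space *mapping, pgoff_t index)
{
	XA_STATE(xas, &mapping->i_pages, index);
	struct folio *folio;

	rcu_read_lock();
repeat:
	xas_reset(&xas);
	folio = xas_load(&xas);
	if (!folio)
		goto out;
	/* TYPESAFE_BY_RCU: even if this folio was freed, it is still
	 * a folio, so the refcount is valid memory to operate on. */
	if (!folio_try_get(folio))
		goto repeat;
	/* Recheck: the memdesc may have been reused for another file
	 * or offset between the load and the refcount bump. */
	if (unlikely(folio != xas_reload(&xas))) {
		folio_put(folio);
		goto repeat;
	}
out:
	rcu_read_unlock();
	return folio;
}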


