Hi Willy,

On Mon, Mar 29, 2021 at 05:58:32PM +0100, Matthew Wilcox wrote:
> I'm going to respond to some points in detail below, but there are a
> couple of overarching themes that I want to bring out up here.
>
> Grand Vision
> ~~~~~~~~~~~~
>
> I haven't outlined my long-term plan. Partly because it is a _very_
> long way off, and partly because I think what I'm doing stands on its
> own. But some of the points below bear on this, so I'll do it now.
>
> Eventually, I want to make struct page optional for allocations. It's too
> small for some things (allocating page tables, for example), and overly
> large for others (allocating a 2MB page, networking page_pool). I don't
> want to change its size in the meantime; having a struct page refer to
> PAGE_SIZE bytes is something that's quite deeply baked in.

Right, I think it's overloaded and it needs to go away from many
contexts it's used in today.

I think it describes a real physical thing, though, and won't go away
as a concept. More on that below.

> In broad strokes, I think that having a Power Of Two Allocator
> with Descriptor (POTAD) is a useful foundational allocator to have.
> The specific allocator that we call the buddy allocator is very clever for
> the 1990s, but touches too many cachelines to be good with today's CPUs.
> The generalisation of the buddy allocator to the POTAD lets us allocate
> smaller quantities (eg a 512 byte block) and allocate descriptors which
> differ in size from a struct page. For an extreme example, see xfs_buf
> which is 360 bytes and is the descriptor for an allocation between 512
> and 65536 bytes.

I actually disagree with this rather strongly. If anything, the buddy
allocator has turned out to be a pretty poor fit for the foundational
allocator.

On paper, it is elegant and versatile in serving essentially arbitrary
memory blocks. In practice, we mostly just need 4k and 2M chunks from
it.
And it sucks at the 2M ones because of the fragmentation caused by
the ungrouped 4k blocks.

The great thing about the slab allocator isn't just that it manages
internal fragmentation of the larger underlying blocks. It also groups
related objects by lifetime/age and reclaimability, which dramatically
mitigates the external fragmentation of the memory space.

The buddy allocator on the other hand has no idea what you want that
4k block for, and whether it pairs up well with the 4k block it just
handed to somebody else. But the decision it makes in that moment is
crucial for its ability to serve larger blocks later on.

We do some mobility grouping based on how reclaimable or migratable
the memory is, but it's not the full answer.

A variable size allocator without object type grouping will always
have difficulties producing anything but the smallest block size after
some uptime. It's inherently flawed that way.

What HAS proven itself is having the base block size correspond to a
reasonable transaction unit for paging and page reclaim, then fill in
smaller ranges with lifetime-aware slabbing, larger ranges with
vmalloc and SG schemes, and absurdly large requests with CMA.

We might be stuck with serving order-1, order-2 etc. for a little
while longer for the few users who can't go to kvmalloc(), but IMO
it's the wrong direction to expand into.

Optimally the foundational allocator would just do one block size.

> There are times when we need to get from the physical address to
> the descriptor, eg memory-failure.c or get_user_pages(). This is the
> equivalent of phys_to_page(), and it's going to have to be a lookup tree.
> I think this is a role for the Maple Tree, but it's not ready yet.
> I don't know if it'll be fast enough for this case. There's also the
> need (particularly for memory-failure) to determine exactly what kind
> of descriptor we're dealing with, and also its size. Even its owner,
> so we can notify them of memory failure.

A tree could be more memory efficient in the long term, but for
starters a 2M page could have a

	struct smallpage *smallpages[512];

member that points to any allocated/mapped 4k descriptors. The page
table level would tell you what you're looking at: a pmd is simple, a
pte would map to a 4k pfn, whose upper bits identify a struct page
then a page flag would tell you whether we have a pte-mapped 2M page
or whether the lower pfn bits identify an offset in smallpages[].

It's one pointer for every 4k of RAM, which is a bit dumb, but not as
dumb as having an entire struct page for each of those ;)

> What I haven't touched on anywhere in this, is whether a folio is the
> descriptor for all POTA or whether it's specifically the page cache
> descriptor. I like the idea of having separate descriptors for objects
> in the page cache from anonymous or other allocations. But I'm not very
> familiar with the rmap code, and that wants to do things like manipulate
> the refcount on a descriptor without knowing whether it's a file or
> anon page. Or neither (eg device driver memory mapped to userspace.
> Or vmalloc memory mapped to userspace. Or ...)

The rmap code is all about the page type specifics, but once you get
into mmap, page reclaim, page migration, we're dealing with fully
fungible blocks of memory.

I do like the idea of using actual language typing for the different
things struct page can be today (fs page), but with a common type to
manage the fungible block of memory backing it (allocation state, LRU
& aging state, mmap state etc.)

New types for the former are an easier sell. We all agree that there
are too many details of the page - including the compound page
implementation detail - inside the cache library, fs code and drivers.

It's a slightly tougher sell to say that the core VM code itself
(outside the cache library) needs a tighter abstraction for the struct
page building block and the compound page structure.
At least at this time while we're still sorting out how it all may
work down the line.

Certainly, we need something to describe fungible memory blocks:
either a struct page that can be 4k and 2M compound, or a new thing
that can be backed by a 2M struct page or a 4k struct smallpage. We
don't know yet, so I would table the new abstraction type for this.

I generally don't think we want a new type that does everything that
the overloaded struct page already does PLUS the compound
abstraction. Whatever name we pick for it, it'll always be difficult
to wrap your head around such a beast.

IMO starting with an explicit page cache descriptor that resolves to
struct page inside core VM code (and maybe ->fault) for now makes the
most sense: it greatly mitigates the PAGE_SIZE and tail page issue
right away, and it's not in conflict with, but rather helps work
toward, replacing the fungible memory unit behind it.

There isn't too much overlap or generic code between cache and anon
pages such that sharing a common descriptor would be a huge win (most
overlap is at the fungible memory block level, and the physical struct
page layout of course), so I don't think we should aim for a generic
abstraction for both.

As drivers go, I think there are slightly different requirements to
filesystems, too. For filesystems, when the VM can finally do it (and
the file range permits it), I assume we want to rather transparently
increase the unit of data transfer from 4k to 2M. Most drivers that
currently hardcode alloc_page() or PAGE_SIZE OTOH probably don't want
us to bump their allocation sizes.

There ARE instances where drivers allocate pages based on buffer_size
/ PAGE_SIZE and then interact with virtual memory. Those are true VM
objects that could grow transparently if PAGE_SIZE grows, and IMO they
should share the "fungible memory block" abstraction the VM uses.

But there are also many instances where PAGE_SIZE just means 4096 is a
good size for me, and struct page is useful for refcounting. Those
just shouldn't use whatever the VM or the cache layer are using and
stop putting additional burden on an already tricky abstraction.

> On Fri, Mar 26, 2021 at 01:48:15PM -0400, Johannes Weiner wrote:
> > On Wed, Mar 24, 2021 at 06:24:21AM +0000, Matthew Wilcox wrote:
> > > On Tue, Mar 23, 2021 at 08:29:16PM -0400, Johannes Weiner wrote:
> > > > On Mon, Mar 22, 2021 at 06:47:44PM +0000, Matthew Wilcox wrote:
> > > > > On Mon, Mar 22, 2021 at 01:59:24PM -0400, Johannes Weiner wrote:
> > > One of the patches I haven't posted yet starts to try to deal with kmap()/mem*()/kunmap():
> > >
> > > mm: Add kmap_local_folio
> > >
> > > This allows us to map a portion of a folio. Callers can only expect
> > > to access up to the next page boundary.
> > >
> > > Signed-off-by: Matthew Wilcox (Oracle) <willy@xxxxxxxxxxxxx>
> > >
> > > diff --git a/include/linux/highmem-internal.h b/include/linux/highmem-internal.h
> > > index 7902c7d8b55f..55a29c9d562f 100644
> > > --- a/include/linux/highmem-internal.h
> > > +++ b/include/linux/highmem-internal.h
> > > @@ -73,6 +73,12 @@ static inline void *kmap_local_page(struct page *page)
> > >  	return __kmap_local_page_prot(page, kmap_prot);
> > >  }
> > >
> > > +static inline void *kmap_local_folio(struct folio *folio, size_t offset)
> > > +{
> > > +	struct page *page = &folio->page + offset / PAGE_SIZE;
> > > +	return __kmap_local_page_prot(page, kmap_prot) + offset % PAGE_SIZE;
> > > +}
> > >
> > > Partly I haven't shared that one because I'm not 100% sure that 'byte
> > > offset relative to start of folio' is the correct interface. I'm looking
> > > at some users and thinking that maybe 'byte offset relative to start
> > > of file' might be better. Or perhaps that's just filesystem-centric
> > > thinking.
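
For anyone following along, the offset arithmetic in the quoted helper
can be modelled in plain userspace C. This is only a sketch under
assumptions: SKETCH_PAGE_SIZE, struct folio_pos, folio_pos_split() and
folio_map_len() are names made up here, not kernel interfaces, and a 4k
base page is assumed. It shows how a byte offset relative to the start
of the folio picks a base page and a position inside it, and why a
caller can only touch bytes up to the next page boundary.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 4k base pages, as in the discussion above. */
#define SKETCH_PAGE_SIZE 4096UL

/* Hypothetical result type: which base page of the folio, and where in it. */
struct folio_pos {
	size_t page_idx;	/* offset / PAGE_SIZE: selects the base page */
	size_t page_off;	/* offset % PAGE_SIZE: byte within that page */
};

/* Split a byte offset that is relative to the start of the folio. */
static inline struct folio_pos folio_pos_split(size_t offset)
{
	struct folio_pos pos = {
		.page_idx = offset / SKETCH_PAGE_SIZE,
		.page_off = offset % SKETCH_PAGE_SIZE,
	};
	return pos;
}

/*
 * Bytes usable through one mapping: the mapping covers a single base
 * page, so access stops at the next page boundary or at the end of
 * the folio, whichever comes first.
 */
static inline size_t folio_map_len(size_t offset, size_t folio_size)
{
	size_t to_boundary = SKETCH_PAGE_SIZE - offset % SKETCH_PAGE_SIZE;
	size_t to_end;

	assert(offset < folio_size);
	to_end = folio_size - offset;
	return to_boundary < to_end ? to_boundary : to_end;
}
```

A copy loop over a large folio would map, copy folio_map_len() bytes,
unmap, and advance the offset until it is done.
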
> >
> > Right, this doesn't seem specific to files just because they would be
> > the primary users of it.
>
> Yeah. I think I forgot to cc you on this:
>
> https://lore.kernel.org/linux-fsdevel/20210325032202.GS1719932@xxxxxxxxxxxxxxxxxxxx/
>
> and "byte offset relative to the start of the folio" works just fine:
>
> +	offset = offset_in_folio(folio, diter->pos);
> +
> +map:
> +	diter->entry = kmap_local_folio(folio, offset);

Yeah, that looks great to me!

> > > > But for that to work, we'll need the allocator to produce huge pages
> > > > at the necessary rate, too. The current implementation likely won't
> > > > scale. Compaction is expensive enough that we have to weigh when to
> > > > allocate huge pages for long-lived anon regions, let alone allocate
> > > > them for streaming IO cache entries.
> > >
> > > Heh, I have that as a work item for later this year -- give the page
> > > allocator per-cpu lists of compound pages, not just order-0 pages.
> > > That'll save us turning compound pages back into buddy pages, only to
> > > turn them into compound pages again.
> > >
> > > I also have a feeling that the page allocator either needs to become a
> > > sub-allocator of an allocator that deals in, say, 1GB chunks of memory,
> > > or it needs to become reluctant to break up larger orders. eg if the
> > > dcache asks for just one more dentry, it should have to go through at
> > > least one round of reclaim before we choose to break up a high-order
> > > page to satisfy that request.
> >
> > Slub already allocates higher-order pages for dentries:
> >
> > slabinfo - version: 2.1
> > # name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
> > dentry 133350 133350 192 42 2 : tunables 0 0 0 : slabdata 3175 3175 0
> >
> >                             ^ here
> >
> > and it could avoid even more internal fragmentation with bigger
> > orders.
> > It only doesn't because of the overhead of allocating them.
>
> Oh, yes. Sorry, I didn't explain myself properly. If we have a
> lightly-loaded system with terabytes of memory (perhaps all the jobs
> it is running are CPU intensive and don't need much memory), the system
> has a tendency to clog up with negative dentries. Hundreds of millions
> of them. We rely on memory pressure to get rid of them, and when there
> finally is memory pressure, it takes literally hours.
>
> If there were a slight amount of pressure to trim the dcache at the point
> when we'd otherwise break up an order-4 page to get an order-2 page,
> the system would work much better. Obviously, we do want the dcache to
> be able to expand to the point where it's useful, but at the point that
> it's no longer useful, we need to trim it.
>
> It'd probably be better to have the dcache realise that its old entries
> aren't useful any more and age them out instead of relying on memory
> pressure to remove old entries, so this is probably an unnecessary
> digression.

It's difficult to identify a universally acceptable line for
usefulness of caches other than physical memory pressure.

The good thing about the memory pressure threshold is that you KNOW
somebody else has immediate use for the memory, and you're justified
in recycling and reallocating caches from the cold end.

Without that, you'd either have to set an arbitrary size cutoff or an
arbitrary aging cutoff (not used in the last minute e.g.). But optimal
settings for either of those depend on the workload, and aren't very
intuitive to configure.

Such a large gap between the smallest object and the overall size of
memory is just inherently difficult to manage. More below.

> > If the default block size in the allocator were 2M, we'd also get slab
> > packing at that granularity, and we wouldn't have to worry about small
> > objects breaking huge pages any more than we worry about slab objects
> > fragmenting 4k pages today.
>
> Yup.
> I definitely see the attraction of letting the slab allocator
> allocate in larger units. On the other hand, you have to start worrying
> about underutilisation of the memory at _some_ size, and I'd argue the
> sweet spot is somewhere between 4kB and 2MB today. For example:
>
> fat_inode_cache 110 110 744 22 4 : tunables 0 0 0 : slabdata 5 5 0
>
> That's currently using 20 pages. If slab were only allocating 2MB slabs
> from the page allocator, I'd have 1.9MB of ram unused in that cache.

Right, we'd raise internal fragmentation to a worst case of 2M (minus
minimum object size) per slab cache. As a ratio of overall memory,
this isn't unprecedented, though: my desktop machine has 32G and my
phone has 8G. Divide those by 512 for a 4k base page comparison and
you get memory sizes common in the mid to late 90s.

Our levels of internal fragmentation are historically low, which of
course is nice by itself. But that's also what's causing problems in
the form of external fragmentation, and why we struggle to produce 2M
blocks. It's multitudes easier to free one 2M slab page of
consecutively allocated inodes than it is to free 512 batches of
different objects with conflicting lifetimes, ages, or potentially
even reclaimability.

I don't think we'll have much of a choice when it comes to trading
some internal fragmentation to deal with our mounting external
fragmentation problem.

[ Because of the way fragmentation works I also don't think that 1G
  would be a good foundational block size. It either wastes a crazy
  amount of memory on internal fragmentation, or you allow external
  fragmentation and the big blocks deteriorate with uptime anyway.

  There really is such a thing as a page: a goldilocks quantity of
  memory, given the overall amount of RAM in a system, that is optimal
  as a paging unit and intersection point for the fragmentation axes.

  This never went away. It just isn't 4k anymore on modern systems.
  And we're creating a bit of a mess by adapting various places (page
  allocator, slab, page cache, swap code) to today's goldilocks size
  while struct page lags behind and doesn't track reality anymore.

  I think there is a lot of value in disconnecting places from struct
  page that don't need it, but IMO all in the context of the broader
  goal of being able to catch up struct page to what the real page is.

  We may be able to get rid of the 4k backward-compatible paging units
  eventually when we all have 1TB of RAM. But the concept of a page in
  a virtual memory system isn't really going anywhere. ]

> > > But people seem attached to being able to use smaller page sizes.
> > > There's that pesky "compatibility" argument.
> >
> > Right, that's why I'm NOT saying we should eliminate the support for
> > 4k chunks in the page cache and page tables. That's still useful if
> > you have lots of small files.
> >
> > I'm just saying it doesn't have to be the default that everything is
> > primarily optimized for. We can make the default allocation size of
> > the allocator correspond to a hugepage and have a secondary allocator
> > level for 4k chunks. Like slab, but fixed-size and highmem-aware.
> >
> > It makes sense to make struct page 2M as well. It would save a ton of
> > memory on average and reduce the pressure we have on struct page's
> > size today.
> >
> > And we really don't need struct page at 4k just to support this unit
> > of paging when necessary: page tables don't care, they use pfns and can
> > point to any 4k offset, struct page or no struct page. For the page
> > cache, we can move mapping, index, lru etc. from today's struct page
> > into an entry descriptor that could either sit in a native 2M struct
> > page (just like today), or be allocated on demand and point into a
> > chunked struct page. Same for <2M anonymous mappings.
> >
> > Hey, didn't you just move EXACTLY those fields into the folio?
> > ;)
>
> You say page tables don't actually need a struct page, but we do use it.
>
>	struct {	/* Page table pages */
>		unsigned long _pt_pad_1;	/* compound_head */
>		pgtable_t pmd_huge_pte;	/* protected by page->ptl */
>		unsigned long _pt_pad_2;	/* mapping */
>		union {
>			struct mm_struct *pt_mm;	/* x86 pgds only */
>			atomic_t pt_frag_refcount;	/* powerpc */
>		};
> #if ALLOC_SPLIT_PTLOCKS
>		spinlock_t *ptl;
> #else
>		spinlock_t ptl;
> #endif
>	};
>
> It's a problem because some architectures would really rather
> allocate 2KiB page tables (s390) or would like to support 4KiB page
> tables on a 64KiB base page size kernel (ppc).
>
> [actually i misread your comment initially; you meant that page
> tables point to PFNs and don't care what struct backs them ... i'm
> leaving this in here because it illustrates a problem with change
> struct-page-size-to-2MB]

Yes, I meant what page table entries point to.

The page table (directories) themselves are still 4k as per the
architecture, and they'd also have to use smallpage descriptors. I
don't immediately see why they couldn't, though. It's not that many,
especially if pmd mappings are common (a 4k pmd can map 1G worth of
address space).
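
To make the smallpages[] lookup from earlier in this mail a bit more
concrete, here is a rough userspace model. Everything in it is
hypothetical: struct bigpage stands in for a 2M struct page, and
struct smallpage, PG_PTE_MAPPED_HUGE, pfn_to_bigpage() and
pfn_to_smallpage() are names invented for the sketch, not existing
kernel code. It assumes 4k base pages and 2M big pages, so 512 slots
per big page. The last helper just spells out the "a 4k pmd can map
1G" arithmetic.

```c
#include <assert.h>
#include <stddef.h>

#define SMALL_PER_BIG		512UL		/* 2M / 4k */
#define SMALL_SHIFT		9		/* log2(512) */
#define PG_PTE_MAPPED_HUGE	(1UL << 0)	/* hypothetical page flag */

struct smallpage {		/* hypothetical 4k descriptor */
	unsigned long flags;
};

struct bigpage {		/* stand-in for a 2M struct page */
	unsigned long flags;
	/* One pointer per 4k of RAM; NULL until a 4k descriptor exists. */
	struct smallpage *smallpages[SMALL_PER_BIG];
};

/* The upper pfn bits identify the 2M descriptor. */
static inline struct bigpage *pfn_to_bigpage(struct bigpage *memmap,
					     unsigned long pfn)
{
	return &memmap[pfn >> SMALL_SHIFT];
}

/*
 * Resolve a pfn found in a pte. Returns NULL if the pte belongs to a
 * pte-mapped 2M page (the caller then works on the bigpage itself) or
 * if no 4k descriptor has been allocated for that slot. Otherwise the
 * lower pfn bits are the offset into smallpages[].
 */
static inline struct smallpage *pfn_to_smallpage(struct bigpage *memmap,
						 unsigned long pfn)
{
	struct bigpage *big = pfn_to_bigpage(memmap, pfn);

	if (big->flags & PG_PTE_MAPPED_HUGE)
		return NULL;
	return big->smallpages[pfn & (SMALL_PER_BIG - 1)];
}

/* One 4k pmd table: 512 entries, each mapping a 2M range. */
static inline unsigned long long pmd_table_coverage(void)
{
	return 512ULL * (2ULL << 20);
}
```

A pmd-level mapping would stop at pfn_to_bigpage(); only pte-level
walkers would ever look at the flag and the pointer array.
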