On 23.09.21 03:21, Kent Overstreet wrote:
One thing that's come out of the folios discussions with both Matthew
and Johannes is that we seem to be thinking along similar lines
regarding our end goals for struct page.

The fundamental reason for struct page is that we need memory to be
self describing, without any context - we need to be able to go from a
generic untyped struct page and figure out what it contains: handling
physical memory failure is the most prominent example, but migration
and compaction are more common. We need to be able to ask the thing
that owns a page of memory "hey, stop using this and move your stuff
here".

Matthew's helpfully been coming up with a list of page types:
https://kernelnewbies.org/MemoryTypes

But struct page could be a lot smaller than it is now. I think we can
get it down to two pointers, which means it'll take up 0.4% of system
memory. Both Matthew and Johannes have ideas for getting it down even
further - the main thing to note is that virt_to_page() _should_ be an
uncommon operation (most of the places we're currently using it are
completely unnecessary, look at all the places we're using it on the
zero page). Johannes is thinking two layer radix tree, Matthew was
thinking about using maple trees - personally, I think that 0.4% of
system memory is plenty good enough.

Ok, but what do we do with the stuff currently in struct page?
-------------------------------------------------------------

The main thing to note is that in normal operation most folios are
going to be describing many pages, not just one, so we'll be using
_less_ memory overall if we allocate them separately. That's cool.

Of course, for this to make sense, we'll have to get all the other
stuff in struct page moved into their own types, but file & anon pages
are the big one, and that's already being tackled.

Why two ulongs/pointers, instead of just one?
---------------------------------------------

Because one of the things we really want and don't have now is a clean
division between allocator and allocatee state. Allocator meaning
either the buddy allocator or slab, allocatee state would be the folio
or the network pool state or whatever actually called kmalloc() or
alloc_pages().

Right now slab state sits in the same place in struct page where
allocatee state does, and the reason this is bad is that slab/slub are
a hell of a lot faster than the buddy allocator, and Johannes wants to
move the boundary between slab allocations and buddy allocator
allocations up to like 64k. If we fix where slab state lives, this
will become completely trivial to do.

So if we have this:

	struct page {
		unsigned long	allocator;
		unsigned long	allocatee;
	};

The allocator field would be used for either a pointer to slab/slub's
state, if it's a slab page, or if it's a buddy allocator page it'd
encode the order of the allocation - like compound order today, and
probably whether or not the (compound group of) pages is free.

The allocatee field would be used for a type-tagged pointer (using the
low bits of the pointer) to one of:

 - struct folio
 - struct anon_folio, if that becomes a thing
 - struct network_pool_page
 - struct pte_page
 - struct zone_device_page

Then we can further refactor things until all the stuff that's
currently crammed in struct page lives in types where each struct
field means one and precisely one thing, and also where we can freely
reshuffle and reorganize and add stuff to the various types where we
couldn't before because it'd make struct page bigger.

Other notes & potential issues:

 - page->compound_dtor needs to die

 - page->rcu_head moves into the types that actually need it, no
   issues there

 - page->refcount has question marks around it.
   I think we can also just move it into the types that need it; with
   RCU, dereferencing the pointer to the folio or whatever and
   grabbing a ref on folio->refcount can happen under an RCU read
   lock - there's no real question about whether it's technically
   possible to get it out of struct page, and I think it would be
   cleaner overall that way. However, depending on how it's used from
   code paths that go from generic untyped pages, I could see it
   turning into more of a hassle than it's worth. More investigation
   is needed.

 - page->memcg_data - I don't know whether that one more properly
   belongs in struct page or in the page subtypes - I'd love it if
   Johannes could talk about that one.

 - page->flags - dealing with this is going to be a huge hassle but
   also where we'll find some of the biggest gains in overall sanity
   and readability of the code.

   Right now, PG_locked is super special and ad hoc and I have run
   into situations multiple times (and Johannes was in vehement
   agreement on this one) where I simply could not figure out the
   behaviour of the current code re: who is responsible for locking
   pages without instrumenting the code with assertions.

   Meaning anything we do to create and enforce module boundaries
   between different chunks of code is going to suck, but the end
   result should be really worthwhile.

Matthew Wilcox and David Howells have been having conversations on IRC
about what to do about other page bits. It appears we should be able
to kill a lot of filesystem usage of both PG_private and PG_private_2 -
filesystems in general hang state off of page->private, soon to be
folio->private, and PG_private in current use just indicates whether
page->private is nonzero - meaning it's completely redundant.
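
For readers following along, the two-word layout and the low-bit type
tag from the quoted proposal can be sketched in user-space C roughly as
below. This is only an illustration under stated assumptions, not code
from the proposal or from the kernel: the enum values, the three-bit tag
width, the helper names (page_set_allocatee(), page_get_type(),
page_get_allocatee(), page_overhead_percent()) and the stand-in struct
folio are all hypothetical, and it assumes unsigned long is
pointer-sized, as it is on Linux.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical tags for the types listed in the proposal. */
enum page_type {
	PAGE_TYPE_FOLIO = 0,
	PAGE_TYPE_ANON_FOLIO,
	PAGE_TYPE_NETWORK_POOL,
	PAGE_TYPE_PTE,
	PAGE_TYPE_ZONE_DEVICE,
};

/* Assumption: three low bits are free because objects are 8-byte aligned. */
#define PAGE_TYPE_BITS	3
#define PAGE_TYPE_MASK	((uintptr_t)((1u << PAGE_TYPE_BITS) - 1))

/* The two-word struct page from the proposal. */
struct page {
	unsigned long allocator;	/* slab state pointer or buddy order */
	unsigned long allocatee;	/* tagged pointer to the owning type */
};

/* Stand-in for the real type; the fields here are made up. */
struct folio {
	_Alignas(8) unsigned long flags;
	int refcount;
};

/* Store an owner pointer together with its type tag in the low bits. */
static inline void page_set_allocatee(struct page *page, void *obj,
				      enum page_type type)
{
	assert(((uintptr_t)obj & PAGE_TYPE_MASK) == 0);
	page->allocatee = (unsigned long)((uintptr_t)obj | (uintptr_t)type);
}

/* Recover the type tag from a generic, untyped struct page. */
static inline enum page_type page_get_type(const struct page *page)
{
	return (enum page_type)((uintptr_t)page->allocatee & PAGE_TYPE_MASK);
}

/* Recover the owner pointer with the tag bits masked off. */
static inline void *page_get_allocatee(const struct page *page)
{
	return (void *)((uintptr_t)page->allocatee & ~PAGE_TYPE_MASK);
}

/* Share of memory spent on struct page, one struct per page_size bytes. */
static inline double page_overhead_percent(unsigned long page_size)
{
	return 100.0 * (double)sizeof(struct page) / (double)page_size;
}
```

On a 64-bit build with 4 KiB pages, sizeof(struct page) is 16 bytes, so
page_overhead_percent(4096) comes out at roughly 0.39%, which is where
the "0.4% of system memory" figure comes from.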
Don't get me wrong, but before there are answers to some of the very basic questions raised above (especially everything that lives in page->flags, which are not only page flags, refcount, ...) this isn't very tempting to spend more time on, from a reviewer perspective.
-- 
Thanks,

David / dhildenb