On 16.12.20 19:23, Matthew Wilcox (Oracle) wrote:
> One of the great things about compound pages is that when you try to
> do various operations on a tail page, it redirects to the head page and
> everything Just Works.  One of the awful things is how much we pay for
> that simplicity.  Here's an example, end_page_writeback():
>
>         if (PageReclaim(page)) {
>                 ClearPageReclaim(page);
>                 rotate_reclaimable_page(page);
>         }
>         get_page(page);
>         if (!test_clear_page_writeback(page))
>                 BUG();
>
>         smp_mb__after_atomic();
>         wake_up_page(page, PG_writeback);
>         put_page(page);
>
> That all looks very straightforward, but if you dive into the disassembly,
> you see that there are four calls to compound_head() in this function
> (PageReclaim(), ClearPageReclaim(), get_page() and put_page()).  It's
> all for nothing, because if anyone does call this routine with a tail
> page, wake_up_page() will VM_BUG_ON_PGFLAGS(PageTail(page), page).
>
> I'm not really a CPU person, but I imagine there's some kind of dependency
> here that sucks too:
>
>     1fd7:       48 8b 57 08             mov    0x8(%rdi),%rdx
>     1fdb:       48 8d 42 ff             lea    -0x1(%rdx),%rax
>     1fdf:       83 e2 01                and    $0x1,%edx
>     1fe2:       48 0f 44 c7             cmove  %rdi,%rax
>     1fe6:       f0 80 60 02 fb          lock andb $0xfb,0x2(%rax)
>
> Sure, it's going to be cache hot, but that cmove has to execute before
> the lock andb.
>
> I would like to introduce a new concept that I call a Page Folio.
> Or just struct folio to its friends.  Here it is,
> struct folio {
>         struct page page;
> };
>
> A folio is a struct page which is guaranteed not to be a tail page.
> So it's either a head page or a base (order-0) page.  That means
> we don't have to call compound_head() on it and we save massively.
> end_page_writeback() reduces from four calls to compound_head() to just
> one (at the beginning of the function) and it shrinks from 213 bytes
> to 126 bytes (using distro kernel config options).  I think even that one
> can be eliminated, but I'm going slowly at this point and taking the
> safe route of transforming a random struct page pointer into a struct
> folio pointer by calling page_folio().  By the end of this exercise,
> end_page_writeback() will become end_folio_writeback().
>
> This is going to be a ton of work, and massively disruptive.  It'll touch
> every filesystem, and a good few device drivers!  But I think it's worth
> it.  Not every routine benefits as much as end_page_writeback(), but it
> makes everything a little better.  At 29 bytes per call to lock_page(),
> unlock_page(), put_page() and get_page(), that's on the order of 60kB of
> text for allyesconfig.  More when you add on all the PageFoo() calls.
> With the small amount of work I've done here, mm/filemap.o shrinks its
> text segment by over a kilobyte from 33687 to 32318 bytes (and also 192
> bytes of data).

Just wondering, as the primary motivation here is "minimizing CPU work",
did you run any benchmarks that revealed a visible performance improvement?

Otherwise, we're left with a concept that's hard to grasp first (folio -
what?!) and "a ton of work, and massively disruptive", saving some kb of
code - which does not sound too appealing to me.

(I like the idea of abstracting which pages are actually worth looking
at directly instead of going via a tail page - tail pages act somewhat
like a proxy for the head page when accessing flags)

-- 
Thanks,

David / dhildenb
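
To make the "tail pages act like a proxy for the head page" point concrete,
here is a minimal userspace sketch of the redirection being discussed. It is
an illustration under assumptions, not the kernel's code: the two-field
struct page below is a simplified stand-in for the real layout, and
page_is_tail(), set_page_flag(), folio_set_flag() and make_tail() are
hypothetical helpers invented for this example. Only the idea is taken from
the quoted mail and its disassembly: a tail page stores the head page's
address with bit 0 set, every page-based flag operation has to decode that
first, and a folio-based operation can skip the decode.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for struct page; NOT the real kernel layout. */
struct page {
	unsigned long flags;
	uintptr_t compound_head;	/* tail pages: head address | 1 */
};

/* As in the quoted mail: a struct page known not to be a tail page. */
struct folio {
	struct page page;
};

/* Hypothetical helper: bit 0 of compound_head marks a tail page. */
static int page_is_tail(const struct page *page)
{
	return (page->compound_head & 1) != 0;
}

/* Redirect a tail page to its head; head and base pages map to themselves. */
static struct page *compound_head(struct page *page)
{
	uintptr_t head = page->compound_head;

	if (head & 1)
		return (struct page *)(head - 1);
	return page;
}

/* Convert an arbitrary page pointer once, at the start of a function. */
static struct folio *page_folio(struct page *page)
{
	return (struct folio *)compound_head(page);
}

/* Page-based operation (hypothetical): pays for the redirect on every call. */
static void set_page_flag(struct page *page, unsigned int bit)
{
	compound_head(page)->flags |= 1UL << bit;
}

/* Folio-based operation (hypothetical): no redirect, only a sanity check. */
static void folio_set_flag(struct folio *folio, unsigned int bit)
{
	assert(!page_is_tail(&folio->page));
	folio->page.flags |= 1UL << bit;
}

/* Hypothetical helper: link a tail page to its head page. */
static void make_tail(struct page *tail, struct page *head)
{
	tail->compound_head = (uintptr_t)head | 1;
}
```

In this model, a function that touches flags four times through
set_page_flag() decodes the head pointer four times, whereas calling
page_folio() once and then using folio_set_flag() decodes it once. Whether
that difference is measurable in practice is exactly the benchmark question
raised above.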