On Thu, Aug 10, 2023 at 07:47:35PM +0200, David Hildenbrand wrote:
> On 10.08.23 19:15, Peter Xu wrote:
> > On Thu, Aug 10, 2023 at 11:48:27AM +0100, Ryan Roberts wrote:
> > > > For PTE-mapped THP, it might be a bit bigger noise, although I doubt
> > > > it is really significant (judging from my experience on managing
> > > > PageAnonExclusive using set_bit/test_bit/clear_bit when (un)mapping
> > > > anon pages).
> > > >
> > > > As folio_add_file_rmap_range() indicates, for PTE-mapped THPs we
> > > > should be batching where possible (and Ryan is working on some more
> > > > rmap batching).
> > >
> > > Yes, I've just posted [1] which batches the rmap removal. That would
> > > allow you to convert the per-page atomic_dec() into a (usually)
> > > single per-large-folio atomic_sub().
> > >
> > > [1] https://lore.kernel.org/linux-mm/20230810103332.3062143-1-ryan.roberts@xxxxxxx/
> >
> > Right, that'll definitely make more sense, thanks for the link; I'd be
> > very happy to read more later (I finally got some free time
> > recently..).  But then does it mean David's patch can be attached at
> > the end instead of being proposed separately and early?
>
> Not in my opinion. Batching rmap makes sense even without this change,
> and this change makes sense even without batching.
>
> > I was asking mostly because I read it as a standalone patch first, and
> > honestly I don't know the effect.  It's based not only on the added
> > atomic ops themselves, but also on the field changes.
> >
> > For example, this patch moves Hugh's _nr_pages_mapped into the 2nd
> > tail page; I think it means that for any rmap change of any small page
> > of a huge one we'll need to start touching one more 64B cacheline on
> > x86.  I really have no idea what that means especially for a large SMP
> > system: see 292648ac5cf1 on why I had that impression.  But I don't
> > have enough experience or clues to prove it's a problem either; maybe
> > it would be interesting to measure the time needed for some pte-mapped
> > loops?  E.g., something like faulting in a thp,
>
> Okay, so your speculation right now is:
>
> 1) The change in cacheline might be problematic.
>
> 2) The additional atomic operation might be problematic.
>
> > then measure the split (by e.g. mprotect() at offset 1M on a 4K page?)
> > time it takes before/after this patch.
>
> I can certainly try getting some numbers on that. If you're aware of
> other micro-benchmarks that would likely notice slower pte-mapping of
> THPs, please let me know.

Thanks.

> > When looking at this, I actually found one thing that is slightly
> > confusing, not directly relevant to your patch, but regarding the
> > reuse of tail page 1 at offset 24 bytes.  Currently it's Hugh's
> > _nr_pages_mapped, and you're proposing to replace it with the total
> > mapcount:
> >
> >          atomic_t _nr_pages_mapped;       /*    88     4 */
> >
> > Now my question is.. isn't byte 24 of tail page 1 used for keeping a
> > poisoned mapping?  See prep_compound_tail(), where it has:
> >
> > 	p->mapping = TAIL_MAPPING;
> >
> > While here mapping is, afaict, also using offset 24 of the tail page 1:
> >
> >          struct address_space * mapping;  /*    24     8 */
> >
> > I hope I got the math wrong somewhere, though.
>
> I think your math is correct.
>
> prep_compound_head() is called after prep_compound_tail(), so
> prep_compound_head() wins.
>
> In __split_huge_page_tail() there is a VM_BUG_ON_PAGE() that explains
> the situation:
>
> /* ->mapping in first and second tail page is replaced by other uses */
> VM_BUG_ON_PAGE(tail > 2 && page_tail->mapping != TAIL_MAPPING,
>                page_tail);
>
> Thanks for raising that, I had to look into that myself.
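To make the overlap concrete, here's a tiny userspace sketch (simplified
stand-ins rather than the real kernel structs, assuming a 64-bit build)
showing that the folio counter at offset 88 (i.e. 64 + 24, so offset 24
within tail page 1) shares its bytes with that page's ->mapping:

/* Simplified stand-ins for the page/folio overlay; NOT the real kernel
 * definitions, just enough to show the offsets on a 64-bit build.
 * Build and run with: gcc -o overlay overlay.c && ./overlay
 */
#include <stdio.h>
#include <stddef.h>

struct fake_page {			/* struct page, leading fields only */
	unsigned long flags;		/* offset  0 */
	unsigned long lru_next;		/* offset  8 (lru.next) */
	unsigned long lru_prev;		/* offset 16 (lru.prev) */
	void *mapping;			/* offset 24: TAIL_MAPPING goes here */
	unsigned long index;		/* offset 32; the rest omitted */
};

struct fake_folio_page1 {		/* what the folio overlays on tail 1 */
	unsigned long _flags_1;		/* offset  0 */
	unsigned long _head_1;		/* offset  8 */
	unsigned char _folio_dtor;	/* offset 16 */
	unsigned char _folio_order;	/* offset 17 */
	int _entire_mapcount;		/* offset 20 (atomic_t in the kernel) */
	int _nr_pages_mapped;		/* offset 24: same bytes as ->mapping */
	int _pincount;			/* offset 28 */
};

int main(void)
{
	printf("page->mapping          at offset %zu\n",
	       offsetof(struct fake_page, mapping));
	printf("folio _nr_pages_mapped at offset %zu\n",
	       offsetof(struct fake_folio_page1, _nr_pages_mapped));
	return 0;
}

Both print 24: prep_compound_tail() writes TAIL_MAPPING into those bytes
first, and prep_compound_head(), running afterwards, reinitializes them
as the folio counters, which is consistent with the VM_BUG_ON_PAGE()
above only checking tails > 2.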
It's confusing enough that I tried to document it a bit myself, then
figured maybe I should just post a patch for it, which I just did:

https://lore.kernel.org/r/20230810204944.53471-1-peterx@xxxxxxxxxx

It'll conflict with yours, but I marked it RFC so it's not anything
urgent; maybe it'll already be helpful for you when introducing any new
fields like total_mapcount.

AFAICS, if that patch is all correct (while I'm not yet sure..), you can
actually fit your new total mapcount field into page 1 and so even avoid
the extra cacheline access.  You can have a look: the trick is that the
refcount of tail page 1 still seems to be free on 32 bits (if that was
your worry before).  Then it'll be very nice to keep Hugh's counter all
in tail 1.
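FWIW, the timing test I suggested above could start from something like
the sketch below.  It's untested and makes assumptions: x86-64 with 2M
THPs, THP enabled so that memset() faults the region in as one huge
page, and a real run would loop and average rather than take one sample:

/* Fault in (hopefully) one 2M anon THP, then mprotect() a single 4K
 * page at offset 1M, forcing the PMD mapping to be split into PTEs,
 * which is the path that exercises the per-page rmap accounting.
 * Error handling is kept minimal on purpose.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <sys/mman.h>

#define THP_SIZE (2UL << 20)

int main(void)
{
	struct timespec t0, t1;
	char *buf;
	/* Over-allocate so the buffer can be aligned to 2M manually. */
	char *raw = mmap(NULL, 2 * THP_SIZE, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (raw == MAP_FAILED)
		return 1;
	buf = (char *)(((unsigned long)raw + THP_SIZE - 1) & ~(THP_SIZE - 1));

	madvise(buf, THP_SIZE, MADV_HUGEPAGE);
	memset(buf, 1, THP_SIZE);	/* fault in, ideally as one THP */

	clock_gettime(CLOCK_MONOTONIC, &t0);
	mprotect(buf + (1UL << 20), 4096, PROT_READ);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("pmd split took %ld ns\n",
	       (t1.tv_sec - t0.tv_sec) * 1000000000L +
	       (t1.tv_nsec - t0.tv_nsec));
	return 0;
}

-- 
Peter Xu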