James,

On Thu, Jan 26, 2023 at 08:58:51AM -0800, James Houghton wrote:
> It turns out that the THP-like scheme significantly slows down
> MADV_COLLAPSE: decrementing the mapcounts for the 4K subpages becomes
> the vast majority of the time spent in MADV_COLLAPSE when collapsing
> 1G mappings. It is doing 262k atomic decrements, so this makes sense.
>
> This is only really a problem because this is done between
> mmu_notifier_invalidate_range_start() and
> mmu_notifier_invalidate_range_end(), so KVM won't allow vCPUs to
> access any of the 1G page while we're doing this (and it can take like
> ~1 second for each 1G, at least on the x86 server I was testing on).

Did you try to measure the time, or is it a quick observation from perf?

IIRC I used to measure some atomic ops, and it was not as drastic as I
thought.  But maybe it depends on many things.

I'm curious how the 1sec is provisioned between the procedures.  E.g., I
would expect mmu_notifier_invalidate_range_start() to also take some time,
as it should walk the smally mapped EPT pgtables.

Since we'll still keep the intermediate levels around - from application
POV, one other thing to remedy this is to further shrink the size of
COLLAPSE, so potentially for a very large page we can start with building
2M layers.  But then collapse will need to be run for at least two rounds.

-- 
Peter Xu
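
As a sanity check on the numbers in this thread, the sketch below works
through the arithmetic only. It assumes x86-64 page sizes (4K, 2M, 1G), and
the helper name `entries` is hypothetical. The per-subpage figure assumes the
whole ~1 second went to the decrements, which the thread has not established;
it is an upper bound on the average, not a measurement.

```python
# Arithmetic sketch only; assumes x86-64 page sizes (4K base, 2M and 1G huge).
KIB = 1024
PAGE_4K = 4 * KIB
PAGE_2M = 2 * KIB * KIB
PAGE_1G = KIB ** 3


def entries(mapping_size: int, granule: int) -> int:
    """Number of granule-sized pieces needed to cover mapping_size."""
    if mapping_size % granule:
        raise ValueError("mapping size must be a multiple of the granule")
    return mapping_size // granule


# 4K subpages in one 1G page: the "262k" figure quoted above.
subpages_per_1g = entries(PAGE_1G, PAGE_4K)

# Sizes involved if a 1G page is handled in 2M steps.
subpages_per_2m = entries(PAGE_2M, PAGE_4K)
chunks_2m_per_1g = entries(PAGE_1G, PAGE_2M)

# ASSUMPTION: the full ~1 s per 1G is spent on mapcount decrements.
# Under that assumption, this is the average time budget per 4K subpage.
ns_per_subpage = 1_000_000_000 / subpages_per_1g

if __name__ == "__main__":
    print(f"4K subpages per 1G page: {subpages_per_1g}")
    print(f"4K subpages per 2M page: {subpages_per_2m}")
    print(f"2M chunks per 1G page:   {chunks_2m_per_1g}")
    print(f"avg ns per subpage at ~1 s per 1G: {ns_per_subpage:.0f}")
```

If the average really were in the microsecond range per subpage, that would
be a reason to break the ~1 second down per procedure before attributing it
all to the atomic decrements.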