I read through this and unfortunately have nothing useful to contribute to the actual lock sharding. Just two semi-related bits: On Sun, Jan 26, 2025 at 12:46:45AM +0000, Matthew Wilcox wrote: > Postgres are experimenting with doing direct I/O to 1GB hugetlb pages. > Andres has gathered some performance data showing significantly worse > performance with 1GB pages compared to 2MB pages. I sent a patch > recently which improves matters [1], but problems remain. > > The primary problem we've identified is contention of folio->_refcount > with a strong secondary contention on folio->_pincount. This is coming > from the call chain: > > iov_iter_extract_pages -> > gup_fast_fallback -> > try_grab_folio_fast Eww, gup_fast_fallback sent me down the wrong path, as it suggests that is is the fallback slow path, but it's not. It got renamed from internal_get_user_pages_fast in commit 23babe1934d7 ("mm/gup: consistently name GUP-fast functions"). While old name wasn't all that great the new one including fallback is just horrible. Can we please fix this up? The other thing is that the whole GUP machinery takes a reference per page fragment it touches and not just per folio fragment. I'm not sure how fair access to the atomics is, but I suspect the multiple calls to increment/decrement them per operation probably don't make the lock contention any better.