>> 
>> Thanks for making me rerun this and yes indeed I had a very silly bug in the
>> benchmark code (i.e. madvise the same page for the whole loop) and this is
>> indeed several times slower than without the patch (sorry David for misleading
>> you).

No worries, BUGs happen :)

>> 
>> To better understand what is happening, I profiled the benchmark:
>> 
>> - 31.27%  0.01%  dontneed  [kernel.kallsyms]  [k] zap_page_range_sync
>>    - 31.27% zap_page_range_sync
>>       - 30.25% split_local_deferred_list
>>          - 30.16% split_huge_page_to_list
>>             - 21.05% try_to_migrate
>>                + rmap_walk_anon
>>             - 7.47% remove_migration_ptes
>>                + 7.34% rmap_walk_locked
>>       + 1.02% zap_page_range_details
> 
> Makes sense, thanks for verifying it, Shakeel. I forgot it'll also walk
> itself.
> 
> I believe this effect will be exaggerated when the mapping is shared,
> e.g. shmem file thp between processes. What's worse is that when one process
> DONTNEEDs one 4k page, all the other mms will need to split the huge pmd
> without even noticing, so that's a direct perf degradation for them.

Would this really apply to MADV_DONTNEED on shmem, and would deferred
splitting apply on shmem? I'm constantly confused about shmem vs. anon,
but I would have assumed that shmem is fd-based and we wouldn't end up
in rmap_walk_anon.

For shmem, the pagecache would contain the THP, which would stick around,
and deferred splits don't even apply.

But again, I'm constantly confused, so I'd love to be enlightened.

> 
>> 
>> The overhead is not due to copying page flags but rather due to two rmap walks.
>> I don't think this much overhead is justified for current users of MADV_DONTNEED
>> and munmap. I have to rethink the approach.

Most probably not.

> 
> Some side notes: I dug out the old MADV_COLLAPSE proposal right after I
> thought about MADV_SPLIT (or any of its variants):
> 
> https://lore.kernel.org/all/d098c392-273a-36a4-1a29-59731cdf5d3d@xxxxxxxxxx/
> 
> My memory was that there was some issue to be solved so it was blocked;
> however, when I read the thread it sounds like the list was mostly reaching a
> consensus on considering MADV_COLLAPSE beneficial. Still copying DavidR in
> case I missed something important.
> 
> If we think MADV_COLLAPSE can help to implement a userspace (and, more
> importantly, data-aware) khugepaged, then MADV_SPLIT can be the other side of
> kcompactd, perhaps.
> 
> That's probably a bit off topic for this specific discussion of the specific
> use case, but so far it all seems reasonable and worth discussing.

User space can trigger a split manually using some MADV hackery (rough
sketch below). But it can only be used for the use case here, where we
actually want to zap a page.

1. MADV_FREE a single 4k page in the range. This will split the PMD into
   PTEs and split the compound page.

2. MADV_DONTNEED either the complete range or the single 4k page.

-- 
Thanks,

David / dhildenb
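
For illustration, here is a minimal, untested sketch of the two-step MADV
sequence described above. The buffer setup (anonymous mmap, MADV_HUGEPAGE,
the 2 MiB alignment handling and the size constants) is an assumption added
only to make the example self-contained; it is not part of the original
mail, and whether the range really ends up backed by a PMD-mapped THP still
depends on the usual THP configuration.

/*
 * Hypothetical sketch (not from the original mail): split a PMD-mapped
 * anonymous THP from user space by MADV_FREE'ing a single 4k page, then
 * zap it with MADV_DONTNEED.
 */
#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define PAGE_SIZE	4096UL
#define PMD_SIZE	(2UL * 1024 * 1024)	/* assuming 2 MiB PMD THPs */

int main(void)
{
	/* Over-allocate so a 2 MiB aligned address can be picked inside. */
	size_t len = 2 * PMD_SIZE;
	char *raw = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (raw == MAP_FAILED) {
		perror("mmap");
		return EXIT_FAILURE;
	}
	char *buf = (char *)(((uintptr_t)raw + PMD_SIZE - 1) & ~(PMD_SIZE - 1));

	madvise(buf, PMD_SIZE, MADV_HUGEPAGE);
	memset(buf, 1, PMD_SIZE);	/* fault in, ideally as one THP */

	/* 1. MADV_FREE a single 4k page: splits the PMD and the compound page. */
	if (madvise(buf, PAGE_SIZE, MADV_FREE))
		perror("madvise(MADV_FREE)");

	/* 2. MADV_DONTNEED the single 4k page (or the complete range). */
	if (madvise(buf, PAGE_SIZE, MADV_DONTNEED))
		perror("madvise(MADV_DONTNEED)");

	munmap(raw, len);
	return EXIT_SUCCESS;
}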