For MADV_DONTNEED[_LOCKED] or MADV_FREE madvise requests, tlb flushes
can happen for each vma of the given address ranges.  Because such tlb
flushes are for address ranges of the same process, doing those in a
batch is more efficient while still being safe.  Modify the madvise()
and process_madvise() entry-level code paths to do such batched tlb
flushes, while the internal unmap logic only gathers the tlb entries to
flush.

In more detail, modify the entry functions to initialize an mmu_gather
object and pass it to the internal logic.  Also modify the internal
logic to only gather the tlb entries to flush into the received
mmu_gather object.  After all internal function calls are done, the
entry functions finish the mmu_gather object to flush the gathered tlb
entries in one batch.

Patches Sequence
================

The first four patches are minor cleanups of madvise.c for readability.

The following four patches (patches 5-8) define a new data structure for
managing the information that is required for batched tlb flushing
(mmu_gather and behavior), and update the code paths of the
MADV_DONTNEED[_LOCKED] and MADV_FREE handling internal logic to receive
it.

Three patches (patches 9-11) that make the internal
MADV_DONTNEED[_LOCKED] and MADV_FREE handling logic ready for batched
tlb flushing follow.  These patches keep the support of the unbatched
tlb flushes use case, for fine-grained and safe transitions.

The next three patches (patches 12-14) update the madvise() and
process_madvise() code to do the batched tlb flushes, utilizing the
changes that the previous patches introduced.

The final two patches (patches 15-16) clean up the internal logic's
support code for the unbatched tlb flushes use case, which is no longer
used.

Test Results
============

I measured the time to apply MADV_DONTNEED advice to 256 MiB memory
using multiple process_madvise() calls.  I apply the advice at 4 KiB
region granularity, but with varying batch size (vlen) from 1 to 1024.
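For readers who want a feel for the measurement loop described above,
below is a minimal user-space sketch.  It is not the code at [1].  The
function names (fill_batch(), time_dontneed()), the MAX_VLEN limit and
the fallback syscall numbers (those of x86_64 and asm-generic
architectures) are assumptions made for this illustration.  It also
assumes a kernel on which process_madvise() accepts MADV_DONTNEED for
the calling process, and a buffer size that is a multiple of 4 KiB.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <time.h>
#include <unistd.h>

/* Assumed fallbacks for old headers (x86_64 and asm-generic numbers). */
#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif
#ifndef SYS_process_madvise
#define SYS_process_madvise 440
#endif

#define SZ_CHUNK 4096UL		/* size of each advised region */
#define MAX_VLEN 1024		/* largest batch size of this sketch */

/* Fill @vec with up to @vlen SZ_CHUNK-sized regions of [@start, @end). */
size_t fill_batch(struct iovec *vec, size_t vlen, char *start, char *end)
{
	size_t n = 0;

	for (; n < vlen && start + SZ_CHUNK <= end; n++, start += SZ_CHUNK) {
		vec[n].iov_base = start;
		vec[n].iov_len = SZ_CHUNK;
	}
	return n;
}

/*
 * Apply MADV_DONTNEED to @buf using process_madvise() calls that each
 * pass up to @vlen regions.  Returns the elapsed time in nanoseconds,
 * or -1 on failure.
 */
long long time_dontneed(char *buf, size_t sz_buf, size_t vlen)
{
	struct iovec vec[MAX_VLEN];
	struct timespec t0, t1;
	char *pos = buf, *end = buf + sz_buf;
	long long ret = -1;
	int pidfd;

	if (vlen == 0 || vlen > MAX_VLEN)
		return -1;
	pidfd = syscall(SYS_pidfd_open, getpid(), 0);
	if (pidfd < 0)
		return -1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	while (pos < end) {
		size_t n = fill_batch(vec, vlen, pos, end);

		if (n == 0 || syscall(SYS_process_madvise, pidfd, vec, n,
				      MADV_DONTNEED, 0) < 0)
			goto out;
		pos += n * SZ_CHUNK;
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	ret = (t1.tv_sec - t0.tv_sec) * 1000000000LL +
		(t1.tv_nsec - t0.tv_nsec);
out:
	close(pidfd);
	return ret;
}
```

A caller would mmap() and touch a 256 MiB anonymous buffer, then call
time_dontneed() once per batch size, re-populating the buffer between
runs.  Note that the measured time of this sketch also includes the
cost of filling the iovec array.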
The source code for the measurement is available at GitHub[1].

The measurement results are as below.  The 'sz_batches' column shows the
batch size of the process_madvise() calls.  The 'before' and 'after'
columns are the measured time to apply MADV_DONTNEED to the 256 MiB
memory buffer in nanoseconds, on kernels that are built without and
with the MADV_DONTNEED tlb flushes batching patch of this series,
respectively.  For the baseline, mm-unstable tree of 2025-03-04[2] has
been used.  The 'after/before' column is the ratio of 'after' to
'before'.  So an 'after/before' value lower than 1.0 means this patch
increased efficiency over the baseline, and a lower value means better
efficiency.

    sz_batches  before     after      after/before
    1           102842895  106507398  1.03563204828102
    2           73364942   74529223   1.01586971880929
    4           58823633   51608504   0.877343022998937
    8           47532390   44820223   0.942940655834895
    16          43591587   36727177   0.842529018271347
    32          44207282   33946975   0.767904595446515
    64          41832437   26738286   0.639175910310939
    128         40278193   23262940   0.577556694263817
    256         41568533   22355103   0.537789077136785
    512         41626638   22822516   0.54826709762148
    1024        44440870   22676017   0.510251419470411

For batch sizes <=2, tlb flushes batching shows no big difference but
slight overhead.  I think that's within the error range of this simple
micro-benchmark, and therefore can be ignored.  Starting from batch size
4, however, tlb flushes batching shows clear efficiency gain.  The
efficiency gain tends to grow with the batch size, as expected.  The
efficiency gain ranges from about 12 percent with batch size 4 up to
about 49 percent with batch size 1,024.

Please note that this is a very simple microbenchmark, so the real
efficiency gain on real workloads could be very different.
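As a rough sanity check of why the gain is expected to grow with the
batch size, the below sketch counts tlb flushes under a simplified
model.  The model is an assumption made for illustration, not a
measurement: without batching, every advised 4 KiB region costs one
flush, and with batching, every process_madvise() call costs one flush.
In reality mmu_gather may flush more often than that.  All function
names are hypothetical.

```c
#include <stdio.h>

#define SZ_BUF		(256UL << 20)	/* 256 MiB buffer, as in the test */
#define SZ_REGION	4096UL		/* 4 KiB per advised region */

/* Number of process_madvise() calls needed to cover @nr_regions. */
unsigned long nr_calls(unsigned long nr_regions, unsigned long vlen)
{
	return (nr_regions + vlen - 1) / vlen;
}

/* Assumption: without batching, each advised region costs one flush. */
unsigned long flushes_before(unsigned long nr_regions)
{
	return nr_regions;
}

/* Assumption: with batching, each process_madvise() call costs one flush. */
unsigned long flushes_after(unsigned long nr_regions, unsigned long vlen)
{
	return nr_calls(nr_regions, vlen);
}

/* Print the modeled flush counts for batch sizes 1, 2, 4, ..., 1024. */
void print_flush_model(void)
{
	unsigned long nr_regions = SZ_BUF / SZ_REGION;
	unsigned long vlen;

	printf("sz_batches  flushes_before  flushes_after\n");
	for (vlen = 1; vlen <= 1024; vlen *= 2)
		printf("%-10lu  %-14lu  %lu\n", vlen,
		       flushes_before(nr_regions),
		       flushes_after(nr_regions, vlen));
}
```

Under this model, batch size 1 gives no reduction at all, which would be
consistent with the near-1.0 ratio measured for it, while batch size
1,024 reduces 65,536 flushes to 64.  The measured time does not shrink
by the same factor, presumably because tlb flushes are only a part of
the total cost of each call.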
References
==========

[1] https://github.com/sjp38/eval_proc_madvise
[2] commit 7b6c5895bb9a ("mm: hugetlb: log time needed to allocate hugepages") # mm-unstable

SeongJae Park (16):
  mm/madvise: use is_memory_failure() from madvise_do_behavior()
  mm/madvise: split out populate behavior check logic
  mm/madvise: deduplicate madvise_do_behavior() skip case handlings
  mm/madvise: remove len parameter of madvise_do_behavior()
  mm/madvise: define and use madvise_behavior struct for
    madvise_do_behavior()
  mm/madvise: pass madvise_behavior struct to madvise_vma_behavior()
  mm/madvise: make madvise_walk_vmas() visit function receives a void
    pointer
  mm/madvise: pass madvise_behavior struct to madvise_dontneed_free()
  mm/memory: split non-tlb flushing part from zap_page_range_single()
  mm/madvise: let madvise_dontneed_single_vma() caller batches tlb
    flushes
  mm/madvise: let madvise_free_single_vma() caller batches tlb flushes
  mm/madvise: batch tlb flushes for
    process_madvise(MADV_DONTNEED[_LOCKED])
  mm/madvise: batch tlb flushes for process_madvise(MADV_FREE)
  mm/madvise: batch tlb flushes for
    madvise(MADV_{DONTNEED[_LOCKED],FREE}
  mm/madvise: remove !tlb support from madvise_dontneed_single_vma()
  mm/madvise: remove !caller_tlb case of madvise_free_single_vma()

 mm/internal.h |   3 +
 mm/madvise.c  | 176 ++++++++++++++++++++++++++++++++++----------------
 mm/memory.c   |  36 +++++++----
 3 files changed, 144 insertions(+), 71 deletions(-)


base-commit: f653b037b4a6d83c68098fc3949090dfb63316fb
-- 
2.39.5