The below commit message talks about tlb bits, but you spend a lot of
this series refactoring mm/madvise.c. Can we just separate the two out
into separate series please?

I am doing the same kind of thing with mremap() at the moment, but sent
the refactor _first_, before the work that builds upon it :)

Your refactoring is great, so I want to be able to take that (and
Andrew's objections obviously don't apply there), and then we can
address the tlb stuff separately and in a more focused way.

On Mon, Mar 10, 2025 at 10:23:09AM -0700, SeongJae Park wrote:
> When process_madvise() is called to do MADV_DONTNEED[_LOCKED] or
> MADV_FREE with multiple address ranges, tlb flushes happen for each of
> the given address ranges. Because such tlb flushes are for the same
> process, doing those in a batch is more efficient while still being
> safe. Modify the process_madvise() entry level code path to do such
> batched tlb flushes, while the internal unmap logic does only the
> gathering of the tlb entries to flush.
>
> In more detail, modify the entry functions to initialize an mmu_gather
> object and pass it to the internal logic, and make the internal logic
> only gather the tlb entries to flush into the received mmu_gather
> object. After all internal function calls are done, the entry
> functions flush the gathered tlb entries at once.
>
> The inefficiency should be smaller for the madvise() use case, since
> it receives only a single address range. But if there are multiple
> vmas for the range, the same problem can happen. It is unclear whether
> such a use case is common and whether the inefficiency is significant.
> Make the change for madvise(), too, since it doesn't really change
> madvise() internal behavior while it helps keep the code shared
> between the process_madvise() and madvise() internal logic clean.
>
> Patch Sequence
> ==============
>
> The first four patches are minor cleanups of madvise.c for
> readability.
>
> The fifth patch defines a new data structure for managing the
> information required for batched tlb flushes (mmu_gather and
> behavior), and updates the MADV_DONTNEED[_LOCKED] and MADV_FREE
> handling internal logic code paths to receive it.
>
> The sixth and seventh patches make the internal logic for handling
> MADV_DONTNEED[_LOCKED] and MADV_FREE ready for batched tlb flushing.
> The patches keep support for the unbatched tlb flushes use case, for
> fine-grained and safe transitions.
>
> The eighth patch updates the madvise() and process_madvise() code to
> do the batched tlb flushes, utilizing the changes introduced by the
> previous patches.
>
> The final, ninth patch removes the internal logic's support code for
> the unbatched tlb flushes use case, which is no longer used.
>
> Test Results
> ============
>
> I measured the latency of applying MADV_DONTNEED advice to 256 MiB of
> memory using multiple process_madvise() calls. I apply the advice at
> 4 KiB region granularity, with varying batch size per
> process_madvise() call (vlen) from 1 to 1024. The source code for the
> measurement is available at GitHub[1]. To reduce measurement errors,
> I did the measurement five times.
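
(As an aside, and mostly to check my own understanding of the setup: I
haven't dug through [1], but I assume the measurement loop looks
roughly like the below, with sz_batch being the vlen passed per
process_madvise() call -- please correct me if the setup differs. All
the names and details here are mine, not from your benchmark, and
error handling is trimmed.)

	/* rough sketch of my reading of the benchmark, not the actual code */
	#define _GNU_SOURCE
	#include <stdlib.h>
	#include <string.h>
	#include <unistd.h>
	#include <sys/mman.h>
	#include <sys/syscall.h>
	#include <sys/uio.h>

	#define TOTAL_SZ	(256UL << 20)	/* 256 MiB */
	#define REGION_SZ	4096UL		/* 4 KiB per iovec entry */

	int main(int argc, char *argv[])
	{
		size_t vlen = argc > 1 ? strtoul(argv[1], NULL, 0) : 1;
		size_t nr_regions = TOTAL_SZ / REGION_SZ;
		struct iovec *iov = calloc(vlen, sizeof(*iov));
		int pidfd = syscall(SYS_pidfd_open, getpid(), 0);
		char *buf = mmap(NULL, TOTAL_SZ, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		size_t i, j;

		memset(buf, 1, TOTAL_SZ);	/* fault the pages in */

		/* advise own memory in 4 KiB regions, 'vlen' of them per call */
		for (i = 0; i < nr_regions; i += j) {
			for (j = 0; j < vlen && i + j < nr_regions; j++) {
				iov[j].iov_base = buf + (i + j) * REGION_SZ;
				iov[j].iov_len = REGION_SZ;
			}
			if (syscall(__NR_process_madvise, pidfd, iov, j,
				    MADV_DONTNEED, 0) < 0)
				return 1;
		}
		return 0;
	}
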
> The measurement results are as below. The 'sz_batch' column shows the
> batch size of the process_madvise() calls. The 'Before' and 'After'
> columns show the average of the latencies in nanoseconds, measured
> five times, on kernels built without and with the tlb flushes batching
> patches of this series, respectively. For the baseline, the
> mm-unstable tree of 2025-03-07[2] has been used. The 'B-stdev' and
> 'A-stdev' columns show the ratio of the standard deviation of the
> latency measurements to the average, in percent, for 'Before' and
> 'After', respectively. 'Latency_reduction' shows the reduction of the
> latency that the patches have achieved, in percent. Higher
> 'Latency_reduction' values mean larger efficiency improvements.
>
>     sz_batch    Before         B-stdev    After          A-stdev    Latency_reduction
>     1           128691595.4    6.09       106878698.4    2.76       16.95
>     2            94046750.8    3.30        68691907      2.79       26.96
>     4            80538496.8    5.35        50230673.8    5.30       37.63
>     8            72672225.2    5.50        43918112      3.54       39.57
>     16           66955104.4    4.25        36549941.2    1.62       45.41
>     32           65814679      5.89        33060291      3.30       49.77
>     64           65099205.2    2.69        26003756.4    1.56       60.06
>     128          62591307.2    4.02        24008180.4    1.61       61.64
>     256          64124368.6    2.93        23868856      2.20       62.78
>     512          62325618      5.72        23311330.6    1.74       62.60
>     1024         64802138.4    5.05        23651695.2    3.40       63.50
>
> Interestingly, the latency is reduced (improved) even with batch size
> 1. I think some compiler optimizations have affected that, as was
> also observed with the previous process_madvise() mmap_lock
> optimization patch series[3].
>
> So, let's focus on the relation between the improvement and the batch
> size. As expected, tlb flushes batching provides a latency reduction
> that is proportional to the batch size. The efficiency gain ranges
> from about 27 percent with batch size 2 up to about 63 percent with
> batch size 1,024.
>
> Please note that this is a very simple microbenchmark, so the real
> efficiency gain on real workloads could be very different.
>
> Changes from RFC
> (https://lore.kernel.org/20250305181611.54484-1-sj@xxxxxxxxxx)
> - Clarify motivation of the change on the cover letter
> - Add average and stdev of evaluation results
> - Show latency reduction on evaluation results
> - Fix !CONFIG_MEMORY_FAILURE build error
> - Rename is_memory_populate() to is_madvise_populate()
> - Squash patches 5-8
> - Add kerneldoc for unmap_vm_area_struct()
> - Squash patches 10 and 11
> - Squash patches 12-14
> - Squash patches 15 and 16
>
> References
> ==========
>
> [1] https://github.com/sjp38/eval_proc_madvise
> [2] commit e664d7d28a7c ("selftest: test system mappings are sealed") # mm-unstable
> [3] https://lore.kernel.org/20250211182833.4193-1-sj@xxxxxxxxxx
>
> SeongJae Park (9):
>   mm/madvise: use is_memory_failure() from madvise_do_behavior()
>   mm/madvise: split out populate behavior check logic
>   mm/madvise: deduplicate madvise_do_behavior() skip case handlings
>   mm/madvise: remove len parameter of madvise_do_behavior()
>   mm/madvise: define and use madvise_behavior struct for
>     madvise_do_behavior()
>   mm/memory: split non-tlb flushing part from zap_page_range_single()
>   mm/madvise: let madvise_{dontneed,free}_single_vma() caller batches
>     tlb flushes
>   mm/madvise: batch tlb flushes for
>     [process_]madvise(MADV_{DONTNEED[_LOCKED],FREE})
>   mm/madvise: remove !tlb support from
>     madvise_{dontneed,free}_single_vma()
>
>  mm/internal.h |   3 +
>  mm/madvise.c  | 221 +++++++++++++++++++++++++++++++++-----------------
>  mm/memory.c   |  38 ++++++---
>  3 files changed, 176 insertions(+), 86 deletions(-)
>
>
> base-commit: e993f5f5b0ac851cf60578cfee5488031dfaa80c
> --
> 2.39.5
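
For what it's worth, to make sure I'm reading the shape of patches 5-8
correctly: the end state I understand you are aiming for is roughly the
below, where the per-range internal logic only gathers into the passed
mmu_gather and the entry point does a single flush for the whole batch.
The names below are placeholders of mine (and I'm glossing over the
exact signatures your series ends up with), not the actual helpers:

	/* rough paraphrase of my understanding, not code from the series */
	struct batch_ctx {		/* stand-in for the behavior struct */
		int behavior;
		struct mmu_gather *tlb;
	};

	/* stand-in for the internal logic; gathers into ctx->tlb, never flushes */
	int advise_one_range(struct mm_struct *mm, unsigned long start,
			     size_t len, struct batch_ctx *ctx);

	/* entry path shared by madvise() and process_madvise() */
	static int advise_ranges_batched(struct mm_struct *mm,
					 struct iovec *iov, size_t vlen,
					 int behavior)
	{
		struct mmu_gather tlb;
		struct batch_ctx ctx = { .behavior = behavior, .tlb = &tlb };
		size_t i;
		int ret = 0;

		tlb_gather_mmu(&tlb, mm);
		for (i = 0; i < vlen && !ret; i++)
			ret = advise_one_range(mm,
					(unsigned long)iov[i].iov_base,
					iov[i].iov_len, &ctx);
		tlb_finish_mmu(&tlb);	/* one flush for all gathered entries */

		return ret;
	}

If that's the picture, the split I'm asking for above would just move
the pure madvise.c cleanups (patches 1-4) into their own series, and
this shape would then be the focused tlb series on top.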