On Wed, Mar 05, 2025 at 10:15:55AM -0800, SeongJae Park wrote:
> For MADV_DONTNEED[_LOCKED] or MADV_FREE madvise requests, tlb flushes
> can happen for each vma of the given address ranges. Because such tlb
> flushes are for address ranges of the same process, doing those in a
> batch is more efficient while still being safe. Modify madvise() and
> process_madvise() entry level code paths to do such batched tlb
> flushes, while the internal unmap logics do only gathering of the tlb
> entries to flush.
>
> In more detail, modify the entry functions to initialize an mmu_gather
> object and pass it to the internal logics. Also modify the internal
> logics to do only gathering of the tlb entries to flush into the
> received mmu_gather object. After all internal function calls are
> done, the entry functions finish the mmu_gather object to flush the
> gathered tlb entries in one batch.
>
> Patches Sequence
> ================
>
> The first four patches are minor cleanups of madvise.c for
> readability.
>
> The following four patches (patches 5-8) define a new data structure
> for managing information required for batched tlb flushing (mmu_gather
> and behavior), and update the code paths for the internal
> MADV_DONTNEED[_LOCKED] and MADV_FREE handling logics to receive it.
>
> Three patches (patches 9-11) for making internal MADV_DONTNEED[_LOCKED]
> and MADV_FREE handling logic ready for batched tlb flushing follow.

I think you forgot to complete the above sentence, or the 'follow' at
the end seems weird.

> The patches keep the support of the unbatched tlb flushes use case,
> for fine-grained and safe transitions.
>
> The next three patches (patches 12-14) update the madvise() and
> process_madvise() code to do the batched tlb flushes utilizing the
> changes introduced by the previous patches.
>
> The final two patches (patches 15-16) clean up the internal logics'
> unbatched tlb flushes use case support code, which is no longer used.
>
> Test Results
> ============
>
> I measured the time to apply MADV_DONTNEED advice to 256 MiB of memory
> using multiple process_madvise() calls. I apply the advice at 4 KiB
> region granularity, but with varying batch size (vlen) from 1 to 1024.
> The source code for the measurement is available at GitHub[1].
>
> The measurement results are as below. The 'sz_batches' column shows
> the batch size of the process_madvise() calls. The 'before' and
> 'after' columns are the measured times to apply MADV_DONTNEED to the
> 256 MiB memory buffer in nanoseconds, on kernels built without and
> with the MADV_DONTNEED tlb flushes batching patches of this series,
> respectively. For the baseline, the mm-unstable tree of 2025-03-04[2]
> has been used. The 'after/before' column is the ratio of 'after' to
> 'before'. So an 'after/before' value lower than 1.0 means this patch
> increased efficiency over the baseline, and a lower value means better
> efficiency.

I would recommend replacing the 'after/before' column with a
percentage, i.e. percentage improvement or degradation.
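As a side note, just to make sure I read the description at the top of
the cover letter correctly: my mental model of the batching pattern is
roughly the sketch below. tlb_gather_mmu()/tlb_finish_mmu() are the
existing mm APIs; do_madvise_batched() and madvise_walk_vmas_gather()
are names I made up for illustration, so please don't read this as the
actual patch.

static int do_madvise_batched(struct mm_struct *mm, unsigned long start,
			      size_t len, int behavior)
{
	struct mmu_gather tlb;
	int ret;

	/* entry level: initialize one mmu_gather for the whole request */
	tlb_gather_mmu(&tlb, mm);
	/* internal logic only gathers entries into 'tlb'; no flush here */
	ret = madvise_walk_vmas_gather(mm, start, start + len, behavior,
				       &tlb);
	/* one flush covering all vmas of the request */
	tlb_finish_mmu(&tlb);
	return ret;
}

If that matches your intent, the numbers below make sense to me.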
>
> sz_batches   before      after       after/before
> 1            102842895   106507398   1.03563204828102
> 2            73364942    74529223    1.01586971880929
> 4            58823633    51608504    0.877343022998937
> 8            47532390    44820223    0.942940655834895
> 16           43591587    36727177    0.842529018271347
> 32           44207282    33946975    0.767904595446515
> 64           41832437    26738286    0.639175910310939
> 128          40278193    23262940    0.577556694263817
> 256          41568533    22355103    0.537789077136785
> 512          41626638    22822516    0.54826709762148
> 1024         44440870    22676017    0.510251419470411
>
> For batch sizes <= 2, tlb flush batching shows no big difference, only
> a slight overhead. I think that's within the error range of this
> simple micro-benchmark, and therefore can be ignored.

I would recommend running the experiment multiple times and reporting
averages and standard deviations, which would support your error-range
claim.

> Starting from batch size 4, however, tlb flush batching shows a clear
> efficiency gain. The efficiency gain tends to be proportional to the
> batch size, as expected. The efficiency gain ranges from about 13
> percent with batch size 4 up to about 49 percent with batch size
> 1,024.
>
> Please note that this is a very simple microbenchmark, so the real
> efficiency gain on real workloads could be very different.
>

I think you are running a single-threaded benchmark on an otherwise
idle machine. I expect this series to be much more beneficial on a
loaded machine and for multi-threaded applications. No need to test
that scenario, but if you have already done so, it would be good to
report the results.
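For reference, below is roughly the kind of single-threaded measurement
loop I am assuming from your description: a 256 MiB buffer advised in
4 KiB regions, with 'vlen' iovec entries per process_madvise() call.
This is a simplified sketch written from the cover letter, not the code
at [1]; it assumes a libc and kernel recent enough to provide
SYS_process_madvise and SYS_pidfd_open and to accept MADV_DONTNEED via
process_madvise(), and error handling is omitted.

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <time.h>
#include <unistd.h>

#define BUF_SZ		(256UL << 20)	/* 256 MiB */
#define CHUNK_SZ	(4UL << 10)	/* 4 KiB per iovec entry */

int main(int argc, char *argv[])
{
	size_t vlen = argc > 1 ? strtoul(argv[1], NULL, 0) : 16;
	int pidfd = syscall(SYS_pidfd_open, getpid(), 0);
	char *buf = mmap(NULL, BUF_SZ, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	struct iovec *iov = calloc(vlen, sizeof(*iov));
	struct timespec start, end;
	size_t off = 0;

	memset(buf, 1, BUF_SZ);		/* fault all pages in first */

	clock_gettime(CLOCK_MONOTONIC, &start);
	while (off < BUF_SZ) {
		size_t i, nr = 0;

		for (i = 0; i < vlen && off < BUF_SZ; i++, off += CHUNK_SZ) {
			iov[i].iov_base = buf + off;
			iov[i].iov_len = CHUNK_SZ;
			nr++;
		}
		/* one syscall advises 'nr' 4 KiB regions of this process */
		syscall(SYS_process_madvise, pidfd, iov, nr,
			MADV_DONTNEED, 0);
	}
	clock_gettime(CLOCK_MONOTONIC, &end);

	printf("vlen %zu: %ld ns\n", vlen,
	       (end.tv_sec - start.tv_sec) * 1000000000L +
	       (end.tv_nsec - start.tv_nsec));
	return 0;
}

If the actual loop at [1] differs meaningfully from this, please ignore
the sketch.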