On Sun, Nov 10, 2024 at 08:28:00AM -0700, Jens Axboe wrote:
> Add RWF_UNCACHED as a read operation flag, which means that any data
> read will be removed from the page cache upon completion. Uses the page
> cache to synchronize, and simply prunes folios that were instantiated
> when the operation completes. While it would be possible to use private
> pages for this, using the page cache as synchronization is handy for a
> variety of reasons:
>
> 1) No special truncate magic is needed
> 2) Async buffered reads need some place to serialize, using the page
>    cache is a lot easier than writing extra code for this
> 3) The pruning cost is pretty reasonable
>
> and the code to support this is much simpler as a result.
>
> You can think of uncached buffered IO as being the much more attractive
> cousin of O_DIRECT - it has none of the restrictions of O_DIRECT. Yes,
> it will copy the data, but unlike regular buffered IO, it doesn't run
> into the unpredictability of the page cache in terms of reclaim. As an
> example, on a test box with 32 drives, reading them with buffered IO
> looks as follows:
>
> Reading bs 65536, uncached 0
> 1s: 145945MB/sec
> 2s: 158067MB/sec
> 3s: 157007MB/sec
> 4s: 148622MB/sec
> 5s: 118824MB/sec
> 6s: 70494MB/sec
> 7s: 41754MB/sec
> 8s: 90811MB/sec
> 9s: 92204MB/sec
> 10s: 95178MB/sec
> 11s: 95488MB/sec
> 12s: 95552MB/sec
> 13s: 96275MB/sec
>
> where it's quite easy to see where the page cache filled up, and
> performance went from good to erratic, then finally settled at a much
> lower rate. Looking at top while this is ongoing, we see:
>
>     PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>    7535 root      20   0  267004      0      0 S  3199   0.0   8:40.65 uncached
>    3326 root      20   0       0      0      0 R 100.0   0.0   0:16.40 kswapd4
>    3327 root      20   0       0      0      0 R 100.0   0.0   0:17.22 kswapd5
>    3328 root      20   0       0      0      0 R 100.0   0.0   0:13.29 kswapd6
>    3332 root      20   0       0      0      0 R 100.0   0.0   0:11.11 kswapd10
>    3339 root      20   0       0      0      0 R 100.0   0.0   0:16.25 kswapd17
>    3348 root      20   0       0      0      0 R 100.0   0.0   0:16.40 kswapd26
>    3343 root      20   0       0      0      0 R 100.0   0.0   0:16.30 kswapd21
>    3344 root      20   0       0      0      0 R 100.0   0.0   0:11.92 kswapd22
>    3349 root      20   0       0      0      0 R 100.0   0.0   0:16.28 kswapd27
>    3352 root      20   0       0      0      0 R  99.7   0.0   0:11.89 kswapd30
>    3353 root      20   0       0      0      0 R  96.7   0.0   0:16.04 kswapd31
>    3329 root      20   0       0      0      0 R  96.4   0.0   0:11.41 kswapd7
>    3345 root      20   0       0      0      0 R  96.4   0.0   0:13.40 kswapd23
>    3330 root      20   0       0      0      0 S  91.1   0.0   0:08.28 kswapd8
>    3350 root      20   0       0      0      0 S  86.8   0.0   0:11.13 kswapd28
>    3325 root      20   0       0      0      0 S  76.3   0.0   0:07.43 kswapd3
>    3341 root      20   0       0      0      0 S  74.7   0.0   0:08.85 kswapd19
>    3334 root      20   0       0      0      0 S  71.7   0.0   0:10.04 kswapd12
>    3351 root      20   0       0      0      0 R  60.5   0.0   0:09.59 kswapd29
>    3323 root      20   0       0      0      0 R  57.6   0.0   0:11.50 kswapd1
> [...]
>
> which is just showing a partial list of the 32 kswapd threads that are
> running mostly full tilt, burning ~28 full CPU cores.
>
> If the same test case is run with RWF_UNCACHED set for the buffered read,
> the output looks as follows:
>
> Reading bs 65536, uncached 1
> 1s: 153144MB/sec
> 2s: 156760MB/sec
> 3s: 158110MB/sec
> 4s: 158009MB/sec
> 5s: 158043MB/sec
> 6s: 157638MB/sec
> 7s: 157999MB/sec
> 8s: 158024MB/sec
> 9s: 157764MB/sec
> 10s: 157477MB/sec
> 11s: 157417MB/sec
> 12s: 157455MB/sec
> 13s: 157233MB/sec
> 14s: 156692MB/sec
>
> which is just chugging along at ~155GB/sec of read performance. Looking
> at top, we see:
>
>     PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>    7961 root      20   0  267004      0      0 S  3180   0.0   5:37.95 uncached
>    8024 axboe     20   0   14292   4096      0 R   1.0   0.0   0:00.13 top
>
> where just the test app is using CPU; no reclaim is taking place outside
> of the main thread. Not only is performance 65% better, it's also using
> half the CPU to do it.
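For reference, reads like the above can be driven from userspace with
preadv2(2). Below is a minimal sketch, assuming the RWF_UNCACHED uapi
definition this series adds; the fallback value here is an assumption
and should come from <linux/fs.h> once the series lands:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_UNCACHED
#define RWF_UNCACHED    0x00000080      /* assumed value from this series */
#endif

int main(int argc, char **argv)
{
        static char buf[65536];
        struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
        off_t off = 0;
        ssize_t ret;
        int fd;

        if (argc < 2) {
                fprintf(stderr, "usage: %s <file>\n", argv[0]);
                return 1;
        }
        fd = open(argv[1], O_RDONLY);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        /*
         * Regular buffered reads, except the kernel prunes the folios
         * that were instantiated for each read once it completes.
         */
        while ((ret = preadv2(fd, &iov, 1, off, RWF_UNCACHED)) > 0)
                off += ret;
        if (ret < 0)
                perror("preadv2");
        close(fd);
        return ret < 0;
}

A kernel that doesn't know the flag rejects it with -EOPNOTSUPP, so the
fallback define is only useful for testing against a patched kernel.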
>
> Signed-off-by: Jens Axboe <axboe@xxxxxxxxx>
> ---
>  mm/filemap.c | 18 ++++++++++++++++--
>  mm/swap.c    |  2 ++
>  2 files changed, 18 insertions(+), 2 deletions(-)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 38dc94b761b7..bd698340ef24 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2474,6 +2474,8 @@ static int filemap_create_folio(struct kiocb *iocb,
>  	folio = filemap_alloc_folio(mapping_gfp_mask(mapping), min_order);
>  	if (!folio)
>  		return -ENOMEM;
> +	if (iocb->ki_flags & IOCB_UNCACHED)
> +		__folio_set_uncached(folio);
>
>  	/*
>  	 * Protect against truncate / hole punch. Grabbing invalidate_lock
> @@ -2519,6 +2521,8 @@ static int filemap_readahead(struct kiocb *iocb, struct file *file,
>
>  	if (iocb->ki_flags & IOCB_NOIO)
>  		return -EAGAIN;
> +	if (iocb->ki_flags & IOCB_UNCACHED)
> +		ractl.uncached = 1;
>  	page_cache_async_ra(&ractl, folio, last_index - folio->index);
>  	return 0;
>  }
> @@ -2548,6 +2552,8 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
>  			return -EAGAIN;
>  		if (iocb->ki_flags & IOCB_NOWAIT)
>  			flags = memalloc_noio_save();
> +		if (iocb->ki_flags & IOCB_UNCACHED)
> +			ractl.uncached = 1;
>  		page_cache_sync_ra(&ractl, last_index - index);
>  		if (iocb->ki_flags & IOCB_NOWAIT)
>  			memalloc_noio_restore(flags);
> @@ -2706,8 +2712,16 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
>  			}
>  		}
>  put_folios:
> -		for (i = 0; i < folio_batch_count(&fbatch); i++)
> -			folio_put(fbatch.folios[i]);
> +		for (i = 0; i < folio_batch_count(&fbatch); i++) {
> +			struct folio *folio = fbatch.folios[i];
> +
> +			if (folio_test_uncached(folio)) {
> +				folio_lock(folio);
> +				invalidate_complete_folio2(mapping, folio, 0);
> +				folio_unlock(folio);

I am not sure this is safe. What happens if it races with a page fault?

The only current caller of invalidate_complete_folio2() unmaps the folio
explicitly before calling it, and the folio lock prevents re-faulting.

I think we need to give up PG_uncached if we see folio_mapped(), and
maybe also mark the page accessed.

> +			}
> +			folio_put(folio);
> +		}
>  		folio_batch_init(&fbatch);
>  	} while (iov_iter_count(iter) && iocb->ki_pos < isize && !error);
>
> diff --git a/mm/swap.c b/mm/swap.c
> index 835bdf324b76..f2457acae383 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -472,6 +472,8 @@ static void folio_inc_refs(struct folio *folio)
>   */
>  void folio_mark_accessed(struct folio *folio)
>  {
> +	if (folio_test_uncached(folio))
> +		return;

	if (folio_test_uncached(folio)) {
		if (folio_mapped(folio))
			folio_clear_uncached(folio);
		else
			return;
	}

>  	if (lru_gen_enabled()) {
>  		folio_inc_refs(folio);
>  		return;
> -- 
> 2.45.2
>

-- 
  Kiryl Shutsemau / Kirill A. Shutemov