On Thu, Feb 01, 2024 at 11:43:11PM -0500, Mike Snitzer wrote:
> On Thu, Feb 01 2024 at 9:20P -0500,
> Ming Lei <ming.lei@xxxxxxxxxx> wrote:
>
> > madvise(MADV_POPULATE_READ) tries to populate all page tables in the
> > specified range, so it is usually sequential IO if the VMA is backed
> > by a file.
> >
> > Set ra_pages to the device max request size for the readahead
> > involved in MADV_POPULATE_READ; this reduces the latency of
> > madvise(MADV_POPULATE_READ) to 1/10 when running
> > madvise(MADV_POPULATE_READ) over one 1GB file with the usual
> > (default) 128KB read_ahead_kb.
> >
> > Cc: David Hildenbrand <david@xxxxxxxxxx>
> > Cc: Matthew Wilcox <willy@xxxxxxxxxxxxx>
> > Cc: Alexander Viro <viro@xxxxxxxxxxxxxxxxxx>
> > Cc: Christian Brauner <brauner@xxxxxxxxxx>
> > Cc: Don Dutile <ddutile@xxxxxxxxxx>
> > Cc: Rafael Aquini <raquini@xxxxxxxxxx>
> > Cc: Dave Chinner <david@xxxxxxxxxxxxx>
> > Cc: Mike Snitzer <snitzer@xxxxxxxxxx>
> > Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> > Signed-off-by: Ming Lei <ming.lei@xxxxxxxxxx>
> > ---
> >  mm/madvise.c | 52 +++++++++++++++++++++++++++++++++++++++++++++++++++-
> >  1 file changed, 51 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 912155a94ed5..db5452c8abdd 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -900,6 +900,37 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
> >  	return -EINVAL;
> >  }
> >
> > +static void madvise_restore_ra_win(struct file **file, unsigned int ra_pages)
> > +{
> > +	if (*file) {
> > +		struct file *f = *file;
> > +
> > +		f->f_ra.ra_pages = ra_pages;
> > +		fput(f);
> > +		*file = NULL;
> > +	}
> > +}
> > +
> > +static struct file *madvise_override_ra_win(struct file *f,
> > +		unsigned long start, unsigned long end,
> > +		unsigned int *old_ra_pages)
> > +{
> > +	unsigned int io_pages;
> > +
> > +	if (!f || !f->f_mapping || !f->f_mapping->host)
> > +		return NULL;
> > +
> > +	io_pages = inode_to_bdi(f->f_mapping->host)->io_pages;
> > +	if (((end - start) >> PAGE_SHIFT) < io_pages)
> > +		return NULL;
> > +
> > +	f = get_file(f);
> > +	*old_ra_pages = f->f_ra.ra_pages;
> > +	f->f_ra.ra_pages = io_pages;
> > +
> > +	return f;
> > +}
> > +
>
> Does this override imply that madvise_populate resorts to calling
> filemap_fault() and here you're just arming it to use the larger
> ->io_pages for the duration of all associated faulting?

Yes.

> Wouldn't it be better to avoid faulting and build up larger page

How can we avoid the fault handling? It is needed to build the VA->PA
mapping.

> vectors that get sent down to the block layer in one go and let the

filemap_fault() already tries to allocate folios in big sizes (the max
order is MAX_PAGECACHE_ORDER), see page_cache_ra_order() and
ra_alloc_folio().

> block layer split using the device's limits? (like happens with
> force_page_cache_ra)

Here the filemap code doesn't deal with the block layer directly,
because the VFS & FS sit in between and io mapping is required; it just
calls aops->readahead() or aops->read_folio(), while the block plug and
readahead_control are applied to handle everything in batch.

> I'm concerned that madvise_populate isn't so efficient with filemap

That is why this patch increases the readahead window, so that
madvise_populate() performance can be improved by 10x for big
file-backed populate reads.

> due to excessive faulting (*BUT* I haven't traced to know, I'm just
> inferring that is why twiddling f->f_ra.ra_pages helps improve
> madvise_populate by having it issue larger IO. Apologies if I'm way
> off base)

As mentioned, fault handling can't be avoided, but we can improve the
involved readahead IO performance.

Thanks,
Ming