Hi Neil! On Thu 10-02-22 16:37:52, NeilBrown wrote: > Add some "big-picture" documentation for read-ahead and polish the code > to make it fit this documentation. > > The meaning of ->async_size is clarified to match its name. > i.e. Any request to ->readahead() has a sync part and an async part. > The caller will wait for the sync pages to complete, but will not wait > for the async pages. The first async page is still marked PG_readahead So I don't think this is how the code was meant. My understanding of readahead comes from a comment: /* * On-demand readahead design. * .... in mm/readahead.c. The ra->size is how many pages should be read. ra->async_size is the "lookahead size" meaning that we should place a marker (PageReadahead) at "ra->size - ra->async_size" to trigger next readahead. > > - in try_context_readahead(), the async_sync is set correctly rather > than being set to 1. Prior to Commit 2cad40180197 ("readahead: make > context readahead more conservative") it was set to ra->size which > is not correct (that implies no sync component). As this was too > high and caused problems it was reduced to 1, again incorrect but less > problematic. The setting provided with this patch does not restore > those problems, and is now not arbitrary. I agree the 1 there looks strange as it effectively discards all the logic handling the lookahead size. I agree with the tweak there but I would do this behavioral change as a separate commit since it can have performance implications. > - The calculation of ->async_size in the initial_readahead section of > ondemand_readahead() now makes sense - it is zero if the chosen > size does not exceed the requested size. This means that we will not > set the PG_readahead flag in this case, but as the requested size > has not been satisfied we can expect a subsequent read ahead request > any way. So I agree that setting of ->async_size to ->size in initial_readahead section does not make great sence but if you look a bit below into readit section, you will notice the ->async_size is overwritten there to something meaninful. So I think the code actually does something sensible, maybe it could be written in a more readable way. > Note that the current function names page_cache_sync_ra() and > page_cache_async_ra() are misleading. All ra request are partly sync > and partly async, so either part can be empty. The meaning of these names IMO is: page_cache_sync_ra() - tell readahead that we currently need a page ractl->_index and would prefer req_count pages fetched ahead. page_cache_async_ra() - called when we hit the lookahead marker to give opportunity to readahead code to prefetch more pages. > A page_cache_sync_ra() request will usually set ->async_size non-zero, > implying it is not all synchronous. > When a non-zero req_count is passed to page_cache_async_ra(), the > implication is that some prefix of the request is synchronous, though > the calculation made there is incorrect - I haven't tried to fix it. > > Signed-off-by: NeilBrown <neilb@xxxxxxx> Honza > --- > Documentation/core-api/mm-api.rst | 19 ++++++- > Documentation/filesystems/vfs.rst | 16 ++++-- > include/linux/fs.h | 9 +++ > mm/readahead.c | 103 ++++++++++++++++++++++++++++++++++++- > 4 files changed, 135 insertions(+), 12 deletions(-) > > diff --git a/Documentation/core-api/mm-api.rst b/Documentation/core-api/mm-api.rst > index 395835f9289f..f5b2f92822c8 100644 > --- a/Documentation/core-api/mm-api.rst > +++ b/Documentation/core-api/mm-api.rst > @@ -58,15 +58,30 @@ Virtually Contiguous Mappings > File Mapping and Page Cache > =========================== > > -.. kernel-doc:: mm/readahead.c > - :export: > +Filemap > +------- > > .. kernel-doc:: mm/filemap.c > :export: > > +Readahead > +--------- > + > +.. kernel-doc:: mm/readahead.c > + :doc: Readahead Overview > + > +.. kernel-doc:: mm/readahead.c > + :export: > + > +Writeback > +--------- > + > .. kernel-doc:: mm/page-writeback.c > :export: > > +Truncate > +-------- > + > .. kernel-doc:: mm/truncate.c > :export: > > diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst > index bf5c48066fac..b4a0baa46dcc 100644 > --- a/Documentation/filesystems/vfs.rst > +++ b/Documentation/filesystems/vfs.rst > @@ -806,12 +806,16 @@ cache in your filesystem. The following members are defined: > object. The pages are consecutive in the page cache and are > locked. The implementation should decrement the page refcount > after starting I/O on each page. Usually the page will be > - unlocked by the I/O completion handler. If the filesystem decides > - to stop attempting I/O before reaching the end of the readahead > - window, it can simply return. The caller will decrement the page > - refcount and unlock the remaining pages for you. Set PageUptodate > - if the I/O completes successfully. Setting PageError on any page > - will be ignored; simply unlock the page if an I/O error occurs. > + unlocked by the I/O completion handler. The set of pages are > + divided into some sync pages followed by some async pages, > + rac->ra->async_size gives the number of async pages. The > + filesystem should attempt to read all sync pages but may decide > + to stop once it reaches the async pages. If it does decide to > + stop attempting I/O, it can simply return. The caller will > + remove the remaining pages from the address space, unlock them > + and decrement the page refcount. Set PageUptodate if the I/O > + completes successfully. Setting PageError on any page will be > + ignored; simply unlock the page if an I/O error occurs. > > ``readpages`` > called by the VM to read pages associated with the address_space > diff --git a/include/linux/fs.h b/include/linux/fs.h > index e2d892b201b0..8b5c486bd4a2 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -930,10 +930,15 @@ struct fown_struct { > * struct file_ra_state - Track a file's readahead state. > * @start: Where the most recent readahead started. > * @size: Number of pages read in the most recent readahead. > - * @async_size: Start next readahead when this many pages are left. > - * @ra_pages: Maximum size of a readahead request. > + * @async_size: Numer of pages that were/are not needed immediately > + * and so were/are genuinely "ahead". Start next readahead when > + * the first of these pages is accessed. > + * @ra_pages: Maximum size of a readahead request, copied from the bdi. > * @mmap_miss: How many mmap accesses missed in the page cache. > * @prev_pos: The last byte in the most recent read request. > + * > + * When this structure is passed to ->readahead(), the "most recent" > + * readahead means the current readahead. > */ > struct file_ra_state { > pgoff_t start; > diff --git a/mm/readahead.c b/mm/readahead.c > index cf0dcf89eb69..c44b2957f59f 100644 > --- a/mm/readahead.c > +++ b/mm/readahead.c > @@ -8,6 +8,105 @@ > * Initial version. > */ > > +/** > + * DOC: Readahead Overview > + * > + * Readahead is used to read content into the page cache before it is > + * explicitly requested by the application. Readahead only ever > + * attempts to read pages that are not yet in the page cache. If a > + * page is present but not up-to-date, readahead will not try to read > + * it. In that case a simple ->readpage() will be requested. > + * > + * Readahead is triggered when an application read request (whether a > + * systemcall or a page fault) finds that the requested page is not in > + * the page cache, or that it is in the page cache and has the > + * %PG_readahead flag set. This flag indicates that the page was loaded > + * as part of a previous read-ahead request and now that it has been > + * accessed, it is time for the next read-ahead. > + * > + * Each readahead request is partly synchronous read, and partly async > + * read-ahead. This is reflected in the struct file_ra_state which > + * contains ->size being to total number of pages, and ->async_size > + * which is the number of pages in the async section. The first page in > + * this async section will have %PG_readahead set as a trigger for a > + * subsequent read ahead. Once a series of sequential reads has been > + * established, there should be no need for a synchronous component and > + * all read ahead request will be fully asynchronous. > + * > + * When either of the triggers causes a readahead, three numbers need to > + * be determined: the start of the region, the size of the region, and > + * the size of the async tail. > + * > + * The start of the region is simply the first page address at or after > + * the accessed address, which is not currently populated in the page > + * cache. This is found with a simple search in the page cache. > + * > + * The size of the async tail is determined by subtracting the size that > + * was explicitly requested from the determined request size, unless > + * this would be less than zero - then zero is used. NOTE THIS > + * CALCULATION IS WRONG WHEN THE START OF THE REGION IS NOT THE ACCESSED > + * PAGE. > + * > + * The size of the region is normally determined from the size of the > + * previous readahead which loaded the preceding pages. This may be > + * discovered from the struct file_ra_state for simple sequential reads, > + * or from examining the state of the page cache when multiple > + * sequential reads are interleaved. Specifically: where the readahead > + * was triggered by the %PG_readahead flag, the size of the previous > + * readahead is assumed to be the number of pages from the triggering > + * page to the start of the new readahead. In these cases, the size of > + * the previous readahead is scaled, often doubled, for the new > + * readahead, though see get_next_ra_size() for details. > + * > + * If the size of the previous read cannot be determined, the number of > + * preceding pages in the page cache is used to estimate the size of > + * a previous read. This estimate could easily be misled by random > + * reads being coincidentally adjacent, so it is ignored unless it is > + * larger than the current request, and it is not scaled up, unless it > + * is at the start of file. > + * > + * In general read ahead is accelerated at the start of the file, as > + * reads from there are often sequential. There are other minor > + * adjustments to the read ahead size in various special cases and these > + * are best discovered by reading the code. > + * > + * The above calculation determines the readahead, to which any requested > + * read size may be added. > + * > + * Readahead requests are sent to the filesystem using the ->readahead() > + * address space operation, for which mpage_readahead() is a canonical > + * implementation. ->readahead() should normally initiate reads on all > + * pages, but may fail to read any or all pages without causing an IO > + * error. The page cache reading code will issue a ->readpage() request > + * for any page which ->readahead() does not provided, and only an error > + * from this will be final. > + * > + * ->readahead() will generally call readahead_page() repeatedly to get > + * each page from those prepared for read ahead. It may fail to read a > + * page by: > + * > + * * not calling readahead_page() sufficiently many times, effectively > + * ignoring some pages, as might be appropriate if the path to > + * storage is congested. > + * > + * * failing to actually submit a read request for a given page, > + * possibly due to insufficient resources, or > + * > + * * getting an error during subsequent processing of a request. > + * > + * In the last two cases, the page should be unlocked to indicate that > + * the read attempt has failed. In the first case the page will be > + * unlocked by the caller. > + * > + * Those pages not in the final ``async_size`` of the request should be > + * considered to be important and ->readahead() should not fail them due > + * to congestion or temporary resource unavailability, but should wait > + * for necessary resources (e.g. memory or indexing information) to > + * become available. Pages in the final ``async_size`` may be > + * considered less urgent and failure to read them is more acceptable. > + * They will eventually be read individually using ->readpage(). > + */ > + > #include <linux/kernel.h> > #include <linux/dax.h> > #include <linux/gfp.h> > @@ -426,7 +525,7 @@ static int try_context_readahead(struct address_space *mapping, > > ra->start = index; > ra->size = min(size + req_size, max); > - ra->async_size = 1; > + ra->async_size = ra->size - req_size; > > return 1; > } > @@ -527,7 +626,7 @@ static void ondemand_readahead(struct readahead_control *ractl, > initial_readahead: > ra->start = index; > ra->size = get_init_ra_size(req_size, max_pages); > - ra->async_size = ra->size > req_size ? ra->size - req_size : ra->size; > + ra->async_size = ra->size > req_size ? ra->size - req_size : 0; > > readit: > /* > > -- Jan Kara <jack@xxxxxxxx> SUSE Labs, CR