On Tue, Feb 01, 2022 at 10:00:23AM +1100, NeilBrown wrote:
> On Tue, 01 Feb 2022, Matthew Wilcox wrote:
> > On Mon, Jan 31, 2022 at 03:47:41PM +1100, NeilBrown wrote:
> > > On Mon, 31 Jan 2022, Matthew Wilcox wrote:
> > > > > +++ b/fs/fuse/file.c
> > > > > @@ -958,6 +958,8 @@ static void fuse_readahead(struct readahead_control *rac)
> > > > >
> > > > >  	if (fuse_is_bad(inode))
> > > > >  		return;
> > > > > +	if (fc->num_background >= fc->congestion_threshold)
> > > > > +		return;
> > > >
> > > > This seems like a bad idea to me.  If we don't even start reads on
> > > > readahead pages, they'll get ->readpage called on them one at a time
> > > > and the reading thread will block.  It's going to lead to some nasty
> > > > performance problems, exactly when you don't want them.  Better to
> > > > queue the reads internally and wait for congestion to ease before
> > > > submitting the read.
> > > >
> > >
> > > Isn't that exactly what happens now?  page_cache_async_ra() sees that
> > > inode_read_congested() returns true, so it doesn't start readahead.
> > > ???
> >
> > It's rather different.  Imagine the readahead window has expanded to
> > 256kB (64 pages).  Today, we see congestion and don't do anything.
> > That means we miss the async readahead opportunity, find a missing
> > page and end up calling into page_cache_sync_ra(), by which time
> > we may or may not be congested.
> >
> > If the inode_read_congested() in page_cache_async_ra() is removed and
> > the patch above is added to replace it, we'll allocate those 64 pages
> > and add them to the page cache.  But then we'll return without
> > starting IO.  When we hit one of those !uptodate pages, we'll call
> > ->readpage on it, but we won't do anything to the other 63 pages.
> > So we'll go through a protracted slow period of sending 64 reads,
> > one at a time, whether or not congestion has eased.  Then we'll hit
> > a missing page and proceed to the sync ra case as above.
>
> Hmmm... where is all this documented?
> The entry for readahead in vfs.rst says:
>
>     If the filesystem decides to stop attempting I/O before reaching the
>     end of the readahead window, it can simply return.
>
> but you are saying that if it simply returns, it'll most likely just get
> called again.  So maybe it shouldn't say that?

That's not what I'm saying at all.  I'm saying that if ->readahead
fails to read the page, ->readpage will be called to read the page
(if it's actually accessed).

> What do other filesystems do?
> ext4 sets REQ_RAHEAD, but otherwise just pushes ahead and submits all
> requests.  btrfs seems much the same.
> xfs uses iomap_readahead .. which again sets REQ_RAHEAD but otherwise
> just does a normal read.
>
> The effect of REQ_RAHEAD seems to be primarily to avoid retries on
> failure.
>
> So it seems that core read-ahead code is not set up to expect readahead
> to fail, though it is (begrudgingly) permitted.

Well, yes.  The vast majority of reads don't fail.

> The current inode_read_congested() test in page_cache_async_ra() seems
> to be just delaying the inevitable (and in fairness, the comment does
> say "Defer....").  Maybe just blocking on the congestion is an equally
> good way to delay it...

I don't think we should _block_ for an async read request.  We're in
the context of a process which has read a different page.  Maybe what
we need is a readahead_abandon() call that removes the just-added
pages from the page cache, so we fall back to doing a sync readahead?
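Roughly what I have in mind, purely as a sketch -- readahead_abandon()
does not exist today, and the fuse details are just lifted from the
patch quoted above:

static void fuse_readahead(struct readahead_control *rac)
{
	struct inode *inode = rac->mapping->host;
	struct fuse_conn *fc = get_fuse_conn(inode);

	if (fuse_is_bad(inode))
		return;

	if (fc->num_background >= fc->congestion_threshold) {
		/*
		 * Hypothetical call: drop the pages that were just added
		 * to the page cache for this readahead window, so the
		 * eventual access takes the sync readahead path instead
		 * of 64 single-page ->readpage calls.
		 */
		readahead_abandon(rac);
		return;
	}

	/* ... build and submit the readahead requests as today ... */
}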
> I note that ->readahead isn't told if the read-ahead is async or not, so
> my patch will drop sync read-ahead on congestion, which the current code
> doesn't do.

Now that we have a readahead_control, it's simple to add that
information to it (rough sketch at the end of this mail).

> So maybe this congestion tracking really is useful, and we really want
> to keep it.
>
> I really would like to see that high-level documentation!!

I've done my best to add documentation.  There's more than before I
started.
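For the readahead_control point above, something like this is what I
mean -- only a sketch, and the 'async' field name is made up, not an
existing member:

struct readahead_control {
	struct file *file;
	struct address_space *mapping;
	struct file_ra_state *ra;
	bool async;	/* hypothetical: set by page_cache_async_ra(),
			 * clear for sync readahead */
	/* existing private fields (_index, _nr_pages, ...) unchanged */
};

Then fuse could make its congestion check apply only to the async case:

	if (rac->async && fc->num_background >= fc->congestion_threshold)
		return;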