On Thu, Jul 18, 2024 at 06:40:46PM +0000, Wengang Wang wrote: > > > > On Jul 15, 2024, at 5:56 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote: > > > > On Tue, Jul 09, 2024 at 12:10:27PM -0700, Wengang Wang wrote: > >> Reading ahead take less lock on file compared to "unshare" the file via ioctl. > >> Do readahead when defrag sleeps for better defrag performace and thus more > >> file IO time. > >> > >> Signed-off-by: Wengang Wang <wen.gang.wang@xxxxxxxxxx> > >> --- > >> spaceman/defrag.c | 21 ++++++++++++++++++++- > >> 1 file changed, 20 insertions(+), 1 deletion(-) > >> > >> diff --git a/spaceman/defrag.c b/spaceman/defrag.c > >> index 415fe9c2..ab8508bb 100644 > >> --- a/spaceman/defrag.c > >> +++ b/spaceman/defrag.c > >> @@ -331,6 +331,18 @@ defrag_fs_limit_hit(int fd) > >> } > >> > >> static bool g_enable_first_ext_share = true; > >> +static bool g_readahead = false; > >> + > >> +static void defrag_readahead(int defrag_fd, off64_t offset, size_t count) > >> +{ > >> + if (!g_readahead || g_idle_time <= 0) > >> + return; > >> + > >> + if (readahead(defrag_fd, offset, count) < 0) { > >> + fprintf(stderr, "readahead failed: %s, errno=%d\n", > >> + strerror(errno), errno); > > > > This doesn't do what you think it does. readahead() only queues the > > first readahead chunk of the range given (a few pages at most). It > > does not cause readahead on the entire range, wait for page cache > > population, nor report IO errors that might have occurred during > > readahead. > > Is it a bug? No. > As per the man page it should try to read _count_ bytes: No it doesn't. It says: > > DESCRIPTION > readahead() initiates readahead on a file ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ It says it -initiates- readahead. It doesn't mean it waits for readahead to complete or that it will readahead the whole range. It just starts readahead. > > There's almost no value to making this syscall, especially if the > > app is about to trigger a sequential read for the whole range. > > Readahead will occur naturally during that read operation (i.e. the > > UNSHARE copy), and the read will return IO errors unlike > > readahead(). > > > > If you want the page cache pre-populated before the unshare > > operation is done, then you need to use mmap() and > > madvise(MADV_POPULATE_READ). This will read the whole region into > > the page cache as if it was a sequential read, wait for it to > > complete and return any IO errors that might have occurred during > > the read. > > As you know in the unshare path, fetching data from disk is done when IO is locked. > (I am wondering if we can improve that.) Christoph pointed that out and some potential fixes back in the original discussion: https://lore.kernel.org/linux-xfs/ZXvQ0YDfHBuvLXbY@xxxxxxxxxxxxx/ > The main purpose of using readahead is that I want less (IO) lock time when fetching > data from disk. Can we achieve that by using mmap and madvise()? Maybe, but you're still adding complexity to userspace as a work around for a kernel issue we should be fixing. -Dave. -- Dave Chinner david@xxxxxxxxxxxxx