Re: [PATCH 8/9] spaceman/defrag: readahead for better performance

Dave Chinner <david@xxxxxxxxxxxxx> · Wed, 31 Jul 2024 13:10:25 +1000

On Thu, Jul 18, 2024 at 06:40:46PM +0000, Wengang Wang wrote:
> 
> 
> > On Jul 15, 2024, at 5:56 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > 
> > On Tue, Jul 09, 2024 at 12:10:27PM -0700, Wengang Wang wrote:
> >> Reading ahead take less lock on file compared to "unshare" the file via ioctl.
> >> Do readahead when defrag sleeps for better defrag performace and thus more
> >> file IO time.
> >> 
> >> Signed-off-by: Wengang Wang <wen.gang.wang@xxxxxxxxxx>
> >> ---
> >> spaceman/defrag.c | 21 ++++++++++++++++++++-
> >> 1 file changed, 20 insertions(+), 1 deletion(-)
> >> 
> >> diff --git a/spaceman/defrag.c b/spaceman/defrag.c
> >> index 415fe9c2..ab8508bb 100644
> >> --- a/spaceman/defrag.c
> >> +++ b/spaceman/defrag.c
> >> @@ -331,6 +331,18 @@ defrag_fs_limit_hit(int fd)
> >> }
> >> 
> >> static bool g_enable_first_ext_share = true;
> >> +static bool g_readahead = false;
> >> +
> >> +static void defrag_readahead(int defrag_fd, off64_t offset, size_t count)
> >> +{
> >> + if (!g_readahead || g_idle_time <= 0)
> >> + return;
> >> +
> >> + if (readahead(defrag_fd, offset, count) < 0) {
> >> + fprintf(stderr, "readahead failed: %s, errno=%d\n",
> >> + strerror(errno), errno);
> > 
> > This doesn't do what you think it does. readahead() only queues the
> > first readahead chunk of the range given (a few pages at most). It
> > does not cause readahead on the entire range, wait for page cache
> > population, nor report IO errors that might have occurred during
> > readahead.
> 
> Is it a bug?

No.

> As per the man page it should try to read _count_ bytes:

No it doesn't. It says:

> 
> DESCRIPTION
>        readahead() initiates readahead on a file
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

It says it -initiates- readahead. It doesn't mean it waits for
readahead to complete or that it will readahead the whole range.
It just starts readahead.

> > There's almost no value to making this syscall, especially if the
> > app is about to trigger a sequential read for the whole range.
> > Readahead will occur naturally during that read operation (i.e. the
> > UNSHARE copy), and the read will return IO errors unlike
> > readahead().
> > 
> > If you want the page cache pre-populated before the unshare
> > operation is done, then you need to use mmap() and
> > madvise(MADV_POPULATE_READ). This will read the whole region into
> > the page cache as if it was a sequential read, wait for it to
> > complete and return any IO errors that might have occurred during
> > the read.
> 
> As you know in the unshare path, fetching data from disk is done when IO is locked.
> (I am wondering if we can improve that.)

Christoph pointed that out and some potential fixes back in the
original discussion:

https://lore.kernel.org/linux-xfs/ZXvQ0YDfHBuvLXbY@xxxxxxxxxxxxx/

> The main purpose of using readahead is that I want less (IO) lock time when fetching
> data from disk. Can we achieve that by using mmap and madvise()?

Maybe, but you're still adding complexity to userspace as a work
around for a kernel issue we should be fixing.

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx