Re: [PATCH 8/9] spaceman/defrag: readahead for better performance

Wengang Wang <wen.gang.wang@xxxxxxxxxx> · Thu, 18 Jul 2024 18:40:46 +0000

> On Jul 15, 2024, at 5:56 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> 
> On Tue, Jul 09, 2024 at 12:10:27PM -0700, Wengang Wang wrote:
>> Reading ahead take less lock on file compared to "unshare" the file via ioctl.
>> Do readahead when defrag sleeps for better defrag performace and thus more
>> file IO time.
>> 
>> Signed-off-by: Wengang Wang <wen.gang.wang@xxxxxxxxxx>
>> ---
>> spaceman/defrag.c | 21 ++++++++++++++++++++-
>> 1 file changed, 20 insertions(+), 1 deletion(-)
>> 
>> diff --git a/spaceman/defrag.c b/spaceman/defrag.c
>> index 415fe9c2..ab8508bb 100644
>> --- a/spaceman/defrag.c
>> +++ b/spaceman/defrag.c
>> @@ -331,6 +331,18 @@ defrag_fs_limit_hit(int fd)
>> }
>> 
>> static bool g_enable_first_ext_share = true;
>> +static bool g_readahead = false;
>> +
>> +static void defrag_readahead(int defrag_fd, off64_t offset, size_t count)
>> +{
>> + if (!g_readahead || g_idle_time <= 0)
>> + return;
>> +
>> + if (readahead(defrag_fd, offset, count) < 0) {
>> + fprintf(stderr, "readahead failed: %s, errno=%d\n",
>> + strerror(errno), errno);
> 
> This doesn't do what you think it does. readahead() only queues the
> first readahead chunk of the range given (a few pages at most). It
> does not cause readahead on the entire range, wait for page cache
> population, nor report IO errors that might have occurred during
> readahead.

Is it a bug? As per the man page it should try to read _count_ bytes:

DESCRIPTION
       readahead() initiates readahead on a file so that subsequent reads from that file will be satisfied from the cache, and not block on disk I/O (assuming the readahead was initiated early enough and that
       other activity on the system did not in the meantime flush pages from the cache).

       The fd argument is a file descriptor identifying the file which is to be read.  The offset argument specifies the starting point from which data is to be read and count specifies the number of bytes to
       be read.  I/O is performed in whole pages, so that offset is effectively rounded down to a page boundary and bytes are read up to the next page boundary greater than or equal to (offset+count).  reada‐
       head() does not read beyond the end of the file.  The file offset of the open file description referred to by fd is left unchanged.

> 
> There's almost no value to making this syscall, especially if the
> app is about to trigger a sequential read for the whole range.
> Readahead will occur naturally during that read operation (i.e. the
> UNSHARE copy), and the read will return IO errors unlike
> readahead().
> 
> If you want the page cache pre-populated before the unshare
> operation is done, then you need to use mmap() and
> madvise(MADV_POPULATE_READ). This will read the whole region into
> the page cache as if it was a sequential read, wait for it to
> complete and return any IO errors that might have occurred during
> the read.

As you know in the unshare path, fetching data from disk is done when IO is locked.
(I am wondering if we can improve that.)
The main purpose of using readahead is that I want less (IO) lock time when fetching
data from disk. Can we achieve that by using mmap and madvise()?

Thanks,
Wengang