Re: [PATCH 8/9] spaceman/defrag: readahead for better performance

> On Jul 30, 2024, at 8:10 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> 
> On Thu, Jul 18, 2024 at 06:40:46PM +0000, Wengang Wang wrote:
>> 
>> 
>>> On Jul 15, 2024, at 5:56 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>>> 
>>> On Tue, Jul 09, 2024 at 12:10:27PM -0700, Wengang Wang wrote:
>>>> Reading ahead takes less lock time on the file compared to "unshare"-ing the
>>>> file via ioctl. Do readahead while defrag sleeps for better defrag performance
>>>> and thus more time for file IO.
>>>> 
>>>> Signed-off-by: Wengang Wang <wen.gang.wang@xxxxxxxxxx>
>>>> ---
>>>> spaceman/defrag.c | 21 ++++++++++++++++++++-
>>>> 1 file changed, 20 insertions(+), 1 deletion(-)
>>>> 
>>>> diff --git a/spaceman/defrag.c b/spaceman/defrag.c
>>>> index 415fe9c2..ab8508bb 100644
>>>> --- a/spaceman/defrag.c
>>>> +++ b/spaceman/defrag.c
>>>> @@ -331,6 +331,18 @@ defrag_fs_limit_hit(int fd)
>>>> }
>>>> 
>>>> static bool g_enable_first_ext_share = true;
>>>> +static bool g_readahead = false;
>>>> +
>>>> +static void defrag_readahead(int defrag_fd, off64_t offset, size_t count)
>>>> +{
>>>> +	if (!g_readahead || g_idle_time <= 0)
>>>> +		return;
>>>> +
>>>> +	if (readahead(defrag_fd, offset, count) < 0) {
>>>> +		fprintf(stderr, "readahead failed: %s, errno=%d\n",
>>>> +			strerror(errno), errno);
>>> 
>>> This doesn't do what you think it does. readahead() only queues the
>>> first readahead chunk of the range given (a few pages at most). It
>>> does not cause readahead on the entire range, wait for page cache
>>> population, nor report IO errors that might have occurred during
>>> readahead.
>> 
>> Is it a bug?
> 
> No.
> 
>> As per the man page it should try to read _count_ bytes:
> 
> No it doesn't. It says:
> 
>> 
>> DESCRIPTION
>>       readahead() initiates readahead on a file
>         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
> It says it -initiates- readahead. It doesn't mean it waits for
> readahead to complete or that it will readahead the whole range.
> It just starts readahead.
> 

I know it's an asynchronous operation. But given enough time, it should complete.

(Setting aside your point about completing the defrag as fast as possible.)
In our testing with a 16 MiB segment size limit, we saw the unshare take about
350 ms including the disk data reads, while we use the default sleep/idle time
of 250 ms. My expectation is that a lot of the block reads should complete
during that 250 ms sleep.
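
As a side note, one way to sanity-check that expectation would be something like
the hypothetical, untested diagnostic below (not part of the patch): after the
readahead() call and the idle sleep, map the range and ask mincore() how much of
it is actually resident in the page cache. It assumes the offset is page-aligned.

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical diagnostic: report how much of [offset, offset + count) is
 * resident in the page cache. Assumes offset is page-aligned. */
static void check_resident(int fd, off_t offset, size_t count)
{
	long pagesz = sysconf(_SC_PAGESIZE);
	size_t npages = (count + pagesz - 1) / pagesz;
	unsigned char *vec = malloc(npages);
	size_t i, resident = 0;
	void *addr;

	if (!vec)
		return;

	/* map read-only, only so we can query page cache residency */
	addr = mmap(NULL, npages * pagesz, PROT_READ, MAP_SHARED, fd, offset);
	if (addr == MAP_FAILED) {
		fprintf(stderr, "mmap failed: %s\n", strerror(errno));
		free(vec);
		return;
	}

	if (mincore(addr, npages * pagesz, vec) == 0) {
		for (i = 0; i < npages; i++)
			resident += vec[i] & 1;
		fprintf(stderr, "%zu of %zu pages resident\n",
			resident, npages);
	}

	munmap(addr, npages * pagesz);
	free(vec);
}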


>>> There's almost no value to making this syscall, especially if the
>>> app is about to trigger a sequential read for the whole range.
>>> Readahead will occur naturally during that read operation (i.e. the
>>> UNSHARE copy), and the read will return IO errors unlike
>>> readahead().
>>> 
>>> If you want the page cache pre-populated before the unshare
>>> operation is done, then you need to use mmap() and
>>> madvise(MADV_POPULATE_READ). This will read the whole region into
>>> the page cache as if it was a sequential read, wait for it to
>>> complete and return any IO errors that might have occurred during
>>> the read.
>> 
>> As you know, in the unshare path, fetching data from disk is done while the IO lock is held.
>> (I am wondering if we can improve that.)
> 
> Christoph pointed that out and some potential fixes back in the
> original discussion:
> 
> https://lore.kernel.org/linux-xfs/ZXvQ0YDfHBuvLXbY@xxxxxxxxxxxxx/

Yes, I read that. Thanks to Christoph for that.

> 
>> The main purpose of using readahead is that I want less (IO) lock time when fetching
>> data from disk. Can we achieve that by using mmap and madvise()?
> 
> Maybe, but you're still adding complexity to userspace as a work
> around for a kernel issue we should be fixing.
> 

Yes, true. But in the case where the kernel doesn't have the related fixes and we still have
to use defrag, working around it is the way to go, right?
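
For what it's worth, the mmap() + madvise(MADV_POPULATE_READ) pre-population would
look roughly like the sketch below (the function name is mine, MADV_POPULATE_READ
needs Linux 5.14+, and it assumes a page-aligned offset):

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* Rough sketch, not tested: read the whole range into the page cache as a
 * sequential read, waiting for completion and getting IO errors back,
 * unlike readahead(). Assumes MADV_POPULATE_READ (Linux 5.14+) and a
 * page-aligned offset. */
static int defrag_populate_read(int fd, off_t offset, size_t count)
{
	void *addr;
	int ret = 0;

	addr = mmap(NULL, count, PROT_READ, MAP_SHARED, fd, offset);
	if (addr == MAP_FAILED)
		return -errno;

	if (madvise(addr, count, MADV_POPULATE_READ) < 0) {
		ret = -errno;
		fprintf(stderr, "madvise(MADV_POPULATE_READ) failed: %s\n",
			strerror(errno));
	}

	munmap(addr, count);
	return ret;
}

That would at least move the cold reads out of the locked unshare path without
depending on readahead() covering the whole range.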

Thanks,
Wengang





