On Tue, Feb 19, 2019 at 02:53:02PM +0100, Jan Kara wrote:
> Hi!
>
> On Tue 19-02-19 15:50:23, stummala@xxxxxxxxxxxxxx wrote:
> > I am observing huge fsync latencies for a small file under the below
> > test scenario -
> >
> > process A -
> > Issue an async write of 4GB using dd (say, large_file) on /data
> > mounted with ext4:
> > dd if=/dev/zero of=/data/testfile bs=1M count=4096
> >
> > process B -
> > In parallel, another process writes 4KB of data to another file
> > (say, small_file) and issues fsync() on it.
> >
> > Problem -
> > The fsync() on the 4KB file takes up to ~30 sec (worst-case latency).
> > This was tested on an eMMC-based device.
> >
> > Observations -
> > This happens when small_file and large_file are both part of the same
> > committing transaction, or when small_file is part of the running
> > transaction while large_file is part of the committing transaction.
> >
> > During the commit of a transaction which includes large_file, the
> > jbd2 thread runs journal_finish_inode_data_buffers(), which calls
> > filemap_fdatawait_keep_errors() on the file's inode address space.
> > If the writeback thread is running in parallel for large_file while
> > this happens, filemap_fdatawait_keep_errors() can end up looping over
> > all the pages (up to 4GB of data) and waiting for all of the file's
> > data to be written to disk, in the current transaction's context. At
> > the time journal_finish_inode_data_buffers() is called, the file is
> > only 150MB; by the time filemap_fdatawait_keep_errors() returns, the
> > file is 4GB, and the page index in __filemap_fdatawait_range() points
> > at the 4GB file offset, indicating that it has scanned, and waited
> > for writeback of, all the pages up to 4GB and not just 150MB.
>
> Thanks for the detailed analysis! I'm somewhat surprised that the
> flusher is able to submit new batches of pages for writeback faster
> than __filemap_fdatawait_range() is scanning the radix tree, but it is
> certainly a possibility.
>
> > Ideally, I think the jbd2 thread should wait only for the amount of
> > data it has submitted as part of the current transaction, and not
> > for pages that are being tagged for writeback in parallel in another
> > context. So along these lines, I have tried using the inode's size
> > at the time journal_finish_inode_data_buffers() is called, as below -
>
> One has to be really careful when using i_size like this. By the time
> the transaction is committing, i_size could have been reduced from the
> value at the time page writeback was issued, and that change will be
> journalled only in the following transaction. So if the system crashes
> at the wrong moment, the user could see uninitialized blocks between
> new_size and old_size after journal replay. So I don't think your
> patch is really correct.

Thanks Jan for the clarification on the patch. I agree with your
comments.
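That said, if the wait has to stay in the commit path for now, one
direction that might avoid the i_size pitfall would be to record the
byte range at the time the pages are submitted for the committing
transaction, and have the commit wait only on that range. A rough,
untested sketch of what I mean (the i_dirty_start/i_dirty_end fields
and the filemap_fdatawait_range_keep_errors() helper below are
hypothetical; plain filemap_fdatawait_range() does exist, but it clears
the mapping error flags that fsync() still needs to see):

diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ ... @@ static int journal_finish_inode_data_buffers(journal_t *journal,
 			continue;
 		jinode->i_flags |= JI_COMMIT_RUNNING;
 		spin_unlock(&journal->j_list_lock);
-		err = filemap_fdatawait_keep_errors(
-				jinode->i_vfs_inode->i_mapping);
+		/*
+		 * Wait only on the byte range that was submitted under
+		 * this transaction.  The range is snapshotted when the
+		 * pages are submitted for writeback, so a racing truncate
+		 * cannot shrink it below what this transaction's
+		 * ordered-data guarantee covers, unlike an i_size value
+		 * read at commit time.
+		 */
+		err = filemap_fdatawait_range_keep_errors(
+				jinode->i_vfs_inode->i_mapping,
+				jinode->i_dirty_start,
+				jinode->i_dirty_end);
 		if (!ret)
 			ret = err;
 		spin_lock(&journal->j_list_lock);

This would also bound the wait to roughly the ~150MB actually submitted
under the committing transaction instead of chasing the flusher out to
4GB, although the submission side (the jbd2_journal_inode_add_write()
path) would have to be taught to record the range, so it is more
invasive than the i_size-based change quoted below.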
From the discussion in [1] below, I think the problem it addresses is
the journal thread having to wait for ongoing updates of the active
transaction to finish, and thus incurring commit latencies. And I think
the proposal is to not hold any handle while extents are being mapped
in ext4_map_blocks(), but to defer that until the IO is completely
done.

And with the new proposal, since the inode will be added to
transaction->t_inode_list only after the IO is completed, there will no
longer be a need to do journal_finish_inode_data_buffers() in the
journal context, so this problem would also no longer be observed? Is
my understanding correct? Please clarify.

> Ted has outlined a plan for how to get rid of the data=ordered
> limitations [1], and thus also this problem. It is quite some work but
> you're certainly welcome to help out :)
>
> 								Honza
>
> [1] https://www.spinics.net/lists/linux-ext4/msg64175.html
>
> > diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> > index 2eb55c3..e86cf67 100644
> > --- a/fs/jbd2/commit.c
> > +++ b/fs/jbd2/commit.c
> > @@ -261,8 +261,8 @@ static int journal_finish_inode_data_buffers(journal_t *journal,
> >  			continue;
> >  		jinode->i_flags |= JI_COMMIT_RUNNING;
> >  		spin_unlock(&journal->j_list_lock);
> > -		err = filemap_fdatawait_keep_errors(
> > -				jinode->i_vfs_inode->i_mapping);
> > +		err = filemap_fdatawait_range(jinode->i_vfs_inode->i_mapping,
> > +				0, i_size_read(jinode->i_vfs_inode->i_mapping->host));
> >  		if (!ret)
> >  			ret = err;
> >  		spin_lock(&journal->j_list_lock);
> >
> > With this, the fsync latencies for small_file reduced significantly,
> > to at most ~5 sec (worst-case latency).
> >
> > Although this was seen with test code, it could potentially impact a
> > phone's performance if any application or the main UI thread in
> > Android issues fsync() in the foreground while a large data transfer
> > is going on in another context.
> >
> > Request you to share your thoughts and comments on this issue
> > and the fix suggested above.
> >
> > Thanks,
> > Sahitya.
> >
> > --
> > Qualcomm India Private Limited, on behalf of Qualcomm Innovation
> > Center, Inc.
> > Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
> > a Linux Foundation Collaborative Project.
> --
> Jan Kara <jack@xxxxxxxx>
> SUSE Labs, CR

--
Sent by a consultant of the Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.