On Tue, Feb 19, 2019 at 02:53:02PM +0100, Jan Kara wrote:
> Hi!
>
> On Tue 19-02-19 15:50:23, stummala@xxxxxxxxxxxxxx wrote:
> > I am observing huge fsync latencies for a small file under the below
> > test scenario -
> >
> > process A -
> > Issue an async write of 4GB using dd (say, large_file) on /data
> > mounted with ext4:
> > dd if=/dev/zero of=/data/testfile bs=1M count=4096
> >
> > process B -
> > In parallel, another process writes 4KB of data to another file
> > (say, small_file) and issues fsync() on it.
> >
> > Problem -
> > The fsync() on the 4KB file takes up to ~30 sec (worst-case latency).
> > This was tested on an eMMC-based device.
> >
> > Observations -
> > This happens when small_file and large_file are both part of the same
> > committing transaction, or when small_file is part of the running
> > transaction while large_file is part of the committing transaction.
> >
> > During the commit of a transaction which includes large_file, the
> > jbd2 thread runs journal_finish_inode_data_buffers(), which calls
> > filemap_fdatawait_keep_errors() on the file's inode address space.
> > If the writeback thread is running in parallel for large_file while
> > this happens, filemap_fdatawait_keep_errors() can end up looping over
> > all the pages (up to 4GB of data) and waiting for all of the file's
> > data to be written to disk, in the current transaction's context. At
> > the time journal_finish_inode_data_buffers() is called, the file is
> > only 150MB; by the time filemap_fdatawait_keep_errors() returns, the
> > file is 4GB, and the page index in __filemap_fdatawait_range() points
> > at the 4GB file offset, indicating that it has scanned, and waited
> > for writeback of, all the pages up to 4GB and not just 150MB.
>
> Thanks for the detailed analysis! I'm somewhat surprised that the
> flusher is able to submit new batches of pages for writeback faster
> than __filemap_fdatawait_range() is scanning the radix tree, but it is
> certainly a possibility.
>
> > Ideally, I think the jbd2 thread should wait only for the amount of
> > data it has submitted as part of the current transaction, and not
> > for pages that are being tagged for writeback in parallel in another
> > context. So along these lines, I have tried using the inode's size
> > at the time journal_finish_inode_data_buffers() is called, as below -
>
> One has to be really careful when using i_size like this. By the time
> the transaction is committing, i_size could have been reduced from the
> value at the time page writeback was issued, and that change will be
> journalled only in the following transaction. So if the system crashes
> at the wrong moment, the user could see uninitialized blocks between
> new_size and old_size after journal replay. So I don't think your
> patch is really correct.

Thanks Jan for the clarification on the patch. I agree with your
comments.
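That said, if the wait has to stay in the commit path for now, one
direction that might avoid the i_size pitfall would be to record the
byte range at the time the pages are submitted for the committing
transaction, and have the commit wait only on that range. A rough,
untested sketch of what I mean (the i_dirty_start/i_dirty_end fields
and the filemap_fdatawait_range_keep_errors() helper below are
hypothetical; plain filemap_fdatawait_range() does exist, but it clears
the mapping error flags that fsync() still needs to see):

diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ ... @@ static int journal_finish_inode_data_buffers(journal_t *journal,
 			continue;
 		jinode->i_flags |= JI_COMMIT_RUNNING;
 		spin_unlock(&journal->j_list_lock);
-		err = filemap_fdatawait_keep_errors(
-				jinode->i_vfs_inode->i_mapping);
+		/*
+		 * Wait only on the byte range that was submitted under
+		 * this transaction.  The range is snapshotted when the
+		 * pages are submitted for writeback, so a racing truncate
+		 * cannot shrink it below what this transaction's
+		 * ordered-data guarantee covers, unlike an i_size value
+		 * read at commit time.
+		 */
+		err = filemap_fdatawait_range_keep_errors(
+				jinode->i_vfs_inode->i_mapping,
+				jinode->i_dirty_start,
+				jinode->i_dirty_end);
 		if (!ret)
 			ret = err;
 		spin_lock(&journal->j_list_lock);

This would also bound the wait to roughly the ~150MB actually submitted
under the committing transaction instead of chasing the flusher out to
4GB, although the submission side (the jbd2_journal_inode_add_write()
path) would have to be taught to record the range, so it is more
invasive than the i_size-based change quoted below.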
From the discussion in [1] below, I think the problem it addresses is
the journal thread having to wait for ongoing updates of the active
transaction to finish, and thus incurring commit latencies. And I think
the proposal is to not hold any handle while extents are being mapped
in ext4_map_blocks(), but to defer that until the IO is completely
done.

And with the new proposal, since the inode will be added to
transaction->t_inode_list only after the IO is completed, there will no
longer be a need to do journal_finish_inode_data_buffers() in the
journal context, so this problem would also no longer be observed? Is
my understanding correct? Please clarify.

> Ted has outlined a plan for how to get rid of the data=ordered
> limitations [1], and thus also this problem. It is quite some work but
> you're certainly welcome to help out :)
>
> 								Honza
>
> [1] https://www.spinics.net/lists/linux-ext4/msg64175.html
>
> > diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> > index 2eb55c3..e86cf67 100644
> > --- a/fs/jbd2/commit.c
> > +++ b/fs/jbd2/commit.c
> > @@ -261,8 +261,8 @@ static int journal_finish_inode_data_buffers(journal_t *journal,
> >  			continue;
> >  		jinode->i_flags |= JI_COMMIT_RUNNING;
> >  		spin_unlock(&journal->j_list_lock);
> > -		err = filemap_fdatawait_keep_errors(
> > -				jinode->i_vfs_inode->i_mapping);
> > +		err = filemap_fdatawait_range(jinode->i_vfs_inode->i_mapping,
> > +				0, i_size_read(jinode->i_vfs_inode->i_mapping->host));
> >  		if (!ret)
> >  			ret = err;
> >  		spin_lock(&journal->j_list_lock);
> >
> > With this, the fsync latencies for small_file reduced significantly,
> > to at most ~5 sec (worst-case latency).
> >
> > Although this was seen with test code, it could potentially impact a
> > phone's performance if any application or the main UI thread in
> > Android issues fsync() in the foreground while a large data transfer
> > is going on in another context.
> >
> > Request you to share your thoughts and comments on this issue
> > and the fix suggested above.
> >
> > Thanks,
> > Sahitya.
> >
> > --
> > Qualcomm India Private Limited, on behalf of Qualcomm Innovation
> > Center, Inc.
> > Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
> > a Linux Foundation Collaborative Project.
> --
> Jan Kara <jack@xxxxxxxx>
> SUSE Labs, CR

--
Sent by a consultant of the Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.