RE: ext4 out of order when use cfq scheduler

"HUANG Weller (CM/ESW12-CN)" <Weller.Huang@xxxxxxxxxxxx> · Fri, 8 Jan 2016 02:18:40 +0000

> -----Original Message-----
> From: Jan Kara [mailto:jack@xxxxxxx]
> Sent: Thursday, January 07, 2016 8:19 PM
> To: HUANG Weller (CM/ESW12-CN) <Weller.Huang@xxxxxxxxxxxx>
> Cc: Jan Kara <jack@xxxxxxx>; linux-ext4@xxxxxxxxxxxxxxx; Li, Michael
> <huayil@xxxxxxxxxxxxxxxx>
> Subject: Re: ext4 out of order when use cfq scheduler
> 
> On Thu 07-01-16 12:47:36, Jan Kara wrote:
> > On Thu 07-01-16 11:02:29, HUANG Weller (CM/ESW12-CN) wrote:
> > > > -----Original Message-----
> > > > From: Jan Kara [mailto:jack@xxxxxxx]
> > > > Sent: Thursday, January 07, 2016 6:24 PM
> > > > To: HUANG Weller (CM/ESW12-CN) <Weller.Huang@xxxxxxxxxxxx>
> > > > Cc: Jan Kara <jack@xxxxxxx>; linux-ext4@xxxxxxxxxxxxxxx
> > > > Subject: Re: ext4 out of order when use cfq scheduler
> > > >
> > > > On Thu 07-01-16 06:43:00, HUANG Weller (CM/ESW12-CN) wrote:
> > > > > > -----Original Message-----
> > > > > > From: Jan Kara [mailto:jack@xxxxxxx]
> > > > > > Sent: Wednesday, January 06, 2016 6:06 PM
> > > > > > To: HUANG Weller (CM/ESW12-CN) <Weller.Huang@xxxxxxxxxxxx>
> > > > > > Subject: Re: ext4 out of order when use cfq scheduler
> > > > > >
> > > > > > On Wed 06-01-16 02:39:15, HUANG Weller (CM/ESW12-CN) wrote:
> > > > > > > > So you are running in 'ws' mode of your tool, am I right?
> > > > > > > > Just looking into the sources you've sent me I've noticed
> > > > > > > > that although you set O_SYNC in openflg when mode ==
> > > > > > > > MODE_WS, you do not use openflg at all. So file won't be
> > > > > > > > synced at all. That would well explain why you see that
> > > > > > > > not all file contents is written. So did you just send me
> > > > > > > > a different version of the source or is your test program
> > > > > > really buggy?
> > > > > > > >
> > > > > > >
> > > > > > > Yes, it is a bug of the test code. So the test tool create
> > > > > > > files without O_SYNC flag actually.  But , even in this
> > > > > > > case, is the out of order acceptable ? or is it normal ?
> > > > > >
> > > > > > Without fsync(2) or O_SYNC, it is perfectly possible that some
> > > > > > files are written and others are not since nobody guarantees
> > > > > > order of writeback of inodes. OTOH you shouldn't ever see
> > > > > > uninitialized data in the inode (but so far it isn't clear to
> > > > > > me whether you really see unitialized data or whether we
> > > > > > really wrote zeros to those blocks -
> > > > > > ext4 can sometimes decide to do so).  Your traces and disk
> > > > > > contents show that the problematic inode has extent of length
> > > > > > 128 blocks starting at block
> > > > > > 0x12c00 and then extent of lenght 1 block starting at block 0x1268e.
> > > > > > What is the block size of the filesystem?  Because inode size is only
> 0x40010.
> > > > > >
> > > > > > Some suggestions to try:
> > > > > > 1) Print also length of a write request in addition to the
> > > > > > starting block so that we can see how much actually got
> > > > > > written
> > > > >
> > > > > Please see below failure analysis.
> > > > >
> > > > > > 2) Initialize the device to 0xff so that we can distinguish
> > > > > > uninitialized blocks from zeroed-out blocks.
> > > > >
> > > > > Yes, i Initialize the device to 0xff this time.
> > > > >
> > > > > > 3) Report exactly for which 512-byte blocks checksum matches
> > > > > > and for which it is wrong.
> > > > > The wrong contents are old file contents which are created in
> > > > > previous test round.  It is caused by the "wrong" sequence inode
> > > > > data(in
> > > > > journal) and  the file contents. So the file contents are not updated.
> > > >
> > > > So this confuses me somewhat. You previously said that you always
> > > > remove files after each test round and then new ones are created.
> > > > Is it still the case? So the old file contents you speak about
> > > > above is just some random contents that happened to be in disk blocks we
> freshly allocated to the file, am I right?
> > >
> > > Yes. You are right.
> > >  The "old file contents" means that the disk blocks which the contents is
> generated from last test round, and they are allocated to a new file in new test
> round.
> > >
> > >
> > > >
> > > > OK, so I was looking into the code and indeed, reality is correct
> > > > and my mental model was wrong! ;) I thought that inode gets added
> > > > to the list of inodes for which we need to wait for data IO
> > > > completion during transaction commit during block allocation. And
> > > > I was wrong. It used to happen in
> > > > mpage_da_map_and_submit() until commit f3b59291a69d (ext4: remove
> > > > calls to
> > > > ext4_jbd2_file_inode() from delalloc write path) where it got
> > > > removed. And that was wrong because although we submit data writes
> > > > before dropping handle for allocating transaction and updating
> > > > i_size, nobody guarantees that data IO is not delayed in the block layer until
> transaction commit.
> > > > Which seems to happen in your case. I'll send a fix. Thanks for
> > > > your report and persistence!
> > > >
> > >
> > > Thanks a lot for your feedback :-)
> > > Because I am not familiar with the detail of the ext4 internal code.  I will try to
> understand your explanation which you describe above.  And have a look on
> related funcations.
> > > Could you send the fix in this mail ?
> > > And whether the kernel 3.14 also have such issue, right ?
> >
> > The problem is in all kernels starting with 3.8. Attached is a patch
> > which should fix the issue. Can you test whether it fixes the problem for you?
> 
> Oh, I have realized the patch is on top of current ext4 development tree and it
> won't compile for current vanilla kernel because of EXT4_GET_BLOCKS_ZERO
> check. Just remove that line when  you get compilation failure.
> 
> > +		if (map->m_flags & EXT4_MAP_NEW &&
> > +		    !(map->m_flags & EXT4_MAP_UNWRITTEN) &&
> > +		    !(flags & EXT4_GET_BLOCKS_ZERO) &&
> 
> Just remove the above line and things should work for older kernels as well.
> 
> > +		    ext4_should_order_data(inode)) {
> > +			ret = ext4_jbd2_file_inode(handle, inode);
> > +			if (ret)
> > +				return ret;
> > +		}
> >  	}
> >  	return retval;
> >  }
> 

Just confirmed with you because the patch tool didn't found:

"out_sem:
 		ret = check_block_validity(inode, map);" in my kernel.

after checking the code, I add the modification to the end of function : ext4_map_blocks
below is the diff. please help to double confirm.

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 10b71e4..d29a1d2 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -753,6 +753,10 @@ has_zeroout:
                int ret = check_block_validity(inode, map);
                if (ret != 0)
                        return ret;
+               if(ext4_should_order_data(inode)) {
+                       ret = ext4_jbd2_file_inode(handle, inode);
+               if (ret)
+                       return ret;
        }
        return retval;
 }
@@ -1113,15 +1117,6 @@ static int ext4_write_end(struct file *file,
        int i_size_changed = 0;

        trace_ext4_write_end(inode, pos, len, copied);
-       if (ext4_test_inode_state(inode, EXT4_STATE_ORDERED_MODE)) {
-               ret = ext4_jbd2_file_inode(handle, inode);
-               if (ret) {
-                       unlock_page(page);
-                       page_cache_release(page);
-                       goto errout;
-               }
-       }
-
        if (ext4_has_inline_data(inode)) {
                ret = ext4_write_inline_data_end(inode, pos, len,
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html