Re: [PATCH] mke2fs: add make_hugefile feature

"Darrick J. Wong" <darrick.wong@xxxxxxxxxx> · Thu, 23 Jan 2014 17:37:21 -0800

On Tue, Jan 21, 2014 at 04:39:43PM -0500, Theodore Ts'o wrote:
> On Tue, Jan 21, 2014 at 11:45:17AM -0700, Andreas Dilger wrote:
> > > Then "mke2fs -T hugefile /dev/sdXX" will create as many 1G files
> > > as needed to fill the file system.
> > 
> > How is this different from using fallocate to allocate the files?
> 
> There are a couple of differences.  One is that currently using
> fallocate to allocate the file results in an embarrassingly bad extent
> tree:
> 
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..    2047:      34816..     36863:   2048:             unwritten
>    1:     2048..    4095:      36864..     38911:   2048:             unwritten
>    2:     4096..    6143:      38912..     40959:   2048:             unwritten
>    3:     6144..    8191:      40960..     43007:   2048:             unwritten
>    4:     8192..   10239:      43008..     45055:   2048:             unwritten
>    5:    10240..   12287:      45056..     47103:   2048:             unwritten
>    6:    12288..   14335:      47104..     49151:   2048:             unwritten
> ....
> 
> (This came from running "fallocate -o 0 -l 512M /mnt/foo" on a
> freshly formatted file system, running Linux 3.12.)
> 
> Compare and contrast that with what "mke2fs -T hugefile /tmp/foo.img 1G"
> creates:
> 
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..   32767:      24904..     57671:  32768:            
>    1:    32768..   65535:      57672..     90439:  32768:            
>    2:    65536..   98303:      90440..    123207:  32768:            
>    3:    98304..  131071:     123208..    155975:  32768:            
> 
> This is a bug in how fallocate and mballoc are working together that
> we should fix, of course. :-) And come to think of it, I'm really
> surprised that the extent merging code isn't papering over the fact
> that mballoc is only handing back block allocations 2048 blocks at a
> time.
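
To put rough numbers on the two listings above, here is a small sketch that
counts how many extents a file needs for a given per-extent cap.  The helper
name extents_needed() is made up for illustration.  The 131072-block figure is
the range covered by the second listing (0..131071); 32768 and 32767 are the
caps for initialized and uninitialized extents respectively.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of extents needed to map nblocks
 * contiguous blocks when no single extent may exceed max_per_extent.
 * This is a plain ceiling division, nothing ext4-specific.
 */
static uint64_t extents_needed(uint64_t nblocks, uint64_t max_per_extent)
{
	assert(max_per_extent > 0);
	return (nblocks + max_per_extent - 1) / max_per_extent;
}
```

With 131072 blocks this gives 64 extents at 2048 blocks each, 4 at 32768, and
5 at 32767, so even perfectly merged unwritten extents would need one more
extent than the initialized layout.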

Does the following still apply for why ext4_can_extents_be_merged() refuses to
allow uninit extents to be merged?

"Make sure that both extents are initialized. We don't merge
uninitialized extents so that we can be sure that end_io code has
the extent that was written properly split out and conversion to
initialized is trivial."

I removed the bits that prevent successful merging of uninit extents and each
2048 block allocation was (sometimes) appended to the previous extent, but I
didn't check against conversion races.  I'll include the patch at the foot.
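
For anyone following along, here is a rough userspace model of the length
encoding that makes the extra ext4_ext_mark_uninitialized() calls in the patch
necessary.  The toy_* names and the struct are invented for illustration and
are not kernel code; the two constants mirror EXT_INIT_MAX_LEN and
EXT_UNINIT_MAX_LEN.  It deliberately leaves out the DIO and i_unwritten
checks, so treat it as a sketch of the idea only.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_INIT_MAX_LEN   (1U << 15)              /* mirrors EXT_INIT_MAX_LEN */
#define TOY_UNINIT_MAX_LEN (TOY_INIT_MAX_LEN - 1)  /* mirrors EXT_UNINIT_MAX_LEN */

/* Toy stand-in for struct ext4_extent; len is the raw 16-bit field. */
struct toy_extent {
	uint32_t lblk;   /* first logical block */
	uint64_t pblk;   /* first physical block */
	uint16_t len;    /* raw length; values above 32768 mean "uninitialized" */
};

static int toy_is_uninit(const struct toy_extent *ex)
{
	return ex->len > TOY_INIT_MAX_LEN;
}

static unsigned int toy_actual_len(const struct toy_extent *ex)
{
	return ex->len <= TOY_INIT_MAX_LEN ? ex->len
					   : ex->len - TOY_INIT_MAX_LEN;
}

static void toy_mark_uninit(struct toy_extent *ex)
{
	ex->len = (uint16_t)(ex->len | TOY_INIT_MAX_LEN);
}

/* Both extents must share the same state and be adjacent on both axes. */
static int toy_can_merge(const struct toy_extent *a, const struct toy_extent *b)
{
	unsigned int la = toy_actual_len(a), lb = toy_actual_len(b);

	if (toy_is_uninit(a) != toy_is_uninit(b))
		return 0;
	if (a->lblk + la != b->lblk)
		return 0;
	if (la + lb > TOY_INIT_MAX_LEN)
		return 0;
	if (toy_is_uninit(a) && la + lb > TOY_UNINIT_MAX_LEN)
		return 0;
	if (a->pblk + la != b->pblk)
		return 0;
	return 1;
}

/* Merge b into a.  Returns 1 on success, 0 if the extents cannot merge. */
static int toy_merge_right(struct toy_extent *a, const struct toy_extent *b)
{
	unsigned int sum;
	int uninit;

	if (!toy_can_merge(a, b))
		return 0;
	uninit = toy_is_uninit(a);
	sum = toy_actual_len(a) + toy_actual_len(b);
	assert(!uninit || sum <= TOY_UNINIT_MAX_LEN);
	a->len = (uint16_t)sum;     /* storing the plain sum drops the flag... */
	if (uninit)
		toy_mark_uninit(a); /* ...so it has to be put back */
	return 1;
}
```

Feeding it the first two unwritten extents from the fallocate listing (2048
blocks at 34816 and at 36864) yields one unwritten extent of 4096 blocks,
while a following initialized extent is refused.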

> The other difference is the obvious one from the filefrag output,
> which is the data blocks are marked as initialized, instead of
> unwritten.  Yes, this brings up the whole controversy over the
> NO_HIDE_STALE flag, but if you are creating the fresh file system, the
> security issues are hopefully not as severe --- and I will eventually add
> support for zero'ing the files, or using discard to zero the data
> blocks, even if at work we really don't care about this because we
> trust the userspace programs that would be using these huge files.

It wouldn't be difficult to have some flags to mark the extent uninit and/or
zero the blocks.  Certainly mke2fs could just zero everything to make life
easier.

> Finally, to help eventually support userspace SMR-aware
> applications, one reason why it's useful to have mke2fs support
> creating the huge file is that it's much easier to make sure the file
> is appropriately aligned to begin at an SMR zone boundary.  This is not
> something we currently have any kernel/userspace interfaces to do, in
> terms of telling fallocate that you want to constrain the starting
> block number for the data blocks that you are asking it to
> fallocate(2) for you.

That seems like it would be useful...
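
The alignment arithmetic itself is simple; a minimal sketch follows.  The
function names and the idea of expressing the zone size in file system blocks
are my own assumptions here, not anything mke2fs or the kernel exposes.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: round a starting block number up to the next
 * zone boundary.  zone_blocks is the zone size in file system blocks
 * (an assumed unit; e.g. 65536 for 256 MiB zones with 4 KiB blocks).
 */
static uint64_t zone_align_up(uint64_t blk, uint64_t zone_blocks)
{
	uint64_t rem;

	assert(zone_blocks > 0);
	rem = blk % zone_blocks;
	return rem ? blk + (zone_blocks - rem) : blk;
}

/* Blocks skipped as padding in order to reach that boundary. */
static uint64_t zone_pad(uint64_t blk, uint64_t zone_blocks)
{
	return zone_align_up(blk, zone_blocks) - blk;
}
```

For example, a file that would otherwise start at block 24904 moves to block
65536 with 65536-block zones, leaving 40632 blocks of padding in front of it.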

> > Is this just to create a test image for e2fsck or similar?
> 
> It is certainly useful for that, but the mk_hugefiles feature is one
> that I expect we would be using on production systems.
> 
> It is definitely the case that writing this code has exposed all sorts
> of interesting bugs and performance shortcomings in libext2fs and
> e2fsprogs in general, so just creating this functionality as part of
> mke2fs was certainly a useful exercise in and of itself.  :-)
> 
> >  It might make sense to include f_hugefiles/script and expect.1 for it?
> 
> Oh, certainly.  This patch was much more of an RFC than anything else.
> And as I said, I'm still trying to figure out whether or not it makes
> sense to push this code upstream, or leave it as a Google internal
> enhancement.

<shrug> fuse2fs would use it, but I don't know that anyone cares about fuse2fs.

Well, here's a patch for all to enjoy.  xfstests didn't blow up when I ran it.

--D
From: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
Subject: [PATCH] ext4: merge uninitialized extents

Allow for merging uninitialized extents.

Signed-off-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
---
 fs/ext4/extents.c |   21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 3384dc4..7f0132d 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -1691,7 +1691,7 @@ ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
 	 * the extent that was written properly split out and conversion to
 	 * initialized is trivial.
 	 */
-	if (ext4_ext_is_uninitialized(ex1) || ext4_ext_is_uninitialized(ex2))
+	if (ext4_ext_is_uninitialized(ex1) != ext4_ext_is_uninitialized(ex2))
 		return 0;
 
 	ext1_ee_len = ext4_ext_get_actual_len(ex1);
@@ -1708,6 +1708,11 @@ ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
 	 */
 	if (ext1_ee_len + ext2_ee_len > EXT_INIT_MAX_LEN)
 		return 0;
+	if (ext4_ext_is_uninitialized(ex1) &&
+	    (ext4_test_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN) ||
+	     atomic_read(&EXT4_I(inode)->i_unwritten) ||
+	     (ext1_ee_len + ext2_ee_len > EXT_UNINIT_MAX_LEN)))
+		return 0;
 #ifdef AGGRESSIVE_TEST
 	if (ext1_ee_len >= 4)
 		return 0;
@@ -1731,7 +1736,7 @@ static int ext4_ext_try_to_merge_right(struct inode *inode,
 {
 	struct ext4_extent_header *eh;
 	unsigned int depth, len;
-	int merge_done = 0;
+	int merge_done = 0, uninit;
 
 	depth = ext_depth(inode);
 	BUG_ON(path[depth].p_hdr == NULL);
@@ -1741,8 +1746,11 @@ static int ext4_ext_try_to_merge_right(struct inode *inode,
 		if (!ext4_can_extents_be_merged(inode, ex, ex + 1))
 			break;
 		/* merge with next extent! */
+		uninit = ext4_ext_is_uninitialized(ex);
 		ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex)
 				+ ext4_ext_get_actual_len(ex + 1));
+		if (uninit)
+			ext4_ext_mark_uninitialized(ex);
 
 		if (ex + 1 < EXT_LAST_EXTENT(eh)) {
 			len = (EXT_LAST_EXTENT(eh) - ex - 1)
@@ -1896,7 +1904,7 @@ int ext4_ext_insert_extent(handle_t *handle, struct inode *inode,
 	struct ext4_ext_path *npath = NULL;
 	int depth, len, err;
 	ext4_lblk_t next;
-	int mb_flags = 0;
+	int mb_flags = 0, uninit;
 
 	if (unlikely(ext4_ext_get_actual_len(newext) == 0)) {
 		EXT4_ERROR_INODE(inode, "ext4_ext_get_actual_len(newext) == 0");
@@ -1946,9 +1954,11 @@ int ext4_ext_insert_extent(handle_t *handle, struct inode *inode,
 						  path + depth);
 			if (err)
 				return err;
-
+			uninit = ext4_ext_is_uninitialized(ex);
 			ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex)
 					+ ext4_ext_get_actual_len(newext));
+			if (uninit)
+				ext4_ext_mark_uninitialized(ex);
 			eh = path[depth].p_hdr;
 			nearex = ex;
 			goto merge;
@@ -1971,10 +1981,13 @@ prepend:
 			if (err)
 				return err;
 
+			uninit = ext4_ext_is_uninitialized(ex);
 			ex->ee_block = newext->ee_block;
 			ext4_ext_store_pblock(ex, ext4_ext_pblock(newext));
 			ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex)
 					+ ext4_ext_get_actual_len(newext));
+			if (uninit)
+				ext4_ext_mark_uninitialized(ex);
 			eh = path[depth].p_hdr;
 			nearex = ex;
 			goto merge;