On Tue, Jan 21, 2014 at 04:39:43PM -0500, Theodore Ts'o wrote:
> On Tue, Jan 21, 2014 at 11:45:17AM -0700, Andreas Dilger wrote:
> > > Then "mke2fs -T hugefile /dev/sdXX" will create as many 1G files
> > > needed to fill the file system.
> >
> > How is this different from using fallocate to allocate the files?
>
> There are a couple of differences.  One is that currently using
> fallocate to allocate the file results in an embarrassingly bad extent
> tree:
>
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..    2047:      34816..     36863:   2048:             unwritten
>    1:     2048..    4095:      36864..     38911:   2048:             unwritten
>    2:     4096..    6143:      38912..     40959:   2048:             unwritten
>    3:     6144..    8191:      40960..     43007:   2048:             unwritten
>    4:     8192..   10239:      43008..     45055:   2048:             unwritten
>    5:    10240..   12287:      45056..     47103:   2048:             unwritten
>    6:    12288..   14335:      47104..     49151:   2048:             unwritten
>    ....
>
> (This came from running "fallocate -o 0 -l 512M /mnt/foo" on a
> freshly formatted file system, running Linux 3.12.)
>
> Compare and contrast that with what "mke2fs -T hugefile /tmp/foo.img 1G"
> creates:
>
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..   32767:      24904..     57671:  32768:
>    1:    32768..   65535:      57672..     90439:  32768:
>    2:    65536..   98303:      90440..    123207:  32768:
>    3:    98304..  131071:     123208..    155975:  32768:
>
> This is a bug in how fallocate and mballoc are working together that
> we should fix, of course.  :-)   And come to think of it, I'm really
> surprised that the extent merging code isn't papering over the fact
> that mballoc is only handing back block allocations 2048 blocks at a
> time.

Does the following still apply for why ext4_can_extents_be_merged() refuses to
allow uninit extents to be merged?

"Make sure that both extents are initialized. We don't merge uninitialized
extents so that we can be sure that end_io code has the extent that was written
properly split out and conversion to initialized is trivial."

I removed the bits that prevent successful merging of uninit extents and each
2048 block allocation was (sometimes) appended to the previous extent, but I
didn't check against conversion races.  I'll include the patch at the foot.

> The other difference is the obvious one from the filefrag output,
> which is the data blocks are marked as initialized, instead of
> unwritten.  Yes, this brings up the whole controversy over the
> NO_HIDE_STALE flag, but if you are creating the fresh file system, the
> security issues are hopefully not as severe --- and I will eventually add
> support for zero'ing the files, or using discard to zero the data
> blocks, even if at work we really don't care about this because we
> trust the userspace programs that would be using these huge files.

It wouldn't be difficult to have some flags to mark the extent uninit and/or
zero the blocks.  Certainly mke2fs could just zero everything to make life
easier.

> Finally, to help eventually support userspace SMR-aware
> applications, one reason why it's useful to have mke2fs support
> creating the huge file is that it's much easier to make sure the file
> is appropriately aligned to begin at an SMR zone boundary.  This is not
> something we currently have any kernel/userspace interfaces to do, in
> terms of telling fallocate that you want to constrain the starting
> block number for the data blocks that you are asking it to
> fallocate(2) for you.

That seems like it would be useful...

> > Is this just to create a test image for e2fsck or similar?
>
> It is certainly useful for that, but the mk_hugefiles feature is one
> that I expect we would be using on production systems.
>
> It is definitely the case that writing this code has exposed all sorts
> of interesting bugs and performance shortcomings in libext2fs and
> e2fsprogs in general, so just creating this functionality as part of
> mke2fs was certainly a useful exercise in and of itself.
> :-)
>
> > It might make sense to include f_hugefiles/script and expect.1 for it?
>
> Oh, certainly.  This patch was much more of an RFC than anything else.
> And as I said, I'm still trying to figure out whether or not it makes
> sense to push this code upstream, or leave it as a Google internal
> enhancement.

<shrug> fuse2fs would use it, but I don't know that anyone cares about fuse2fs.

Well, here's a patch for all to enjoy.  xfstests didn't blow up when I ran it.

--D

From: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
Subject: [PATCH] ext4: merge uninitialized extents

Allow for merging uninitialized extents.

Signed-off-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
---
 fs/ext4/extents.c |   21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 3384dc4..7f0132d 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -1691,7 +1691,7 @@ ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
 	 * the extent that was written properly split out and conversion to
 	 * initialized is trivial.
 	 */
-	if (ext4_ext_is_uninitialized(ex1) || ext4_ext_is_uninitialized(ex2))
+	if (ext4_ext_is_uninitialized(ex1) != ext4_ext_is_uninitialized(ex2))
 		return 0;
 
 	ext1_ee_len = ext4_ext_get_actual_len(ex1);
@@ -1708,6 +1708,11 @@ ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
 	 */
 	if (ext1_ee_len + ext2_ee_len > EXT_INIT_MAX_LEN)
 		return 0;
+	if (ext4_ext_is_uninitialized(ex1) &&
+	    (ext4_test_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN) ||
+	     atomic_read(&EXT4_I(inode)->i_unwritten) ||
+	     (ext1_ee_len + ext2_ee_len > EXT_UNINIT_MAX_LEN)))
+		return 0;
 #ifdef AGGRESSIVE_TEST
 	if (ext1_ee_len >= 4)
 		return 0;
@@ -1731,7 +1736,7 @@ static int ext4_ext_try_to_merge_right(struct inode *inode,
 {
 	struct ext4_extent_header *eh;
 	unsigned int depth, len;
-	int merge_done = 0;
+	int merge_done = 0, uninit;
 
 	depth = ext_depth(inode);
 	BUG_ON(path[depth].p_hdr == NULL);
@@ -1741,8 +1746,11 @@ static int ext4_ext_try_to_merge_right(struct inode *inode,
 		if (!ext4_can_extents_be_merged(inode, ex, ex + 1))
 			break;
 		/* merge with next extent! */
+		uninit = ext4_ext_is_uninitialized(ex);
 		ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex)
 				+ ext4_ext_get_actual_len(ex + 1));
+		if (uninit)
+			ext4_ext_mark_uninitialized(ex);
 
 		if (ex + 1 < EXT_LAST_EXTENT(eh)) {
 			len = (EXT_LAST_EXTENT(eh) - ex - 1)
@@ -1896,7 +1904,7 @@ int ext4_ext_insert_extent(handle_t *handle, struct inode *inode,
 	struct ext4_ext_path *npath = NULL;
 	int depth, len, err;
 	ext4_lblk_t next;
-	int mb_flags = 0;
+	int mb_flags = 0, uninit;
 
 	if (unlikely(ext4_ext_get_actual_len(newext) == 0)) {
 		EXT4_ERROR_INODE(inode, "ext4_ext_get_actual_len(newext) == 0");
@@ -1946,9 +1954,11 @@ int ext4_ext_insert_extent(handle_t *handle, struct inode *inode,
 						  path + depth);
 			if (err)
 				return err;
-
+			uninit = ext4_ext_is_uninitialized(ex);
 			ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex)
 					+ ext4_ext_get_actual_len(newext));
+			if (uninit)
+				ext4_ext_mark_uninitialized(ex);
 			eh = path[depth].p_hdr;
 			nearex = ex;
 			goto merge;
@@ -1971,10 +1981,13 @@ prepend:
 			if (err)
 				return err;
 
+			uninit = ext4_ext_is_uninitialized(ex);
 			ex->ee_block = newext->ee_block;
 			ext4_ext_store_pblock(ex, ext4_ext_pblock(newext));
 			ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex)
 					+ ext4_ext_get_actual_len(newext));
+			if (uninit)
+				ext4_ext_mark_uninitialized(ex);
 			eh = path[depth].p_hdr;
 			nearex = ex;
 			goto merge;
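
For anyone reading the patch without the headers open, the repeated
"save uninit, rewrite ee_len, re-mark" pattern follows from how ee_len
encodes the unwritten state.  Below is a rough userspace sketch of that
encoding and of the relaxed merge test.  It is a simplified model, not
kernel code: the toy_* names and the struct are invented for illustration,
only the two EXT_*_MAX_LEN values are meant to match
fs/ext4/ext4_extents.h, and the EXT4_STATE_DIO_UNWRITTEN and i_unwritten
checks from the patch are left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* These two values mirror the kernel definitions; everything else in
 * this file is a hypothetical stand-in. */
#define EXT_INIT_MAX_LEN   (1U << 15)              /* 32768 */
#define EXT_UNINIT_MAX_LEN (EXT_INIT_MAX_LEN - 1)  /* 32767 */

/* Toy extent, host-endian for simplicity (on disk the fields are LE). */
struct toy_extent {
	uint32_t lblk;   /* first logical block */
	uint64_t pblk;   /* first physical block */
	uint16_t ee_len; /* values above 32768 mean "uninitialized" */
};

static inline bool toy_is_uninit(const struct toy_extent *ex)
{
	return ex->ee_len > EXT_INIT_MAX_LEN;
}

static inline unsigned int toy_actual_len(const struct toy_extent *ex)
{
	return toy_is_uninit(ex) ? ex->ee_len - EXT_INIT_MAX_LEN : ex->ee_len;
}

static inline void toy_mark_uninit(struct toy_extent *ex)
{
	ex->ee_len = (uint16_t)(ex->ee_len | EXT_INIT_MAX_LEN);
}

/* Relaxed test: both extents must share the same state, be adjacent
 * logically and physically, and fit in the length limit for that state.
 * The in-flight I/O checks from the patch are deliberately omitted. */
static inline bool toy_can_merge(const struct toy_extent *a,
				 const struct toy_extent *b)
{
	unsigned int la = toy_actual_len(a), lb = toy_actual_len(b);

	if (toy_is_uninit(a) != toy_is_uninit(b))
		return false;
	if (a->lblk + la != b->lblk || a->pblk + la != b->pblk)
		return false;
	if (la + lb > EXT_INIT_MAX_LEN)
		return false;
	if (toy_is_uninit(a) && la + lb > EXT_UNINIT_MAX_LEN)
		return false;
	return true;
}

/* Summing the actual lengths drops the uninit marker, so the state has
 * to be sampled first and restored afterwards. */
static inline void toy_merge_right(struct toy_extent *a,
				   const struct toy_extent *b)
{
	bool uninit = toy_is_uninit(a);

	assert(toy_can_merge(a, b));
	a->ee_len = (uint16_t)(toy_actual_len(a) + toy_actual_len(b));
	if (uninit)
		toy_mark_uninit(a);
}
```

Feeding it the first two extents from the fallocate listing (logical 0 and
2048, physical 34816 and 36864, 2048 blocks each, both unwritten) gives one
unwritten extent of 4096 blocks.  Two unwritten extents of 16384 blocks
each are refused, since an unwritten length of 32768 cannot be encoded.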