On Wed, 25 Jun 2014, Lukáš Czerner wrote:

> Date: Wed, 25 Jun 2014 15:43:43 +0200 (CEST)
> From: Lukáš Czerner <lczerner@xxxxxxxxxx>
> To: Andreas Dilger <adilger@xxxxxxxxx>
> Cc: Ext4 Developers List <linux-ext4@xxxxxxxxxxxxxxx>,
>     Alex Zhuravlev <alexey.zhuravlev@xxxxxxxxx>
> Subject: Re: [RFC][PATCH 0/1] ext4: Fix ext4_mb_normalize_request
>
> On Tue, 24 Jun 2014, Andreas Dilger wrote:
>
> > Date: Tue, 24 Jun 2014 10:25:56 -0600
> > From: Andreas Dilger <adilger@xxxxxxxxx>
> > To: Lukáš Czerner <lczerner@xxxxxxxxxx>
> > Cc: Ext4 Developers List <linux-ext4@xxxxxxxxxxxxxxx>,
> >     Alex Zhuravlev <alexey.zhuravlev@xxxxxxxxx>
> > Subject: Re: [RFC][PATCH 0/1] ext4: Fix ext4_mb_normalize_request
> >
> > On Jun 24, 2014, at 6:36 AM, Lukáš Czerner <lczerner@xxxxxxxxxx> wrote:
> > > On Fri, 13 Jun 2014, Lukas Czerner wrote:
> > >
> > >> Date: Fri, 13 Jun 2014 15:55:35 +0200
> > >> From: Lukas Czerner <lczerner@xxxxxxxxxx>
> > >> To: linux-ext4@xxxxxxxxxxxxxxx
> > >> Subject: [RFC][PATCH 0/1] ext4: Fix ext4_mb_normalize_request
> > >>
> > >> This is my first attempt to fix the ext4_mb_normalize_request()
> > >> function in ext4 which deals with file preallocations.
> > >>
> > >> This is not yet a final version as it needs more testing, however
> > >> I'd like to see some suggestions.
> > >
> > > Does anyone have any comments on this and the related patch?
> >
> > Comments inline.
> >
> > >> Currently there are a couple of problems with ext4_mb_normalize_request():
> > >>
> > >> - We're trying to normalize unwritten extent allocations, which is
> > >>   entirely unnecessary because the user knows exactly how much space
> > >>   they are going to need - there is no need for the file system to do
> > >>   preallocation.
> > >>
> > >> - ext4_mb_normalize_request() unnecessarily divides bigger allocation
> > >>   requests into small ones (8MB). I believe that this is a bug rather
> > >>   than a design decision.
> >
> > The reason that the large requests were broken into smaller ones is
> > because it becomes increasingly difficult to find large contiguous
> > extents as the filesystem becomes more full. If there was a single
> > buddy bitmap for the whole filesystem then it would be possible to
> > scan for e.g. 64MB extents of free blocks, but with the current code
> > this may require loading a new block bitmap for each allocation.
> >
> > It may be that with the optimizations that have landed since the
> > mballoc code was originally written (to cache the largest free extent
> > in memory for each group), this group descriptor walk may be fast
> > enough to handle large allocations. In that case, the limit on the
> > number of groups to scan for an allocation may need to be increased.
>
> I think that it definitely got better; moreover, with features like
> flex_bg we pack things more closely together, making it easier for
> readahead to actually be useful.
>
> This was mostly an experiment in how we can change it and what the
> results would be. However, I am a bit puzzled by your answer. I do
> understand that making the allocation unnecessarily large is harder
> for the allocator, there is no question about that. We might want to
> add an automatic tweak to make it smaller as the file system fills
> up, or as the free space becomes more fragmented. However, this
> applies to rounding up the allocation request for speculative
> preallocation.
>
> What I described was not that case; the allocation was always capped
> at 8MB, and I do not think that was a design decision but rather a
> bug.
> See the code...
>
> 	} else if (NRL_CHECK_SIZE(ac->ac_o_ex.fe_len,
> 					(8<<20)>>bsbits, max, 8 * 1024)) {
> 		start_off = ((loff_t)ac->ac_o_ex.fe_logical >>
> 							(23 - bsbits)) << 23;
> 		size = 8 * 1024 * 1024;
> 	} else {
> 		start_off = (loff_t)ac->ac_o_ex.fe_logical << bsbits;
> 		size      = ac->ac_o_ex.fe_len << bsbits;
> 	}
>
> The last "else" is there exactly for the big allocations, however as
> it stands the preceding "else if" condition will never be false when
> we get to it, simply because:
>
> 	max = 2 << bsbits;
>
> 	#define NRL_CHECK_SIZE(req, size, max, chunk_size)	\
> 		(req <= (size) || max <= (chunk_size))
>
> which means that (max <= 8 * 1024) will always be true no matter what
> the block size is (well, for block sizes up to the 4k maximum).
>
> Or do you think that having several smaller allocations is faster
> than one bigger allocation? I do not think so; if nothing else, the
> file might get more fragmented, as shown by the tests.
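> To make that concrete, here is a minimal userspace sketch (not the
> kernel code, just the same macro and the same "max = 2 << bsbits"
> arithmetic quoted above) showing that for every block size from 1k to
> 4k the (max <= chunk_size) half of NRL_CHECK_SIZE() is already true,
> so the 8MB branch is always taken and the final "else" is dead code:
>
> 	#include <stdio.h>
>
> 	#define NRL_CHECK_SIZE(req, size, max, chunk_size)	\
> 		((req) <= (size) || (max) <= (chunk_size))
>
> 	int main(void)
> 	{
> 		/* bsbits 10..12 covers 1k, 2k and 4k block sizes */
> 		for (int bsbits = 10; bsbits <= 12; bsbits++) {
> 			unsigned long max = 2 << bsbits;
> 			/* a huge request, far bigger than 8MB worth of blocks */
> 			unsigned long req = 1UL << 30;
>
> 			printf("bsbits=%d max=%lu taken=%d\n", bsbits, max,
> 			       NRL_CHECK_SIZE(req, (8 << 20) >> bsbits,
> 					      max, 8 * 1024));
> 		}
> 		return 0;
> 	}
>
> It prints taken=1 for all three block sizes, i.e. the 8MB cap always
> wins no matter how large the original request is.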
> > >> - For smaller allocations (or smaller files) we do not even respect
> > >>   the fe_logical. Although we do respect it for bigger files.
> >
> > This is done so it is possible to pack many small allocations into a
> > single large RAID stripe to avoid read-modify-write overhead.
>
> I do not think that ext4_mb_normalize_request() is supposed to care
> about RAID. Also, ac_g_ex.fe_logical does not seem to play any role
> in RAID alignment, or am I mistaken?
>
> In the case of a small file it will be set to 0.
>
> > >> - Overall the logic within ext4_mb_normalize_request() is weird and
> > >>   no-one really understands why it is the way it is.
> > >>
> > >> Fix all of this by:
> > >>
> > >> - Disabling preallocation for unwritten extent allocation. However,
> > >>   because the maximum size of an unwritten extent is one block smaller
> > >>   than a written one, in order to avoid unnecessary fragmentation we
> > >>   limit the request to EXT_INIT_MAX_LEN / 2.
> >
> > That should work out well. Once the extents are converted to initialized
> > extents they can be merged, and this will also leave some room to split
> > uninitialized extents if they are not completely overwritten.
> >
> > >> - Get rid of the "if table" in ext4_mb_normalize_request() and replace
> > >>   it with simply aligning the assumed end of the file up to a power of
> > >>   two. But we still limit the allocation size to EXT4_BLOCKS_PER_GROUP.
> > >>   Also do this in file system block units to take into account
> > >>   different block sized file systems.
> >
> > We have a patch to make this table tunable, but in the end we never used
> > anything other than the default power-of-two values, so I don't think it
> > is a loss to remove this code. That said, it would actually be better to
> > align with the s_stripe_width instead of the next power-of-two value.
>
> Good point, we can align it to s_stripe_width after rounding up to the
> nearest power of two, though for smaller allocations this might not be
> so helpful.
>
> > It is important to note that having more extents does not have a
> > significant performance impact if they are at least 4-8MB in size, but
> > if the allocator causes read-modify-write on a RAID array it can cut
> > performance in half.
>
> I agree, however ext4_mb_normalize_request() is mostly about
> preallocation; it does not make any allocation decisions based on the
> RAID geometry. That said, I agree that we can align the size to
> s_stripe_width.
>
> > >> It passes xfstests cleanly in the default configuration, I've not
> > >> tried any non-default options yet.
> > >>
> > >> I've tried to test how much it changes allocation. The test and some
> > >> results can be found at
> > >>
> > >> http://people.redhat.com/lczerner/mballoc/
> > >>
> > >> normalize.sh is the simple script I run; output.normalize_orig[34]
> > >> contains results from vanilla 3.15.0, while output.normalize_patch[56]
> > >> contains results with this patch.
> > >>
> > >> From the performance standpoint I do not see any major differences
> > >> except that untar seems to always generate better results (which might
> > >> be because of bigger contiguous extents).
> >
> > Actually, looking at the results, while the extent allocations look more
> > "regular" with the patched code, the actual performance is significantly
> > worse for some tests, both in terms of speed and in terms of free space
> > fragmentation:
>
> The test is far from perfect and the results vary widely between runs,
> but it does not look like it's worse with the patch:
>
>                  Orig3       Patch5      Orig4       Patch6
> -------------------------------------------------------------------------
> Fallocate        0m0.276s    0m0.324s    0m0.258s    0m0.197s
> Copy             4m58.145s   5m37.090s   4m43.531s   4m5.406s
> Tar              6m10.526s   6m5.518s    5m42.999s   5m32.376s
> Untar            7m23.806s   7m6.787s    7m27.812s   7m11.225s
> Single_dd        0m15.979s   0m16.504s   0m18.130s   0m16.358s
> Multiple_dd      3m13.562s   3m15.353s   3m11.031s   3m14.018s
> Fsstress         0m6.401s    0m7.994s    0m6.555s    0m6.375s
> -------------------------- TEST ON IMAGE FILE ---------------------------
> Fallocate        0m0.216s    0m0.290s    0m0.272s    0m0.206s
> Copy             4m17.888s   5m24.326s   4m14.834s   4m19.867s
> Tar              4m31.356s   4m39.639s   4m33.940s   4m37.231s
> Untar            3m34.945s   3m43.807s   4m4.969s    3m28.390s
> Single_dd        0m14.875s   0m16.735s   0m19.060s   0m16.500s
> Multiple_dd      3m14.358s   3m24.676s   3m21.938s   3m41.215s
> Fsstress         0m11.472s   0m9.079s    0m9.705s    0m11.167s
>
> Nice table, I should have done that before ;) So you can see that the
> results really vary too much between runs, but there are some patterns
> to be seen.
>
> Here the patched ext4 was faster, although the difference is too small
> to be taken seriously. I need to do more runs and proper statistics.
>
> Tar              6m10.526s   6m5.518s    5m42.999s   5m32.376s
>
> Here the patched ext4 is slower, but the difference is again
> insignificant.
>
> Multiple_dd      3m13.562s   3m15.353s   3m11.031s   3m14.018s
>
> So I'd say that as far as running times are concerned there appears to
> be no significant difference, but more testing and better statistics
> should be done.
>
> You're right that the Avg. free extent seems to be slightly smaller,
> however looking at the free extent histogram I think that the Weighted
> Average would show results more favourable for the patched ext4.

Sorry, I meant to say median.

> The reason is that currently we're packing allocations closely
> together, which gives us a few more bigger free extents, however we
> also have many more smaller ones. Whereas with my patch there seem to
> be fewer really small extents and more medium sized extents (4M-128M).

Which, along with more contiguous files and fewer extents, seems like a
good thing.

-Lukas

> But that needs to be confirmed.
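> As a toy illustration (made-up extent sizes, not the numbers from the
> test output below): a bunch of small free extents sitting next to one
> huge one pulls the plain average way up, while the median still
> reflects what most of the free extents actually look like:
>
> 	#include <stdio.h>
> 	#include <stdlib.h>
>
> 	static int cmp_ul(const void *a, const void *b)
> 	{
> 		unsigned long x = *(const unsigned long *)a;
> 		unsigned long y = *(const unsigned long *)b;
> 		return (x > y) - (x < y);
> 	}
>
> 	int main(void)
> 	{
> 		/* hypothetical free extent sizes in KB: many small, one huge */
> 		unsigned long ext[] = { 4, 8, 8, 16, 32, 64, 2064256 };
> 		size_t n = sizeof(ext) / sizeof(ext[0]);
> 		unsigned long long sum = 0;
>
> 		for (size_t i = 0; i < n; i++)
> 			sum += ext[i];
> 		qsort(ext, n, sizeof(ext[0]), cmp_ul);
>
> 		printf("mean   = %llu KB\n", sum / n);   /* 294912 KB */
> 		printf("median = %lu KB\n", ext[n / 2]); /* 16 KB     */
> 		return 0;
> 	}
>
> Which is why a single "Avg. free extent" number can hide what the
> histogram shows.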
> > orig3:                               patch5:
> > [+] Fallocate test                   [+] Fallocate test
> >
> > real 0m0.216s                        real 0m0.290s
> > user 0m0.000s                        user 0m0.001s
> > sys  0m0.061s                        sys  0m0.037s
> > Device:       /dev/loop0             Device:       /dev/loop0
> > Blocksize:    4096 bytes             Blocksize:    4096 bytes
> > Total blocks: 22282240               Total blocks: 22282240
> > Free blocks:  3535514 (15.9%)        Free blocks:  3535512 (15.9%)
> >
> > Min. free extent: 32 KB              Min. free extent: 32 KB
> > Max. free extent: 2064256 KB         Max. free extent: 2064256 KB
> > Avg. free extent: 235700 KB          Avg. free extent: 228096 KB
> >
> > orig3:                               patch5:
> > [+] Copy linux source                [+] Copy linux source
> >
> > real 4m17.888s                       real 5m24.326s
> > user 0m2.265s                        user 0m2.486s
> > sys  2m4.205s                        sys  2m34.918s
> > Device:       /dev/loop0             Device:       /dev/loop0
> > Blocksize:    4096 bytes             Blocksize:    4096 bytes
> > Total blocks: 22282240               Total blocks: 22282240
> > Free blocks:  17536027 (78.7%)       Free blocks:  17536042 (78.7%)
> >
> > Min. free extent: 4 KB               Min. free extent: 4 KB
> > Max. free extent: 2064256 KB         Max. free extent: 2064256 KB
> > Avg. free extent: 267724 KB          Avg. free extent: 209384 KB
> >
> > orig3:                               patch5:
> > [+] Untar linux source               [+] Untar linux source
> >
> > real 3m34.945s                       real 3m43.807s
> > user 0m3.459s                        user 0m3.687s
> > sys  1m35.126s                       sys  1m42.839s
> > Device:       /dev/loop0             Device:       /dev/loop0
> > Blocksize:    4096 bytes             Blocksize:    4096 bytes
> > Total blocks: 22282240               Total blocks: 22282240
> > Free blocks:  8852805 (39.7%)        Free blocks:  8852831 (39.7%)
> >
> > Min. free extent: 4 KB               Min. free extent: 4 KB
> > Max. free extent: 2064256 KB         Max. free extent: 2064256 KB
> > Avg. free extent: 102936 KB          Avg. free extent: 72120 KB
> >
> > The same is true for the "single dd" and "multiple dd" tests. The only
> > one that shows somewhat better performance and fragmentation results is
> > fsstress, but I wouldn't exactly call that representative of normal
> > user workloads.
> >
> > >> Free space fragmentation seems to be about the same, however with the
> > >> patch there seem to be fewer small free space extents and more big
> > >> ones, which is expected due to bigger preallocations (and I think it's
> > >> a good thing).
> >
> > Hmm, this is exactly the opposite of what I see in the output files?
>
> See my explanation above; I need to do a better job at presenting the
> numbers in a more understandable way.
>
> > >> The biggest difference, which is obvious from the results, is that the
> > >> extent tree is much smaller (sometimes five times smaller) with the
> > >> patch. Except for the fallocate case, because we now limit the requests
> > >> to (EXT_INIT_MAX_LEN / 2) so we cannot merge them - it might be worth
> > >> experimenting with something smaller that is a factor of the unwritten
> > >> extent size.
> > >>
> > >> But as I said, the extent tree is much smaller, which means that the
> > >> extents overall are bigger, which again is a good thing. This becomes
> > >> very obvious when we look at the extent tree of the image file (the
> > >> last steps in the test).
> > >>
> > >> What do you think?
> >
> > I definitely agree that there is work to be done to improve this code,
> > but since the results are quite mixed I think it makes sense to separate
> > out some of the changes into different patches and test them
> > independently. That will simplify the isolation of which changes are
> > affecting the performance.
>
> I agree, more experimentation is needed.
>
> Thanks for taking the time to look at this!
> -Lukas
>
> > Cheers, Andreas