Re: [RFC][PATCH 0/1] ext4: Fix ext4_mb_normalize_request

Andreas Dilger <adilger@xxxxxxxxx> · Tue, 24 Jun 2014 10:25:56 -0600

On Jun 24, 2014, at 6:36 AM, Lukáš Czerner <lczerner@xxxxxxxxxx> wrote:
> On Fri, 13 Jun 2014, Lukas Czerner wrote:
> 
>> Date: Fri, 13 Jun 2014 15:55:35 +0200
>> From: Lukas Czerner <lczerner@xxxxxxxxxx>
>> To: linux-ext4@xxxxxxxxxxxxxxx
>> Subject: [RFC][PATCH 0/1] ext4: Fix ext4_mb_normalize_request
>> 
>> This is my first attempt to fix the ext4_mb_normalize_request()
>> function in ext4 which deals with file preallocations.
>> 
>> This is not yet a final version as it needs more testing, however
>> I'd like to see some suggestions.
> 
> Does anyone have any comments on this and the related patch ?

Comments inline.

>> Currently there are a couple of problems with ext4_mb_normalize_request().
>> 
>> - We're trying to normalize unwritten extent allocations, which is
>>  entirely unnecessary, because the user knows exactly how much space
>>  he is going to need - no need for the file system to do preallocations.
>> 
>> - ext4_mb_normalize_request() unnecessarily divides bigger allocation
>>  requests to small ones (8MB). I believe that this is a bug rather
>>  than design.

The reason that the large requests were broken into smaller ones is
because it becomes increasingly difficult to find large contiguous
extents as the filesystem becomes more full.  If there was a single
buddy bitmap for the whole filesystem then it would be possible to
scan for e.g. 64MB extents of free blocks, but with the current code
this may require loading up a new block bitmap for each allocation.

It may be that, with the optimizations that have landed since the
mballoc code was originally written (caching the largest free extent
in memory for each group), this group descriptor walk is now fast
enough to handle large allocations.  In that case, the limit on the
number of groups to scan for an allocation may need to be increased.
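
To make that trade-off concrete, here is a toy model of such a walk.  It is
not the mballoc code: find_group(), largest_free[] and max_scan are names
made up for this illustration, and it only assumes that each group caches
the length of its largest free extent and that the scan gives up after a
fixed number of groups.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model, not kernel code.  largest_free[g] is the cached length, in
 * blocks, of the largest free extent in group g.  Scan at most max_scan
 * groups starting at 'start' (wrapping around) and return the first group
 * that can hold 'want' contiguous blocks, or -1 if none is found within
 * the scan limit.
 */
static long find_group(const unsigned int *largest_free, size_t ngroups,
                       size_t start, unsigned int want, size_t max_scan)
{
	assert(ngroups > 0 && start < ngroups);

	/* Never visit a group twice. */
	if (max_scan > ngroups)
		max_scan = ngroups;

	for (size_t i = 0; i < max_scan; i++) {
		size_t g = (start + i) % ngroups;

		if (largest_free[g] >= want)
			return (long)g;
	}
	return -1;
}
```

With 4KB blocks a 64MB request is 16384 blocks, so a small max_scan can
give up before reaching a group that a longer walk would have found, which
is the argument for raising the limit if the walk itself is cheap.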

>> - For smaller allocations (or smaller files) we do not even respect the
>>  fe_logical. Although we do respect it for bigger files.

This is done so it is possible to pack many small allocations into a
single large RAID stripe to avoid read-modify-write overhead.

>> - Overall the logic within ext4_mb_normalize_request() is weird and
>>  no-one really understands why it is the way it is.
>> 
>> Fix all of this by:
>> 
>> - Disabling preallocation for unwritten extent allocation. However
>>  because the maximum size of the unwritten extent is one block smaller
>>  than written, in order to avoid unnecessary fragmentation we limit the
>>  request to EXT_INIT_MAX_LEN / 2

That should work out well.  Once the extents are converted to initialized
extents they can be merged, and this will also leave some room to split
uninitialized extents if they are not completely overwritten.
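
The arithmetic behind that limit can be sketched as follows.  The two
constants are quoted from memory of fs/ext4/ext4_extents.h around 3.15 and
should be checked against the tree; extent_bytes() and extents_needed() are
helpers invented for this example.

```c
#include <assert.h>

/* Assumed values, see fs/ext4/ext4_extents.h in your tree. */
#define EXT_INIT_MAX_LEN	(1UL << 15)		/* 32768 blocks */
#define EXT_UNINIT_MAX_LEN	(EXT_INIT_MAX_LEN - 1)	/* 32767 blocks */

/* Size in bytes of an extent of 'blocks' blocks. */
static unsigned long long extent_bytes(unsigned long blocks,
				       unsigned long blocksize)
{
	assert(blocksize > 0);
	return (unsigned long long)blocks * blocksize;
}

/* Extents needed to cover 'blocks' blocks if each holds at most max_len. */
static unsigned long extents_needed(unsigned long blocks,
				    unsigned long max_len)
{
	assert(max_len > 0);
	return (blocks + max_len - 1) / max_len;
}
```

With 4096-byte blocks, EXT_INIT_MAX_LEN / 2 is 16384 blocks, or 64MB.  Two
such pieces add up to 32768 blocks, one more than an uninitialized extent
can hold, so they cannot be merged while unwritten but do fit in a single
initialized extent once converted.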

>> - Get rid of the "if table" in ext4_mb_normalize_request() and replace
>>  it with simply aligning the assumed end of the file up to power of
>>  two. But we still limit the allocation size to EXT4_BLOCKS_PER_GROUP.
>>  Also do this on file system block units to take into account different
>>  block sized file systems.

We have a patch to make this table tunable, but in the end we never used
anything other than the default power-of-two values, so I don't think it
is a loss to remove this code.  That said, it would actually be better to
align with the s_stripe_width instead of the next power-of-two value.

It is important to note that having more extents does not have a significant
performance impact if they are at least 4-8MB in size, but if the allocator
causes read-modify-write on a RAID array, that can cut performance in half.
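
A minimal sketch of the difference between the two alignments, assuming
lengths and the stripe width are in filesystem blocks and the allocation
starts on a stripe boundary.  The helper names and the 384-block stripe used
in the note afterwards are hypothetical, not taken from the patch or from
mballoc.

```c
#include <assert.h>
#include <limits.h>

/* Round n up to the next power of two (n must be at least 1). */
static unsigned long roundup_pow2(unsigned long n)
{
	unsigned long p = 1;

	/* Upper bound keeps the shift below from overflowing. */
	assert(n >= 1 && n <= ULONG_MAX / 2 + 1);
	while (p < n)
		p <<= 1;
	return p;
}

/* Round n up to the next multiple of the stripe width. */
static unsigned long roundup_stripe(unsigned long n, unsigned long stripe)
{
	assert(stripe > 0);
	return (n + stripe - 1) / stripe * stripe;
}

/*
 * Nonzero if [start, start + len) begins or ends inside a stripe, i.e.
 * the write covers a partial stripe and a parity RAID has to
 * read-modify-write it.
 */
static int needs_rmw(unsigned long start, unsigned long len,
		     unsigned long stripe)
{
	assert(stripe > 0);
	return (start % stripe) != 0 || ((start + len) % stripe) != 0;
}
```

For a 1000-block request on a 384-block stripe, rounding up to a power of
two gives 1024 blocks, which ends 256 blocks into the third stripe, while
rounding up to the stripe width gives 1152 blocks, exactly three full
stripes.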

>> It passes xfstests cleanly in default configuration, I've not tried any
>> non-default options yet.
>> 
>> I've tried to test how much it changes allocation. The test and some results
>> can be found at
>> 
>> http://people.redhat.com/lczerner/mballoc/
>> 
>> normalize.sh is the simple script I run and output.normalize_orig[34]
>> contain the results from vanilla 3.15.0 while output.normalize_patch[56]
>> contain the results with this patch.
>> 
>> From the performance stand point I do not see any major differences except
>> that untar seems to always generate better results (which might be because
>> of bigger continuous extents).

Actually, looking at the results, while the extent allocations look more
"regular" with the patched code, the actual performance is significantly
worse for some tests, both in terms of speed and in terms of free space
fragmentation:

orig3:					patch5:
[+] Fallocate test			[+] Fallocate test

real	0m0.216s			real	0m0.290s
user	0m0.000s			user	0m0.001s
sys	0m0.061s			sys	0m0.037s
Device: /dev/loop0			Device: /dev/loop0
Blocksize: 4096 bytes			Blocksize: 4096 bytes
Total blocks: 22282240			Total blocks: 22282240
Free blocks: 3535514 (15.9%)		Free blocks: 3535512 (15.9%)

Min. free extent: 32 KB 		Min. free extent: 32 KB 
Max. free extent: 2064256 KB		Max. free extent: 2064256 KB
Avg. free extent: 235700 KB		Avg. free extent: 228096 KB

orig3:					patch5:
[+] Copy linux source			[+] Copy linux source

real	4m17.888s			real	5m24.326s
user	0m2.265s			user	0m2.486s
sys	2m4.205s			sys	2m34.918s
Device: /dev/loop0			Device: /dev/loop0
Blocksize: 4096 bytes			Blocksize: 4096 bytes
Total blocks: 22282240			Total blocks: 22282240
Free blocks: 17536027 (78.7%)		Free blocks: 17536042 (78.7%)

Min. free extent: 4 KB 			Min. free extent: 4 KB 
Max. free extent: 2064256 KB		Max. free extent: 2064256 KB
Avg. free extent: 267724 KB		Avg. free extent: 209384 KB

orig3:					patch5:
[+] Untar linux source			[+] Untar linux source

real	3m34.945s			real	3m43.807s
user	0m3.459s			user	0m3.687s
sys	1m35.126s			sys	1m42.839s
Device: /dev/loop0			Device: /dev/loop0
Blocksize: 4096 bytes			Blocksize: 4096 bytes
Total blocks: 22282240			Total blocks: 22282240
Free blocks: 8852805 (39.7%)		Free blocks: 8852831 (39.7%)

Min. free extent: 4 KB 			Min. free extent: 4 KB 
Max. free extent: 2064256 KB		Max. free extent: 2064256 KB
Avg. free extent: 102936 KB		Avg. free extent: 72120 KB

The same is true for the "single dd" and "multiple dd" tests.  The only one
that shows somewhat better performance and fragmentation results is fsstress,
but I wouldn't exactly call that representative of normal user workloads.
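
For anyone who wants to put numbers on the columns above, a small helper
along these lines is enough.  to_seconds() and pct_change() are names made
up for this example; the inputs are the "real" times and "Avg. free extent"
values quoted from orig3 and patch5.

```c
#include <assert.h>

/* Convert a time(1)-style "XmY.YYYs" value to seconds. */
static double to_seconds(unsigned int minutes, double seconds)
{
	return minutes * 60.0 + seconds;
}

/* Relative change from 'before' to 'after', in percent. */
static double pct_change(double before, double after)
{
	assert(before > 0.0);
	return (after - before) / before * 100.0;
}
```

For example, pct_change(to_seconds(4, 17.888), to_seconds(5, 24.326)) puts
the patched "Copy linux source" run at roughly 26% slower, and
pct_change(267724, 209384) shows the average free extent after that test
shrinking by roughly 22%.  For the untar test the same figures are about 4%
slower and about 30% smaller.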

>> Free space fragmentation seems to be about the same, however with the patch
>> there seem to be fewer small free space extents and more big ones, which
>> is expected due to bigger preallocations (and I think it's a good thing).

Hmm, this is exactly the opposite of what I see in the output files?

>> The biggest difference which is obvious from the results is that extent tree
>> is much smaller (sometimes five times smaller) with the patch. Except for the
>> fallocate case because we now limit the requests to (EXT_INIT_MAX_LEN / 2)
>> so we can not merge them - it might be worth experimenting with something
>> smaller which is a factor of unwritten extent size.
>> 
>> But as I said the extent tree is much smaller which means that the extents
>> overall are bigger which again is a good thing. This becomes very obvious
>> when we look at the extent tree of the image file (the last steps in the
>> test).
>> 
>> What do you think ?

I definitely agree that there is work to be done to improve this code, but
since the results are quite mixed I think it makes sense to separate out
some of the changes into different patches and test them independently.  That
will simplify the isolation of which changes are affecting the performance.

Cheers, Andreas
