Re: [RFC PATCH 1/1] ext4: Try to better reuse recently freed space

Andreas Dilger <adilger@xxxxxxxxx> · Mon, 7 Apr 2014 12:22:30 -0600

On Mar 31, 2014, at 11:30 PM, Theodore Ts'o <tytso@xxxxxxx> wrote:
> On Mon, Dec 02, 2013 at 05:32:06PM +0100, Lukáš Czerner wrote:
>> Hi all,
>> 
>> this is the patch I send a while ago to fix the issue I've seen with
>> a global allocation goal. This might no longer apply to the current
>> kernel and it might not be the best approach, but I use this example
>> just to start a discussion about those allocation goals and how to
>> use, or change them.
>> 
>> I think that we agree that the long term fix would be to have free
>> extent map. But we might be able to do something quickly, unless
>> someone commits to make the free extent map reality :)
> 
> Hi Andreas,
> 
> We discussed possibly applying this patch last week at the ext4
> workshop, possibly as early as for the 3.15 merge window:
> 
> 	http://patchwork.ozlabs.org/patch/295956/

Sorry for the delay in replying, I was away the past couple of weeks
on vacation, and am heading this week to the Lustre conference.

> However, I'm guessing that Lustre has the workload which is most
> likely to regress if we were to simply apply this patch.  But, it's
> likely it will improve things for many/most other ext4 workloads.

I definitely agree that this will help with SSD storage, and thinp
workloads that end up doing random IO to the underlying storage.
The main reason for using the global goal is to maximize the chance
of getting large contiguous allocations, since it does a round-robin
traversal of the groups and maximizes the time that other blocks can
be freed in each group.  It also minimizes the churn of allocating
large files in groups that may not have large contiguous extents (e.g.
if small files/extents have just been freed in a group, but not all
of the blocks in that group are freed).

The global goal avoids not only the higher chance of finding free small
chunks for the file, but also the CPU overhead of re-scanning groups
that are partially full to find large free extents.

> We did talk about trying to assemble some block allocation performance
> tests so we can better measure proposed changes to the block
> allocator, but that's not something we have yet.  However, this global
> goal is definitely causing problems for a number of use cases,
> including thinp and being flash friendly.

An important question is whether thinp and flash are the main uses for ext4?
I could imagine ignoring the global goal if rotational == 0, or if some
large fraction of the filesystem was just deleted, but this change means
that each inode will have to re-scan the free groups to find large extents
and I think this will have a long-term negative impact on file extent size.

Note that the global goal is only used if the per-inode goal is already
full (i.e. ext4_mb_regular_allocator->ext4_mb_find_by_goal() fails), so
it will need to find a new allocation every time.

I think typical benchmarks that allocate and then free all space in a new
filesystem will not exercise this code properly.  Something that allocates
space but then only frees some of it, preferably multi-threaded.

> Would you be willing to apply this patch and then run some benchmarks
> to see if Lustre would be impacted negatively if we were to apply this
> patch for the next development cycle (i.e., not for 3.15, but for the
> next merge window)?

Since most of the Lustre developers are travelling this week, it will be
hard to get this patch suitably tested quickly.  Also, Lustre servers are
not running on 3.14 kernels yet, so the patch would need to be backported
to a kernel that the Lustre server is currently running on.

Some minor notes about the patch itself:
- if there is a desire to land this patch quickly, it would be great to
  have this behaviour at least selectable via mount option and/or /sys/fs
  tuneable that turns this off and on.  That would allow the patch to land
  and simplify testing, and disable it if it later shows regressions.
- the i_last_group and i_last_start are unsigned long, but only need to be
  unsigned int like fe_group (ext4_group_t) and fe_start (ext4_grpblk_t).
  That saves 8 bytes per inode that didn't matter in the superblock.
- there is no locking for the update of i_last_{group,start} if there are
  parallel writers on the same inode
- rather than dropping the global goal entirely, it would be better to have
  a list of groups that have relatively free space.  I know we also discussed
  using an rbtree to locate free extents instead of per-group buddy bitmaps.
  That would probably avoid the need for the global goal.

Cheers, Andreas

Attachment:
signature.asc

Description: Message signed with OpenPGP using GPGMail