On 2021/5/28 4:18 AM, Andreas Dilger wrote:
> On May 26, 2021, at 2:44 AM, Wang Jianchao <jianchao.wan9@xxxxxxxxx> wrote:
>>
>> Right now, discard is issued and waited on in the jbd2 commit
>> kthread context after the logs are committed. When a large number
>> of files are deleted and discard is flooding, the jbd2 commit
>> kthread can be blocked for a long time. Then all of the metadata
>> operations can be blocked waiting for log space.
>>
>> One case is the page fault path with mm->mmap_sem held for read,
>> which wants to update the file time but has to wait for log space.
>> When other threads in the task want to do mmap, the write lock on
>> mmap_sem is blocked. Finally, all subsequent read requests for
>> mmap_sem are blocked as well, even the ps command, which needs to
>> read /proc/pid/cmdline. Our monitoring service, which needs to read
>> /proc/pid/cmdline, used to be blocked for 5 minutes.
>>
>> This patch frees the blocks back to the buddy system after commit
>> and then does the discard in an async kworker context in fstrim
>> fashion, namely:
>> - mark blocks to be discarded as used if they have not been allocated
>> - do discard
>> - mark them free
>> After this, the jbd2 commit kthread won't be blocked by discard any
>> more, and we won't get ENOSPC even if the discard is slow or
>> throttled.
>
> I definitely agree that sharing the existing fstrim functionality makes
> the most sense here. Some comments inline on the implementation.
>
>> Link: https://marc.info/?l=linux-kernel&m=162143690731901&w=2
>> Suggested-by: Theodore Ts'o <tytso@xxxxxxx>
>> Signed-off-by: Wang Jianchao <wangjianchao@xxxxxxxxxxxx>
>> ---
>>  fs/ext4/ext4.h    |   2 +
>>  fs/ext4/mballoc.c | 162 +++++++++++++++++++++++++++++++++---------------------
>>  fs/ext4/mballoc.h |   3 -
>>  3 files changed, 101 insertions(+), 66 deletions(-)
>>
>> @@ -3024,30 +3039,77 @@ static inline int ext4_issue_discard(struct super_block *sb,
>>  	return sb_issue_discard(sb, discard_block, count, GFP_NOFS, 0);
>>  }
>>
>> -static void ext4_free_data_in_buddy(struct super_block *sb,
>> -				    struct ext4_free_data *entry)
>> +static void ext4_discard_work(struct work_struct *work)
>>  {
>> +	struct ext4_sb_info *sbi = container_of(work,
>> +			struct ext4_sb_info, s_discard_work);
>> +	struct super_block *sb = sbi->s_sb;
>> +	ext4_group_t ngroups = ext4_get_groups_count(sb);
>> +	struct ext4_group_info *grp;
>> +	struct ext4_free_data *fd, *nfd;
>>  	struct ext4_buddy e4b;
>> -	struct ext4_group_info *db;
>> -	int err, count = 0, count2 = 0;
>> +	int i, err;
>> +
>> +	for (i = 0; i < ngroups; i++) {
>> +		grp = ext4_get_group_info(sb, i);
>> +		if (RB_EMPTY_ROOT(&grp->bb_discard_root))
>> +			continue;
>
> For large filesystems there may be millions of block groups, so it
> seems inefficient to scan all of the groups each time the work queue

Yes, it seems to be. When I cooked the patch, I assumed that since the
kworker runs in the background, it would not be a big deal.

> is run. Having a list of block group numbers, or bitmap/rbtree/xarray
> of the group numbers that need to be trimmed may be more efficient?

Maybe we can use a bitmap to record the block groups that need to be
trimmed; see the sketch at the end of this mail.

Best regards
Jianchao

> Most of the complexity in the rest of the patch goes away if the trim
> tracking is only done on a whole-group basis (min/max or just a single
> bit per group).
>
> Cheers, Andreas
>
>> -	mb_debug(sb, "gonna free %u blocks in group %u (0x%p):",
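
P.S. To make the fstrim-style sequence from the changelog concrete, here
is a minimal sketch (not the patch code itself) of trimming one
already-freed range in a group. The helpers ext4_mb_load_buddy(),
mb_mark_used(), mb_free_blocks() and ext4_issue_discard() all exist in
fs/ext4/mballoc.c, though their exact signatures and the locking and
range-iteration details in the real patch differ:

/*
 * Illustrative sketch only: mark a freed range busy, discard it,
 * then return it to the buddy system as free.
 */
static int ext4_trim_range_sketch(struct super_block *sb, ext4_group_t group,
				  ext4_grpblk_t start, ext4_grpblk_t count)
{
	struct ext4_free_extent ex;
	struct ext4_buddy e4b;
	int err;

	err = ext4_mb_load_buddy(sb, group, &e4b);
	if (err)
		return err;

	ext4_lock_group(sb, group);
	/* 1. Mark the blocks as used so nobody can allocate them. */
	ex.fe_group = group;
	ex.fe_start = start;
	ex.fe_len = count;
	mb_mark_used(&e4b, &ex);
	ext4_unlock_group(sb, group);

	/* 2. Issue the discard without holding the group lock. */
	err = ext4_issue_discard(sb, group, start, count);

	/* 3. Return the blocks to the buddy system as free again. */
	ext4_lock_group(sb, group);
	mb_free_blocks(NULL, &e4b, start, count);
	ext4_unlock_group(sb, group);

	ext4_mb_unload_buddy(&e4b);
	return err;
}

The point is that the range stays marked busy in the buddy bitmap while
the discard is in flight, so the allocator can't hand it out and the
jbd2 commit kthread never has to wait on the discard.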
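
And a hypothetical sketch of the bitmap idea. s_discard_bitmap is an
invented ext4_sb_info field used only for illustration (it is not in
the posted patch), and clearing a bit before trimming the group would
race with new insertions, so a real implementation would need to close
that window, e.g. under the group lock:

/*
 * Hypothetical: sbi->s_discard_bitmap would be a new field, one bit
 * per block group, allocated at mount time.  The producer side would
 * do set_bit(group, sbi->s_discard_bitmap) wherever a free extent is
 * queued on grp->bb_discard_root, before kicking s_discard_work.
 */
static void ext4_discard_work_sketch(struct work_struct *work)
{
	struct ext4_sb_info *sbi = container_of(work,
			struct ext4_sb_info, s_discard_work);
	struct super_block *sb = sbi->s_sb;
	ext4_group_t ngroups = ext4_get_groups_count(sb);
	struct ext4_group_info *grp;
	unsigned int i;

	/* Visit only the groups that actually have queued extents. */
	for_each_set_bit(i, sbi->s_discard_bitmap, ngroups) {
		clear_bit(i, sbi->s_discard_bitmap);
		grp = ext4_get_group_info(sb, i);
		/* ... trim the entries on grp->bb_discard_root ... */
	}
}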