Re: [PATCH V3 4/5] ext4: get discard out of jbd2 commit kthread contex

Jan Kara <jack@xxxxxxx> · Wed, 4 Aug 2021 17:45:30 +0200

On Sat 24-07-21 15:41:23, Wang Jianchao wrote:
> From: Wang Jianchao <wangjianchao@xxxxxxxxxxxx>
> 
> Right now, discard is issued and waited to be completed in jbd2
> commit kthread context after the logs are committed. When large
> amount of files are deleted and discard is flooding, jbd2 commit
> kthread can be blocked for long time. Then all of the metadata
> operations can be blocked to wait the log space.
> 
> One case is the page fault path with read mm->mmap_sem held, which
> wants to update the file time but has to wait for the log space.
> When other threads in the task wants to do mmap, then write mmap_sem
> is blocked. Finally all of the following read mmap_sem requirements
> are blocked, even the ps command which need to read the /proc/pid/
> -cmdline. Our monitor service which needs to read /proc/pid/cmdline
> used to be blocked for 5 mins.
> 
> This patch frees the blocks back to buddy after commit and then do
> discard in a async kworker context in fstrim fashion, namely,
>  - mark blocks to be discarded as used if they have not been allocated
>  - do discard
>  - mark them free
> After this, jbd2 commit kthread won't be blocked any more by discard
> and we won't get NOSPC even if the discard is slow or throttled.
> 
> Link: https://marc.info/?l=linux-kernel&m=162143690731901&w=2
> Suggested-by: Theodore Ts'o <tytso@xxxxxxx>
> Signed-off-by: Wang Jianchao <wangjianchao@xxxxxxxxxxxx>

Looks good to me. Just one small comment below. With that addressed feel
free to add:

Reviewed-by: Jan Kara <jack@xxxxxxx>

> @@ -3474,6 +3530,14 @@ int ext4_mb_release(struct super_block *sb)
>  	struct kmem_cache *cachep = get_groupinfo_cache(sb->s_blocksize_bits);
>  	int count;
>  
> +	if (test_opt(sb, DISCARD)) {
> +		/*
> +		 * wait the discard work to drain all of ext4_free_data
> +		 */
> +		queue_work(ext4_discard_wq, &sbi->s_discard_work);

Do we really need to queue the work here? The filesystem should be
quiescent by now, we take care to queue the work whenever we add item to
empty list. So it should be enough to have flush_work() here and then
possibly

	WARN_ON_ONCE(!list_empty(&sbi->s_discard_list))

Or am I missing something?

								Honza

> +		flush_work(&sbi->s_discard_work);
> +	}
> +
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR