On 2021/5/28 4:18 AM, Andreas Dilger wrote:
> On May 26, 2021, at 2:44 AM, Wang Jianchao <jianchao.wan9@xxxxxxxxx> wrote:
>>
>> Right now, discard is issued and waited on in the jbd2 commit
>> kthread context after the logs are committed. When a large number
>> of files are deleted and discard is flooding, the jbd2 commit
>> kthread can be blocked for a long time. Then all of the metadata
>> operations can be blocked waiting for log space.
>>
>> One case is the page fault path with mm->mmap_sem held for read,
>> which wants to update the file time but has to wait for log space.
>> When other threads in the task want to do mmap, the write lock on
>> mmap_sem is blocked. Finally, all subsequent read requests for
>> mmap_sem are blocked as well, even the ps command, which needs to
>> read /proc/pid/cmdline. Our monitoring service, which needs to read
>> /proc/pid/cmdline, used to be blocked for 5 minutes.
>>
>> This patch frees the blocks back to the buddy system after commit
>> and then does the discard in an async kworker context in fstrim
>> fashion, namely:
>> - mark blocks to be discarded as used if they have not been allocated
>> - do discard
>> - mark them free
>> After this, the jbd2 commit kthread won't be blocked by discard any
>> more, and we won't get ENOSPC even if the discard is slow or
>> throttled.
>
> I definitely agree that sharing the existing fstrim functionality makes
> the most sense here. Some comments inline on the implementation.
>
>> Link: https://marc.info/?l=linux-kernel&m=162143690731901&w=2
>> Suggested-by: Theodore Ts'o <tytso@xxxxxxx>
>> Signed-off-by: Wang Jianchao <wangjianchao@xxxxxxxxxxxx>
>> ---
>>  fs/ext4/ext4.h    |   2 +
>>  fs/ext4/mballoc.c | 162 +++++++++++++++++++++++++++++++++---------------------
>>  fs/ext4/mballoc.h |   3 -
>>  3 files changed, 101 insertions(+), 66 deletions(-)
>>
>> @@ -3024,30 +3039,77 @@ static inline int ext4_issue_discard(struct super_block *sb,
>>  	return sb_issue_discard(sb, discard_block, count, GFP_NOFS, 0);
>>  }
>>
>> -static void ext4_free_data_in_buddy(struct super_block *sb,
>> -				    struct ext4_free_data *entry)
>> +static void ext4_discard_work(struct work_struct *work)
>>  {
>> +	struct ext4_sb_info *sbi = container_of(work,
>> +			struct ext4_sb_info, s_discard_work);
>> +	struct super_block *sb = sbi->s_sb;
>> +	ext4_group_t ngroups = ext4_get_groups_count(sb);
>> +	struct ext4_group_info *grp;
>> +	struct ext4_free_data *fd, *nfd;
>>  	struct ext4_buddy e4b;
>> -	struct ext4_group_info *db;
>> -	int err, count = 0, count2 = 0;
>> +	int i, err;
>> +
>> +	for (i = 0; i < ngroups; i++) {
>> +		grp = ext4_get_group_info(sb, i);
>> +		if (RB_EMPTY_ROOT(&grp->bb_discard_root))
>> +			continue;
>
> For large filesystems there may be millions of block groups, so it
> seems inefficient to scan all of the groups each time the work queue

Yes, it seems to be. When I cooked the patch, I assumed that since the
kworker runs in the background, it would not be a big deal.

> is run. Having a list of block group numbers, or bitmap/rbtree/xarray
> of the group numbers that need to be trimmed may be more efficient?

Maybe we can use a bitmap to record the block groups that need to be
trimmed; see the sketch at the end of this mail.

Best regards
Jianchao

> Most of the complexity in the rest of the patch goes away if the trim
> tracking is only done on a whole-group basis (min/max or just a single
> bit per group).
>
> Cheers, Andreas
>
>> -	mb_debug(sb, "gonna free %u blocks in group %u (0x%p):",
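
P.S. To make the fstrim-style sequence from the changelog concrete, here
is a minimal sketch (not the patch code itself) of trimming one
already-freed range in a group. The helpers ext4_mb_load_buddy(),
mb_mark_used(), mb_free_blocks() and ext4_issue_discard() all exist in
fs/ext4/mballoc.c, though their exact signatures and the locking and
range-iteration details in the real patch differ:

/*
 * Illustrative sketch only: mark a freed range busy, discard it,
 * then return it to the buddy system as free.
 */
static int ext4_trim_range_sketch(struct super_block *sb, ext4_group_t group,
				  ext4_grpblk_t start, ext4_grpblk_t count)
{
	struct ext4_free_extent ex;
	struct ext4_buddy e4b;
	int err;

	err = ext4_mb_load_buddy(sb, group, &e4b);
	if (err)
		return err;

	ext4_lock_group(sb, group);
	/* 1. Mark the blocks as used so nobody can allocate them. */
	ex.fe_group = group;
	ex.fe_start = start;
	ex.fe_len = count;
	mb_mark_used(&e4b, &ex);
	ext4_unlock_group(sb, group);

	/* 2. Issue the discard without holding the group lock. */
	err = ext4_issue_discard(sb, group, start, count);

	/* 3. Return the blocks to the buddy system as free again. */
	ext4_lock_group(sb, group);
	mb_free_blocks(NULL, &e4b, start, count);
	ext4_unlock_group(sb, group);

	ext4_mb_unload_buddy(&e4b);
	return err;
}

The point is that the range stays marked busy in the buddy bitmap while
the discard is in flight, so the allocator can't hand it out and the
jbd2 commit kthread never has to wait on the discard.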
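
And a hypothetical sketch of the bitmap idea. s_discard_bitmap is an
invented ext4_sb_info field used only for illustration (it is not in
the posted patch), and clearing a bit before trimming the group would
race with new insertions, so a real implementation would need to close
that window, e.g. under the group lock:

/*
 * Hypothetical: sbi->s_discard_bitmap would be a new field, one bit
 * per block group, allocated at mount time.  The producer side would
 * do set_bit(group, sbi->s_discard_bitmap) wherever a free extent is
 * queued on grp->bb_discard_root, before kicking s_discard_work.
 */
static void ext4_discard_work_sketch(struct work_struct *work)
{
	struct ext4_sb_info *sbi = container_of(work,
			struct ext4_sb_info, s_discard_work);
	struct super_block *sb = sbi->s_sb;
	ext4_group_t ngroups = ext4_get_groups_count(sb);
	struct ext4_group_info *grp;
	unsigned int i;

	/* Visit only the groups that actually have queued extents. */
	for_each_set_bit(i, sbi->s_discard_bitmap, ngroups) {
		clear_bit(i, sbi->s_discard_bitmap);
		grp = ext4_get_group_info(sb, i);
		/* ... trim the entries on grp->bb_discard_root ... */
	}
}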