Hello Ritesh, On Fri 11-09-20 18:50:22, Ritesh Harjani wrote: > On 9/10/20 9:47 PM, Jan Kara wrote: > > > --- > > > fs/ext4/mballoc.c | 4 +++- > > > 1 file changed, 3 insertions(+), 1 deletion(-) > > > > > > diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c > > > index f386fe62727d..fd55264dc3fe 100644 > > > --- a/fs/ext4/mballoc.c > > > +++ b/fs/ext4/mballoc.c > > > @@ -4191,7 +4191,6 @@ ext4_mb_discard_group_preallocations(struct super_block *sb, > > > INIT_LIST_HEAD(&list); > > > repeat: > > > ext4_lock_group(sb, group); > > > - this_cpu_inc(discard_pa_seq); > > > list_for_each_entry_safe(pa, tmp, > > > &grp->bb_prealloc_list, pa_group_list) { > > > spin_lock(&pa->pa_lock); > > > @@ -4233,6 +4232,9 @@ ext4_mb_discard_group_preallocations(struct super_block *sb, > > > goto out; > > > } > > > + /* only increase when find reallocation to discard */ > > > + this_cpu_inc(discard_pa_seq); > > > + > > > > This is a good place to increment the counter but I think you also need to > > handle the case: > > > > if (free < needed && busy) { > > busy = 0; > > ext4_unlock_group(sb, group); > > cond_resched(); > > goto repeat; > > } > > > > We can unlock the group here after removing some preallocations and thus > > other processes checking discard_pa_seq could miss we did this. In fact I > > think the code is somewhat buggy here and we should also discard extents > > accumulated on "list" so far before unlocking the group. Ritesh? > > > > mmm, so even though this code is not discarding those buffers b4 > unlocking, but it has still removed those from the grp->bb_prealloc_list > and added it to the local list. And since it will at some point get > scheduled and start operating from repeat: label so functionally wise > this should be ok. Am I missing anything? Yes, this function itself will be working correctly. But if we unlock the group before discarding preallocations we have in our local list the following can happen: TASK0 TASK1 ext4_mb_new_blocks() - no free block found ext4_mb_discard_preallocations_should_retry() ext4_mb_discard_preallocations() ext4_mb_discard_group_preallocations() ext4_lock_group(sb, group); this_cpu_inc(discard_pa_seq); scans group, moves some preallocations to local list if (free < needed && busy) { ext4_unlock_group(sb, group); - finds no pa to discard seq_retry = ext4_get_discard_pa_seq_sum() - ok seq_retry != seq -> retry - new search still didn't return anything because preallocations are sitting in TASK0 local list and are not freed yet > Although I agree, that if we remove at least the current pa's before > unlocking the group may be a good idea, but we should also check > why was this done like this at the first place. If we don't care about somewhat premature ENOSPC error, then not discarding preallocations before unlocking the group is not a problem. That's why previously it was done like this I believe. > I agree with Jan, that we should increment discard_pa_seq once we > actually have something to discard. I should have written a comment here > to explain why we did this here. But I think commit msg should have all > the history (since I have a habit of writing long commit msgs ;) > > But IIRC, it was done since in case if there is a parallel thread which > is discarding all the preallocations so the current thread may return 0 > since it checks the list_empty(&grp->bb_prealloc_list) check in > ext4_mb_discard_group_preallocations() and returns 0 directly. Ah, OK. I forgot about the unlocked check for empty bb_prealloc_list. > And why the discard_pa_seq counter at other places may not help since we > remove the pa nodes from grp->bb_prealloc_list into a local list and then > start operating on that. So meanwhile some thread may comes and just > checks that the list is empty and return 0 while some other thread may > start discarding from it's local list. So I guess the main problem was > that in the current code we remove the pa from grp->bb_prealloc_list and > add it to local list. So if someone else comes and checks that > grp->bb_prealloc_list is empty then it will directly return 0. > > So, maybe we could do something like this then? > > repeat: > ext4_lock_group(sb, group); > - this_cpu_inc(discard_pa_seq); > list_for_each_entry_safe(pa, tmp, > &grp->bb_prealloc_list, pa_group_list) {<...> > > + if (!free) > + this_cpu_inc(discard_pa_seq); // we should do this here before calling > list_del(&pa->pa_group_list); Yup, that looks good. > /* we can trust pa_free ... */ > free += pa->pa_free; > spin_unlock(&pa->pa_lock); > > list_del(&pa->pa_group_list); > list_add(&pa->u.pa_tmp_list, &list); > } > > I have some test cases around this to test for cases which were > failing. Since in world of parallelism you can't be 100% certain of some > corner case (like this one you just reported). > But, I don't have my other box rite now where I kept all of those - > due to some technical issues. I think I should be able to get those by > next week, if not, I anyways will setup my current machine for testing > this. Honza -- Jan Kara <jack@xxxxxxxx> SUSE Labs, CR