Hi Ritesh, On 20.05.2020 08:40, Ritesh Harjani wrote: > There could be a race in function ext4_mb_discard_group_preallocations() > where the 1st thread may iterate through group's bb_prealloc_list and > remove all the PAs and add to function's local list head. > Now if the 2nd thread comes in to discard the group preallocations, > it will see that the group->bb_prealloc_list is empty and will return 0. > > Consider for a case where we have less number of groups > (for e.g. just group 0), > this may even return an -ENOSPC error from ext4_mb_new_blocks() > (where we call for ext4_mb_discard_group_preallocations()). > But that is wrong, since 2nd thread should have waited for 1st thread > to release all the PAs and should have retried for allocation. > Since 1st thread was anyway going to discard the PAs. > > The algorithm using this percpu seq counter goes below: > 1. We sample the percpu discard_pa_seq counter before trying for block > allocation in ext4_mb_new_blocks(). > 2. We increment this percpu discard_pa_seq counter when we either allocate > or free these blocks i.e. while marking those blocks as used/free in > mb_mark_used()/mb_free_blocks(). > 3. We also increment this percpu seq counter when we successfully identify > that the bb_prealloc_list is not empty and hence proceed for discarding > of those PAs inside ext4_mb_discard_group_preallocations(). > > Now to make sure that the regular fast path of block allocation is not > affected, as a small optimization we only sample the percpu seq counter > on that cpu. Only when the block allocation fails and when freed blocks > found were 0, that is when we sample percpu seq counter for all cpus using > below function ext4_get_discard_pa_seq_sum(). This happens after making > sure that all the PAs on grp->bb_prealloc_list got freed or if it's empty. > > It can be well argued that why don't just check for grp->bb_free to > see if there are any free blocks to be allocated. So here are the two > concerns which were discussed:- > > 1. If for some reason the blocks available in the group are not > appropriate for allocation logic (say for e.g. > EXT4_MB_HINT_GOAL_ONLY, although this is not yet implemented), then > the retry logic may result into infinte looping since grp->bb_free is > non-zero. > > 2. Also before preallocation was clubbed with block allocation with the > same ext4_lock_group() held, there were lot of races where grp->bb_free > could not be reliably relied upon. > Due to above, this patch considers discard_pa_seq logic to determine if > we should retry for block allocation. Say if there are are n threads > trying for block allocation and none of those could allocate or discard > any of the blocks, then all of those n threads will fail the block > allocation and return -ENOSPC error. (Since the seq counter for all of > those will match as no block allocation/discard was done during that > duration). > > Signed-off-by: Ritesh Harjani <riteshh@xxxxxxxxxxxxx> This patch landed in yesterday's linux-next and causes following WARNING/BUG on various Samsung Exynos-based boards: BUG: using smp_processor_id() in preemptible [00000000] code: logsave/552 caller is ext4_mb_new_blocks+0x404/0x1300 CPU: 3 PID: 552 Comm: logsave Tainted: G W 5.7.0-next-20200602 #4 Hardware name: Samsung Exynos (Flattened Device Tree) [<c011184c>] (unwind_backtrace) from [<c010d250>] (show_stack+0x10/0x14) [<c010d250>] (show_stack) from [<c05185fc>] (dump_stack+0xbc/0xe8) [<c05185fc>] (dump_stack) from [<c0ab689c>] (check_preemption_disabled+0xec/0xf0) [<c0ab689c>] (check_preemption_disabled) from [<c03a7b38>] (ext4_mb_new_blocks+0x404/0x1300) [<c03a7b38>] (ext4_mb_new_blocks) from [<c03780f4>] (ext4_ext_map_blocks+0xc7c/0x10f4) [<c03780f4>] (ext4_ext_map_blocks) from [<c03902b4>] (ext4_map_blocks+0x118/0x5a0) [<c03902b4>] (ext4_map_blocks) from [<c0394524>] (mpage_map_and_submit_extent+0x134/0x9c0) [<c0394524>] (mpage_map_and_submit_extent) from [<c03958c8>] (ext4_writepages+0xb18/0xcb0) [<c03958c8>] (ext4_writepages) from [<c02588ec>] (do_writepages+0x20/0x94) [<c02588ec>] (do_writepages) from [<c024c688>] (__filemap_fdatawrite_range+0xac/0xcc) [<c024c688>] (__filemap_fdatawrite_range) from [<c024c700>] (filemap_flush+0x28/0x30) [<c024c700>] (filemap_flush) from [<c037eedc>] (ext4_release_file+0x70/0xac) [<c037eedc>] (ext4_release_file) from [<c02c3310>] (__fput+0xc4/0x234) [<c02c3310>] (__fput) from [<c014eb74>] (task_work_run+0x88/0xcc) [<c014eb74>] (task_work_run) from [<c010ca40>] (do_work_pending+0x52c/0x5cc) [<c010ca40>] (do_work_pending) from [<c0100094>] (slow_work_pending+0xc/0x20) Exception stack(0xec9c1fb0 to 0xec9c1ff8) 1fa0: 00000000 0044969c 0000006c 00000000 1fc0: 00000001 0045a014 00000241 00000006 00000000 be91abb4 be91abb0 0000000c 1fe0: 00459fd4 be91ab90 00448ed4 b6e43444 60000050 00000003 Please let me know how I can help debugging this issue. The above log is from linux-next 20200602 compiled from exynos_defconfig running on ARM 32bit Samsung Exynos4412-based Odroid U3 board, however I don't think this is Exynos specific issue. Probably I've observed it, because exynos_defconfig has most of the debugging options enabled. > ... Best regards -- Marek Szyprowski, PhD Samsung R&D Institute Poland