Re: [PATCHv5 3/5] ext4: mballoc: Introduce pcpu seqcnt for freeing PA to improve ENOSPC handling

Marek Szyprowski <m.szyprowski@xxxxxxxxxxx> · Wed, 3 Jun 2020 08:48:51 +0200

Hi Ritesh,

On 20.05.2020 08:40, Ritesh Harjani wrote:
> There could be a race in function ext4_mb_discard_group_preallocations()
> where the 1st thread may iterate through group's bb_prealloc_list and
> remove all the PAs and add to function's local list head.
> Now if the 2nd thread comes in to discard the group preallocations,
> it will see that the group->bb_prealloc_list is empty and will return 0.
>
> Consider for a case where we have less number of groups
> (for e.g. just group 0),
> this may even return an -ENOSPC error from ext4_mb_new_blocks()
> (where we call for ext4_mb_discard_group_preallocations()).
> But that is wrong, since 2nd thread should have waited for 1st thread
> to release all the PAs and should have retried for allocation.
> Since 1st thread was anyway going to discard the PAs.
>
> The algorithm using this percpu seq counter goes below:
> 1. We sample the percpu discard_pa_seq counter before trying for block
>     allocation in ext4_mb_new_blocks().
> 2. We increment this percpu discard_pa_seq counter when we either allocate
>     or free these blocks i.e. while marking those blocks as used/free in
>     mb_mark_used()/mb_free_blocks().
> 3. We also increment this percpu seq counter when we successfully identify
>     that the bb_prealloc_list is not empty and hence proceed for discarding
>     of those PAs inside ext4_mb_discard_group_preallocations().
>
> Now to make sure that the regular fast path of block allocation is not
> affected, as a small optimization we only sample the percpu seq counter
> on that cpu. Only when the block allocation fails and when freed blocks
> found were 0, that is when we sample percpu seq counter for all cpus using
> below function ext4_get_discard_pa_seq_sum(). This happens after making
> sure that all the PAs on grp->bb_prealloc_list got freed or if it's empty.
>
> It can be well argued that why don't just check for grp->bb_free to
> see if there are any free blocks to be allocated. So here are the two
> concerns which were discussed:-
>
> 1. If for some reason the blocks available in the group are not
>     appropriate for allocation logic (say for e.g.
>     EXT4_MB_HINT_GOAL_ONLY, although this is not yet implemented), then
>     the retry logic may result into infinte looping since grp->bb_free is
>     non-zero.
>
> 2. Also before preallocation was clubbed with block allocation with the
>     same ext4_lock_group() held, there were lot of races where grp->bb_free
>     could not be reliably relied upon.
> Due to above, this patch considers discard_pa_seq logic to determine if
> we should retry for block allocation. Say if there are are n threads
> trying for block allocation and none of those could allocate or discard
> any of the blocks, then all of those n threads will fail the block
> allocation and return -ENOSPC error. (Since the seq counter for all of
> those will match as no block allocation/discard was done during that
> duration).
>
> Signed-off-by: Ritesh Harjani <riteshh@xxxxxxxxxxxxx>

This patch landed in yesterday's linux-next and causes following 
WARNING/BUG on various Samsung Exynos-based boards:

  BUG: using smp_processor_id() in preemptible [00000000] code: logsave/552
  caller is ext4_mb_new_blocks+0x404/0x1300
  CPU: 3 PID: 552 Comm: logsave Tainted: G        W 5.7.0-next-20200602 #4
  Hardware name: Samsung Exynos (Flattened Device Tree)
  [<c011184c>] (unwind_backtrace) from [<c010d250>] (show_stack+0x10/0x14)
  [<c010d250>] (show_stack) from [<c05185fc>] (dump_stack+0xbc/0xe8)
  [<c05185fc>] (dump_stack) from [<c0ab689c>] 
(check_preemption_disabled+0xec/0xf0)
  [<c0ab689c>] (check_preemption_disabled) from [<c03a7b38>] 
(ext4_mb_new_blocks+0x404/0x1300)
  [<c03a7b38>] (ext4_mb_new_blocks) from [<c03780f4>] 
(ext4_ext_map_blocks+0xc7c/0x10f4)
  [<c03780f4>] (ext4_ext_map_blocks) from [<c03902b4>] 
(ext4_map_blocks+0x118/0x5a0)
  [<c03902b4>] (ext4_map_blocks) from [<c0394524>] 
(mpage_map_and_submit_extent+0x134/0x9c0)
  [<c0394524>] (mpage_map_and_submit_extent) from [<c03958c8>] 
(ext4_writepages+0xb18/0xcb0)
  [<c03958c8>] (ext4_writepages) from [<c02588ec>] (do_writepages+0x20/0x94)
  [<c02588ec>] (do_writepages) from [<c024c688>] 
(__filemap_fdatawrite_range+0xac/0xcc)
  [<c024c688>] (__filemap_fdatawrite_range) from [<c024c700>] 
(filemap_flush+0x28/0x30)
  [<c024c700>] (filemap_flush) from [<c037eedc>] 
(ext4_release_file+0x70/0xac)
  [<c037eedc>] (ext4_release_file) from [<c02c3310>] (__fput+0xc4/0x234)
  [<c02c3310>] (__fput) from [<c014eb74>] (task_work_run+0x88/0xcc)
  [<c014eb74>] (task_work_run) from [<c010ca40>] 
(do_work_pending+0x52c/0x5cc)
  [<c010ca40>] (do_work_pending) from [<c0100094>] 
(slow_work_pending+0xc/0x20)
  Exception stack(0xec9c1fb0 to 0xec9c1ff8)
  1fa0:                                     00000000 0044969c 0000006c 
00000000
  1fc0: 00000001 0045a014 00000241 00000006 00000000 be91abb4 be91abb0 
0000000c
  1fe0: 00459fd4 be91ab90 00448ed4 b6e43444 60000050 00000003

Please let me know how I can help debugging this issue. The above log is 
from linux-next 20200602 compiled from exynos_defconfig running on ARM 
32bit Samsung Exynos4412-based Odroid U3 board, however I don't think 
this is Exynos specific issue. Probably I've observed it, because 
exynos_defconfig has most of the debugging options enabled.

 > ...

Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland