Re: [PATCH bpf v1 2/3] bpf: Don't reinit map value in prealloc_lru_pop

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, 8 Aug 2022 at 18:19, Yonghong Song <yhs@xxxxxx> wrote:
>
>
>
> On 8/8/22 4:18 AM, Kumar Kartikeya Dwivedi wrote:
> > On Mon, 8 Aug 2022 at 08:09, Yonghong Song <yhs@xxxxxx> wrote:
> >>
> >>
> >>
> >> On 8/5/22 6:46 PM, Kumar Kartikeya Dwivedi wrote:
> >>> The LRU map that is preallocated may have its elements reused while
> >>> another program holds a pointer to it from bpf_map_lookup_elem. Hence,
> >>> only check_and_free_fields is appropriate when the element is being
> >>> deleted, as it ensures proper synchronization against concurrent access
> >>> of the map value. After that, we cannot call check_and_init_map_value
> >>> again as it may rewrite bpf_spin_lock, bpf_timer, and kptr fields while
> >>> they can be concurrently accessed from a BPF program.
> >>
> >> If I understand correctly, one lru_node gets freed and pushed to free
> >> list without doing check_and_free_fields().
> >
> > I don't think that's true, there is a check_and_free_fields call on
> > deletion right before bpf_lru_push_free in htab_lru_push_free.
> > Then once the preallocated items are freed on map destruction, we free
> > timers and kptrs again, so if someone has access to preallocated items
> > after freeing e.g. through an earlier lookup, we still release
> > resources they may have created at the end of map's lifetime.
> >
> >> If later the same node is used with bpf_map_update_elem() and
> >> prealloc_lru_pop() is called, then with this patch,
> >> check_and_init_map_value() is not called, so the new node may contain
> >> leftover values for kptr/timer/spin_lock, could this cause a problem?
> >>
> >
> > This can only happen once you touch kptr/timer/spin_lock after
> > deletion's check_and_free_fields call, but the program obtaining the
> > new item will see and be able to handle that case. The timer helpers
> > handle if an existing timer exists, kptr_xchg returns the old pointer
> > as a reference you must release. For unreferenced kptr, it is marked
> > as PTR_UNTRUSTED so a corrupted pointer value is possible but not
> > fatal. If spin_lock is locked on lookup, then the other CPU having
> > access to deleted-but-now-reallocated item will eventually call
> > unlock.
>
> Thanks for explanation. Originally I think we should clear everything
> including spin_lock before putting the deleted lru_node to free list.
> check_and_free_fields() only did for kptr/timer but not spin_lock.
>
> But looks like we should not delete spin_lock before pushing the
> deleted nodes to free list since lookup side may hold a reference
> to the map value and it may have done a bpf_spin_lock() call.
> And we should not clear spin_lock fields in check_and_free_fields()
> and neither in prealloc_lru_pop() in map_update. Otherwise, we
> may have issues for bpf_spin_unlock() on lookup side.
>
> It looks timer and kptr are already been handled for such
> cases (concurrency between map_lookup() and clearing some map_value
> fields for timer/kptr).
>

Yes, I also took a look again at other call sites of
check_and_init_map_value and everything looks sane.

> >
> > It is very much expected, IIUC, that someone else may use-after-free
> > deleted items of hashtab.c maps in case of preallocation. It can be
> > considered similar to how SLAB_TYPESAFE_BY_RCU behaves.
> >
> >> To address the above rewrite issue, maybe the solution should be
> >> to push the deleted lru_nodes back to free list only after
> >> rcu_read_unlock() is done?
> >
> > Please correct me if I'm wrong, but I don't think this is a good idea.
> > Delaying preallocated item reuse for a RCU grace period will greatly
> > increase the probability of running out of preallocated items under
> > load, even though technically those items are free for use.
>
> Agree. This is not a good idea. It increased life cycle for preallocated
> item reuse and will have some performance issue and resource consumption
> issue.
>
> >
> > Side note: I found the problem this patch fixes while reading the
> > code, because I am running into this exact problem with my WIP skip
> > list map implementation, where in the preallocated case, to make
> > things a bit easier for the lockless lookup, I delay reuse of items
> > until an RCU grace period passes (so that the deleted -> unlinked
> > transition does not happen during traversal), but I'm easily able to
> > come up with scenarios (producer/consumer situations) where that leads
> > to exhaustion of the preallocated memory (and if not that, performance
> > degradation on updates because pcpu_freelist now needs to search other
> > CPU's caches more often).
> >
> > BTW, this would be fixed if we could simply use Alexei's per-CPU cache
> > based memory allocator instead of preallocating items, because then
> > the per-CPU cache gets replenished when it goes below the watermark
> > (and also frees stuff back to kernel allocator above the high
> > watermark, which is great for producer/consumer cases with alloc/free
> > imbalance), so we can do the RCU delays similar to kmalloc case
> > without running into the memory exhaustion problem.
>
> Thanks for explanation. So okay the patch looks good to me.
>
> >
> >>
> >>>
> >>> Fixes: 68134668c17f ("bpf: Add map side support for bpf timers.")
> >>> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@xxxxxxxxx>
>
> Acked-by: Yonghong Song <yhs@xxxxxx>
>

Thanks! I'll summarize our discussion in short and add it to the commit log.



[Index of Archives]     [Linux Samsung SoC]     [Linux Rockchip SoC]     [Linux Actions SoC]     [Linux for Synopsys ARC Processors]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]


  Powered by Linux