On Mon, 8 Aug 2022 at 18:19, Yonghong Song <yhs@xxxxxx> wrote:
>
>
>
> On 8/8/22 4:18 AM, Kumar Kartikeya Dwivedi wrote:
> > On Mon, 8 Aug 2022 at 08:09, Yonghong Song <yhs@xxxxxx> wrote:
> >>
> >>
> >>
> >> On 8/5/22 6:46 PM, Kumar Kartikeya Dwivedi wrote:
> >>> The LRU map that is preallocated may have its elements reused while
> >>> another program holds a pointer to it from bpf_map_lookup_elem. Hence,
> >>> only check_and_free_fields is appropriate when the element is being
> >>> deleted, as it ensures proper synchronization against concurrent access
> >>> of the map value. After that, we cannot call check_and_init_map_value
> >>> again as it may rewrite bpf_spin_lock, bpf_timer, and kptr fields while
> >>> they can be concurrently accessed from a BPF program.
> >>
> >> If I understand correctly, one lru_node gets freed and pushed to free
> >> list without doing check_and_free_fields().
> >
> > I don't think that's true, there is a check_and_free_fields call on
> > deletion right before bpf_lru_push_free in htab_lru_push_free.
> > Then once the preallocated items are freed on map destruction, we free
> > timers and kptrs again, so if someone has access to preallocated items
> > after freeing e.g. through an earlier lookup, we still release
> > resources they may have created at the end of map's lifetime.
> >
> >> If later the same node is used with bpf_map_update_elem() and
> >> prealloc_lru_pop() is called, then with this patch,
> >> check_and_init_map_value() is not called, so the new node may contain
> >> leftover values for kptr/timer/spin_lock, could this cause a problem?
> >>
> >
> > This can only happen once you touch kptr/timer/spin_lock after
> > deletion's check_and_free_fields call, but the program obtaining the
> > new item will see and be able to handle that case. The timer helpers
> > handle if an existing timer exists, kptr_xchg returns the old pointer
> > as a reference you must release.
> > For unreferenced kptr, it is marked
> > as PTR_UNTRUSTED so a corrupted pointer value is possible but not
> > fatal. If spin_lock is locked on lookup, then the other CPU having
> > access to the deleted-but-now-reallocated item will eventually call
> > unlock.
>
> Thanks for the explanation. Originally I thought we should clear
> everything including spin_lock before putting the deleted lru_node on
> the free list. check_and_free_fields() only does this for kptr/timer,
> not for spin_lock.
>
> But it looks like we should not clear spin_lock before pushing the
> deleted nodes to the free list, since the lookup side may hold a
> reference to the map value and may have done a bpf_spin_lock() call.
> And we should not clear spin_lock fields in check_and_free_fields(),
> nor in prealloc_lru_pop() in map_update. Otherwise, we may have
> issues with bpf_spin_unlock() on the lookup side.
>
> It looks like timer and kptr have already been handled for such
> cases (concurrency between map_lookup() and clearing some map_value
> fields for timer/kptr).

Yes, I also took a look again at other call sites of
check_and_init_map_value and everything looks sane.

> >
> > It is very much expected, IIUC, that someone else may use-after-free
> > deleted items of hashtab.c maps in case of preallocation. It can be
> > considered similar to how SLAB_TYPESAFE_BY_RCU behaves.
> >
> >> To address the above rewrite issue, maybe the solution should be
> >> to push the deleted lru_nodes back to free list only after
> >> rcu_read_unlock() is done?
> >
> > Please correct me if I'm wrong, but I don't think this is a good idea.
> > Delaying preallocated item reuse for an RCU grace period will greatly
> > increase the probability of running out of preallocated items under
> > load, even though technically those items are free for use.
>
> Agree. This is not a good idea. It increases the life cycle of
> preallocated item reuse and will cause performance and resource
> consumption issues.
> >
> > Side note: I found the problem this patch fixes while reading the
> > code, because I am running into this exact problem with my WIP skip
> > list map implementation, where in the preallocated case, to make
> > things a bit easier for the lockless lookup, I delay reuse of items
> > until an RCU grace period passes (so that the deleted -> unlinked
> > transition does not happen during traversal), but I'm easily able to
> > come up with scenarios (producer/consumer situations) where that leads
> > to exhaustion of the preallocated memory (and if not that, performance
> > degradation on updates because pcpu_freelist now needs to search other
> > CPU's caches more often).
> >
> > BTW, this would be fixed if we could simply use Alexei's per-CPU cache
> > based memory allocator instead of preallocating items, because then
> > the per-CPU cache gets replenished when it goes below the watermark
> > (and also frees stuff back to the kernel allocator above the high
> > watermark, which is great for producer/consumer cases with alloc/free
> > imbalance), so we could do the RCU delays similar to the kmalloc case
> > without running into the memory exhaustion problem.
>
> Thanks for the explanation. So okay, the patch looks good to me.
>
> >>> Fixes: 68134668c17f ("bpf: Add map side support for bpf timers.")
> >>> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@xxxxxxxxx>
>
> Acked-by: Yonghong Song <yhs@xxxxxx>

Thanks! I'll summarize our discussion in short and add it to the commit log.