On Mon, 8 Aug 2022 at 08:09, Yonghong Song <yhs@xxxxxx> wrote:
>
>
>
> On 8/5/22 6:46 PM, Kumar Kartikeya Dwivedi wrote:
> > The LRU map that is preallocated may have its elements reused while
> > another program holds a pointer to it from bpf_map_lookup_elem. Hence,
> > only check_and_free_fields is appropriate when the element is being
> > deleted, as it ensures proper synchronization against concurrent access
> > of the map value. After that, we cannot call check_and_init_map_value
> > again as it may rewrite bpf_spin_lock, bpf_timer, and kptr fields while
> > they can be concurrently accessed from a BPF program.
>
> If I understand correctly, one lru_node gets freed and pushed to free
> list without doing check_and_free_fields().

I don't think that's true; there is a check_and_free_fields call on
deletion, right before bpf_lru_push_free in htab_lru_push_free.

Then, once the preallocated items are freed on map destruction, we free
timers and kptrs again, so if someone still has access to preallocated
items after they were freed, e.g. through an earlier lookup, we still
release any resources they may have created at the end of the map's
lifetime.

> If later the same node is used with bpf_map_update_elem() and
> prealloc_lru_pop() is called, then with this patch,
> check_and_init_map_value() is not called, so the new node may contain
> leftover values for kptr/timer/spin_lock, could this cause a problem?
>

This can only happen once you touch kptr/timer/spin_lock after the
deletion's check_and_free_fields call, but the program obtaining the new
item will see and be able to handle that case (rough sketch after the
side note below).

The timer helpers handle the case where a timer already exists, and
kptr_xchg returns the old pointer as a reference you must release. For
an unreferenced kptr, the load is marked as PTR_UNTRUSTED, so a
corrupted pointer value is possible but not fatal. If the spin_lock is
found locked on lookup, the other CPU that still has access to the
deleted-but-now-reallocated item will eventually unlock it.

It is very much expected, IIUC, that someone else may use-after-free
deleted items of hashtab.c maps in the preallocated case. It can be
considered similar to how SLAB_TYPESAFE_BY_RCU behaves.

> To address the above rewrite issue, maybe the solution should be
> to push the deleted lru_nodes back to free list only after
> rcu_read_unlock() is done?

Please correct me if I'm wrong, but I don't think this is a good idea.
Delaying preallocated item reuse for an RCU grace period will greatly
increase the probability of running out of preallocated items under
load, even though technically those items are free for use.

Side note: I found the problem this patch fixes while reading the code,
because I am running into this exact problem with my WIP skip list map
implementation. There, in the preallocated case, to make things a bit
easier for the lockless lookup, I delay reuse of items until an RCU
grace period passes (so that the deleted -> unlinked transition does not
happen during traversal), but I can easily come up with scenarios
(producer/consumer situations) where that leads to exhaustion of the
preallocated memory (and if not that, performance degradation on
updates, because pcpu_freelist now needs to search other CPUs' caches
more often).
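To make the "program can handle leftover state" point a bit more
concrete, this is roughly the pattern I have in mind on the BPF program
side. Note this is only a sketch: struct my_obj, my_obj_acquire() and
my_obj_release() are made-up stand-ins for whatever object type and
acquire/release kfunc pair is actually in use; only bpf_map_lookup_elem
and bpf_kptr_xchg are real helpers here.

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

#ifndef __kptr_ref
#define __kptr_ref __attribute__((btf_type_tag("kptr_ref")))
#endif

struct my_obj;						/* hypothetical */
extern struct my_obj *my_obj_acquire(void) __ksym;	/* hypothetical */
extern void my_obj_release(struct my_obj *obj) __ksym;	/* hypothetical */

struct map_value {
	struct my_obj __kptr_ref *obj;
};

struct {
	__uint(type, BPF_MAP_TYPE_LRU_HASH);
	__uint(max_entries, 128);
	__type(key, int);
	__type(value, struct map_value);
} lru_map SEC(".maps");

SEC("tc")
int store_obj(struct __sk_buff *ctx)
{
	struct map_value *v;
	struct my_obj *obj, *old;
	int key = 0;

	v = bpf_map_lookup_elem(&lru_map, &key);
	if (!v)
		return 0;

	obj = my_obj_acquire();
	if (!obj)
		return 0;

	/* The element may be a reused preallocated item, so the kptr slot
	 * can already hold a reference left behind by its previous user.
	 * bpf_kptr_xchg hands that old reference back to us, and we must
	 * release it (the verifier enforces this).
	 */
	old = bpf_kptr_xchg(&v->obj, obj);
	if (old)
		my_obj_release(old);
	return 0;
}

char _license[] SEC("license") = "GPL";

Timers behave similarly: IIRC bpf_timer_init returns -EBUSY when the
reused value already holds an initialized timer, so the program notices
instead of silently clobbering it.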
BTW, this would be fixed if we could simply use Alexei's per-CPU cache
based memory allocator instead of preallocating items, because then the
per-CPU cache gets replenished when it goes below the watermark (and
also frees stuff back to the kernel allocator above the high watermark,
which is great for producer/consumer cases with an alloc/free
imbalance), so we could do the RCU delays similar to the kmalloc case
without running into the memory exhaustion problem (toy sketch of what
I mean at the end of this mail).

> >
> > Fixes: 68134668c17f ("bpf: Add map side support for bpf timers.")
> > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@xxxxxxxxx>
> > ---
> >  kernel/bpf/hashtab.c | 6 +-----
> >  1 file changed, 1 insertion(+), 5 deletions(-)
> >
> > diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> > index da7578426a46..4d793a92301b 100644
> > --- a/kernel/bpf/hashtab.c
> > +++ b/kernel/bpf/hashtab.c
> > @@ -311,12 +311,8 @@ static struct htab_elem *prealloc_lru_pop(struct bpf_htab *htab, void *key,
> >  	struct htab_elem *l;
> >
> >  	if (node) {
> > -		u32 key_size = htab->map.key_size;
> > -
> >  		l = container_of(node, struct htab_elem, lru_node);
> > -		memcpy(l->key, key, key_size);
> > -		check_and_init_map_value(&htab->map,
> > -					 l->key + round_up(key_size, 8));
> > +		memcpy(l->key, key, htab->map.key_size);
> >  		return l;
> >  	}
> >
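PS: purely to illustrate the low/high watermark behaviour I mean above,
a toy (userspace, single-threaded) version of such a cache could look
like the below. This is not the proposed allocator, just the general
shape: refill from the underlying allocator when the cache drops below
a low watermark, and give memory back once it grows past a high
watermark, so an alloc/free imbalance neither exhausts a fixed pool nor
hoards memory.

#include <stdlib.h>

#define LOW_WATERMARK	8
#define HIGH_WATERMARK	64
#define OBJ_SIZE	64	/* arbitrary object size for the toy */

struct cache_node { struct cache_node *next; };

struct obj_cache {
	struct cache_node *head;
	int cnt;
};

void *cache_alloc(struct obj_cache *c)
{
	struct cache_node *n;

	/* Replenish from the underlying allocator when running low,
	 * instead of failing like a fixed preallocated pool would.
	 */
	while (c->cnt < LOW_WATERMARK) {
		n = malloc(OBJ_SIZE);
		if (!n)
			break;
		n->next = c->head;
		c->head = n;
		c->cnt++;
	}

	n = c->head;
	if (!n)
		return NULL;
	c->head = n->next;
	c->cnt--;
	return n;
}

void cache_free(struct obj_cache *c, void *obj)
{
	struct cache_node *n = obj;

	n->next = c->head;
	c->head = n;
	c->cnt++;

	/* Give memory back when frees outpace allocations (the
	 * producer/consumer imbalance case).
	 */
	while (c->cnt > HIGH_WATERMARK) {
		n = c->head;
		c->head = n->next;
		c->cnt--;
		free(n);
	}
}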