On 8/8/22 11:55 AM, Kumar Kartikeya Dwivedi wrote:
On Mon, 8 Aug 2022 at 18:19, Yonghong Song <yhs@xxxxxx> wrote:
On 8/8/22 4:18 AM, Kumar Kartikeya Dwivedi wrote:
On Mon, 8 Aug 2022 at 08:09, Yonghong Song <yhs@xxxxxx> wrote:
On 8/5/22 6:46 PM, Kumar Kartikeya Dwivedi wrote:
The elements of a preallocated LRU map may be reused while another
program still holds a pointer to an element obtained from
bpf_map_lookup_elem. Hence, only check_and_free_fields is appropriate
when the element is being deleted, as it ensures proper synchronization
against concurrent access to the map value. After that, we cannot call
check_and_init_map_value again, as it may rewrite bpf_spin_lock,
bpf_timer, and kptr fields while they are concurrently accessed from a
BPF program.
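For context, check_and_init_map_value() reinitializes the special
fields of the map value in place; roughly the following, simplified
from include/linux/bpf.h of that era (unlikely() annotations and exact
types omitted):

        static void check_and_init_map_value(struct bpf_map *map, void *dst)
        {
                if (map_value_has_spin_lock(map))
                        memset(dst + map->spin_lock_off, 0,
                               sizeof(struct bpf_spin_lock));
                if (map_value_has_timer(map))
                        memset(dst + map->timer_off, 0,
                               sizeof(struct bpf_timer));
                if (map_value_has_kptrs(map)) {
                        struct bpf_map_value_off *tab = map->kptr_off_tab;
                        int i;

                        /* zero every kptr slot recorded for this map */
                        for (i = 0; i < tab->nr_off; i++)
                                *(u64 *)(dst + tab->off[i].offset) = 0;
                }
        }

Running this on a reused element wipes out lock words, armed timers,
and kptr references that a concurrent reader may still be using.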
If I understand correctly, one lru_node gets freed and pushed to the
free list without doing check_and_free_fields().
I don't think that's true; there is a check_and_free_fields() call on
deletion, right before bpf_lru_push_free() in htab_lru_push_free().
Then, once the preallocated items are freed on map destruction, we free
timers and kptrs again, so if someone still has access to a
preallocated item after it was freed (e.g. through an earlier lookup),
we still release the resources they may have created at the end of the
map's lifetime.
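The LRU deletion path looks roughly like this (a simplified sketch of
kernel/bpf/hashtab.c, not verbatim):

        static void htab_lru_push_free(struct bpf_htab *htab,
                                       struct htab_elem *elem)
        {
                /* release timer/kptr resources before the node can be
                 * handed out again by prealloc_lru_pop()
                 */
                check_and_free_fields(htab, elem);
                bpf_lru_push_free(&htab->lru, &elem->lru_node);
        }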
If the same node is later used with bpf_map_update_elem() and
prealloc_lru_pop() is called, then with this patch
check_and_init_map_value() is not called, so the new node may contain
leftover values for kptr/timer/spin_lock. Could this cause a problem?
This can only matter once you touch kptr/timer/spin_lock after
deletion's check_and_free_fields call, and the program obtaining the
new item will see and be able to handle that case. The timer helpers
handle the case where a timer already exists; bpf_kptr_xchg returns the
old pointer as a reference you must release. An unreferenced kptr is
marked as PTR_UNTRUSTED, so a corrupted pointer value is possible but
not fatal. If spin_lock is locked on lookup, then the other CPU that
has access to the deleted-but-now-reallocated item will eventually
unlock it.
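As a sketch of how a program drains a leftover kptr on a possibly
reused element (the struct and release kfunc here are the ones used in
the kernel selftests; any kptr type with a registered release kfunc
works the same way):

        struct map_value {
                struct prog_test_ref_kfunc __kptr_ref *ptr;
        };

        /* v points into a possibly reused element after a lookup */
        struct prog_test_ref_kfunc *old;

        /* swap in NULL (or a new pointer); a leftover kptr from the
         * element's previous user comes back as a reference we own
         */
        old = bpf_kptr_xchg(&v->ptr, NULL);
        if (old)
                bpf_kfunc_call_test_release(old);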
Thanks for the explanation. Originally I thought we should clear
everything, including spin_lock, before putting the deleted lru_node on
the free list; check_and_free_fields() only does this for kptr/timer
but not spin_lock. But it looks like we should not clear spin_lock
before pushing deleted nodes to the free list, since the lookup side
may hold a reference to the map value and may have already done a
bpf_spin_lock() call. And we should not clear spin_lock fields in
check_and_free_fields(), nor in prealloc_lru_pop() during map_update;
otherwise we may have issues for bpf_spin_unlock() on the lookup side.
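Concretely, the interleaving to avoid would be something like this (a
hypothetical timeline, not taken from the kernel sources):

        CPU 0 (lookup side)             CPU 1 (delete/update side)
        -----------------------------   -----------------------------
        v = bpf_map_lookup_elem(...);
        bpf_spin_lock(&v->lock);
                                        bpf_map_delete_elem(...);
                                          clearing v->lock here would
                                          wipe out the lock word that
                                          CPU 0 currently holds
        bpf_spin_unlock(&v->lock);
          the unlock now stores to a
          field that may already be
          re-locked by the element's
          new owner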
It looks like timer and kptr are already handled for such cases
(concurrency between map_lookup() and clearing some map_value fields
for timer/kptr).
Yes, I also took a look again at other call sites of
check_and_init_map_value and everything looks sane.
Sounds good.
It is very much expected, IIUC, that someone else may see a
use-after-free of deleted items in hashtab.c maps in the preallocated
case. It can be considered similar to how SLAB_TYPESAFE_BY_RCU behaves:
objects may be freed and reused while readers still hold pointers to
them, as long as readers revalidate what they find.
To address the above rewrite issue, maybe the solution should be to
push the deleted lru_nodes back to the free list only after the lookup
side's rcu_read_unlock() is done, i.e. after an RCU grace period?
Please correct me if I'm wrong, but I don't think this is a good idea.
Delaying preallocated item reuse for an RCU grace period will greatly
increase the probability of running out of preallocated items under
load, even though technically those items are free for use.
Agree, this is not a good idea. It increases the life cycle for
preallocated item reuse and will cause performance and resource
consumption issues.
Side note: I found the problem this patch fixes while reading the code,
because I am running into this exact problem with my WIP skip list map
implementation. There, in the preallocated case, to make things a bit
easier for the lockless lookup, I delay reuse of items until an RCU
grace period passes (so that the deleted -> unlinked transition does
not happen during traversal). But I can easily come up with scenarios
(producer/consumer situations) where that leads to exhaustion of the
preallocated memory, and if not that, to performance degradation on
updates, because pcpu_freelist then needs to search other CPUs' caches
more often.
Thinking about it again, I guess the following scenario is possible:

        rcu_read_lock();
        v = bpf_map_lookup_elem(&map, &key);
        t1 = v->field;
        bpf_map_delete_elem(&map, &key);
        /* another bpf program triggers bpf_map_update_elem(), */
        /* the element with 'key' is reused, */
        /* and the value is updated */
        t2 = v->field;
        ...
        rcu_read_unlock();

It is possible that t1 and t2 are not the same. This should be an
extreme corner case; I am not sure how to resolve it easily without
performance degradation.
BTW, this would be fixed if we could simply use Alexei's per-CPU cache
based memory allocator instead of preallocating items, because then the
per-CPU cache gets replenished when it dips below a low watermark (and
also frees items back to the kernel allocator above the high watermark,
which is great for producer/consumer cases with an alloc/free
imbalance), so we could do the RCU delays as in the kmalloc case
without running into the memory exhaustion problem.
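To make the watermark idea concrete, here is a minimal userspace sketch
(all names and numbers are made up for illustration; this is not the
actual per-CPU cache allocator code):

        #include <stdlib.h>

        #define LOW_WMARK  8
        #define HIGH_WMARK 24
        #define BATCH      8

        struct obj { struct obj *next; };

        struct cache {
                struct obj *free;
                int cnt;
        };

        /* replenish from the underlying allocator in batches */
        static void refill(struct cache *c)
        {
                for (int i = 0; i < BATCH; i++) {
                        struct obj *o = malloc(sizeof(*o));

                        if (!o)
                                return;
                        o->next = c->free;
                        c->free = o;
                        c->cnt++;
                }
        }

        /* give surplus objects back to the underlying allocator */
        static void drain(struct cache *c)
        {
                while (c->cnt > HIGH_WMARK) {
                        struct obj *o = c->free;

                        c->free = o->next;
                        c->cnt--;
                        free(o);
                }
        }

        static struct obj *cache_alloc(struct cache *c)
        {
                struct obj *o;

                if (c->cnt < LOW_WMARK)
                        refill(c);
                o = c->free;
                if (o) {
                        c->free = o->next;
                        c->cnt--;
                }
                return o;
        }

        static void cache_free(struct cache *c, struct obj *o)
        {
                o->next = c->free;
                c->free = o;
                c->cnt++;
                if (c->cnt > HIGH_WMARK)
                        drain(c);
        }

With an alloc/free imbalance, the cache shrinks or grows back toward
the watermarks instead of pinning a fixed preallocation, so delaying
reuse for an RCU grace period no longer exhausts a fixed pool.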
Thanks for the explanation. Okay, the patch looks good to me.
Fixes: 68134668c17f ("bpf: Add map side support for bpf timers.")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@xxxxxxxxx>
Acked-by: Yonghong Song <yhs@xxxxxx>
Thanks! I'll briefly summarize our discussion and add it to the commit log.