Re: XDP/eBPF map thread safety in kernel (e.g. lookup/delete)

Toke Høiland-Jørgensen <toke@xxxxxxxxxx> · Fri, 25 Oct 2024 13:06:57 +0200

Vadim Goncharov <vadimnuclight@xxxxxxxxx> writes:

>> > OK, no crash is half of good, thanks. But how'd I lock it from
>> > deletion? A "concurrent updates" are usually about memory area that
>> > still persist as a whole, but in my particular case it's a
>> > bpf_timer callback who does bpf_map_delete_elem(). This is a
>> > conntrack-like map - an entry contains some information about
>> > connection, a struct bpf_timer and expire field when this entry
>> > should be deleted due to inactivity, based on information in a
>> > connection (thus *_LRU map types are not suitable for me). So it's
>> > possible for a race condition here:
>> >
>> > 1.  Thread 1 receives packet, does bpf_map_lookup_elem() and begins
>> >     processing, at end of which bpf_timer_start() will be called to
>> >     reschedule removal into future.
>> >
>> > 2.  At moment after lookup, timer callback fires and does
>> >     bpf_map_delete_elem() while first thread is not yet finished.
>> >
>> > So how do I prevent connection record loss in such scenario?  
>> 
>> Yeah, there is no protection against this; you will have to do your
>> own locking to prevent it. Something like putting a spinlock into
>> your data structure and using that to synchronise access, maybe?
>
> Well, if I'll put bpf_spin_lock into map element, then it looks like
> the following scenario is possible:
>
> 1. Thread 1 receives packet, does bpf_map_lookup_elem(map, key)
>
>      1a. Timer callback(map, same_key) does bpf_spin_lock on same_key
>
> 2. Thread 2 does bpf_spin_lock on same_key and waits.
>
>      2a. Timer callback sees lock successful, sets DELETED flag in entry
>          and bpf_map_delete_elem()'s it.
>
> 3. Thread 1 unlocked and either updates map entry directly by pointer,
>     or reinserts new entry reinitializing timer.

Yeah, something like this.

> Am I missing some race condition still present here? atomics or
> something

Unfortunately, I think the delete and update can still race. Deletion is
based on the key only, there is no cmpxchg() semantics. So I don't see
any reason why this sequence of events couldn't happen:

- T2 sets the DELETED flag
- T2 releases the lock, and then gets preempted
- T1 grabs the lock, sees the deleted flag, allocates a new item and
  inserts it with update()
- T2 wakes back up and does a bpf_map_delete() with the expired key,
  which then ends up deleting the new entry

The only way I can think of to avoid this, is if the timer callback uses
bpf_map_lookup_and_delete_item(), then rechecks the value and re-inserts
it if the DELETED flag disappeared between operations. But this still
leaves a window where the entry doesn't exist.

Maybe someone else has a good idea for how to avoid this being racy?

-Toke