Re: bpf timer design

Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> · Tue, 9 Mar 2021 17:19:05 -0800

On Mon, Mar 08, 2021 at 04:11:40PM -0800, Cong Wang wrote:
> Hi, all
> 
> I have been thinking about eBPF timer APIs, it looks harder than I thought.
> 
> The API's themselves are not hard, here is what I have:
> 
> int bpf_timer_setup(struct bpf_timer *timer, void *callback_fn,
>                     void *callback_ctx, u64 flags);
> int bpf_timer_mod(struct bpf_timer *timer, u64 expires);
> int bpf_timer_del(struct bpf_timer *timer);
> 
> which is pretty much similar to the current kernel timer API's.
> 
> The struct bpf_timer is a bit tricky, we still have to save the kernel timer
> there but we do not want eBPF programs to touch it. So I simply save a pointer
> to kernel timer inside:
> 
> struct bpf_timer {
>        void *ptr;
> };
> 
> but we obviously have to prevent eBPF programs from dereferencing it
> with the verifier.
> 
> The hardest part is on the verifier side, we have to check:
> 
> 1. Whether a timer is initialized before use. For example:
> 
> struct bpf_timer t;
> bpf_timer_mod(&t, bpf_jiffies64() + HZ);

relatively easy. see below.

> 2. Whether a timer is still active before exiting. For example:

also easy to do, but "must do bpf_timer_del before exiting" restriction
is probably too limiting to be usable.

> struct bpf_timer t;
> bpf_setup_timer(&t, ....);
> bpf_timer_mod(&t, bpf_jiffies64() + HZ);
> //end of the eBPF program
> 
> I do not see any existing mechanism for checks like these, so potentially need
> a lot of work.

There are two ways to handle the ordering constraints:
- bpf_spin_lock style which is simple.
- may_be_acquire_function/is_release_function which is more advanced.
but the ordering issue is a small one comparing to the issue of life time of
the struct bpf_timer.
I'm assuming that it will be related one to one with struct timer_list.
Ideally the api would be similar to kernel and struct bpf_spin_lock
demonstrated how it can be done. Unlike bpf_spin_lock which is
always unlocked by the time program ends the bpf_timer will be still
enqueued in the timer wheel when prog exits. So its life time should
be separate from the program execution life time.
The only bpf concept with such properties is a bpf map.
Therefore one the ways would be to do each timer as a map element.
Both array of timers and hash of timers would be needed.
The map_lookup_elem would return a pointer to opaque struct bpf_timer.
And then bpf_timer_init/mod/del can operate on that pointer.
The verifier can be taught to check that timer_init() is called
before timer_mod(), but it's simple enough to do in run-time.
The timer_mod() operation is expensive. Few run-time checks
will be negligible. For example bpf_time_mod() can check that
'function' pointer was not-NULL. Otherwise run-time EINVAL.
Since struct bpf_timer is in a map, it's either zero-inited
at element creation time or bpf_timer_init() was called on it.
The bpf_spin_lock has a limitation that it can only be one per map
element which allowed to simplify the verifier code a lot.
I'm suggesting to use the same restriction for bpf_timer.
Implementation wise I hope it won't be as hard coded as process_spin_lock().
High level I'm proposing 'struct bpf_timer { u64 opaque; }'
as part of uapi/bpf.h.
The bpf program can place it inside normal array/hash maps.
The 'opaque' field will contain a pointer to dynamically allocated
'struct timer_list'.
The prog would do:
struct map_elem {
    int stuff;
    struct bpf_timer timer;
};

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 1);
    __type(key, int);
    __type(value, struct map_elem);
} hmap SEC(".maps");

static int timer_cb(struct map_elem *elem)
{
    if (whatever && elem->stuff)
        bpf_timer_mod(&elem->timer, new_expire);
}

int bpf_timer_test(...)
{
    struct map_elem *val;

    val = bpf_map_lookup_elem(&hmap, &key);
    if (val) {
        bpf_timer_init(&val->timer, timer_cb, flags);
        val->stuff = 123;
        bpf_timer_mod(&val->timer, expires);
    }
}

The map and prog destruction process would need to do del_timer()
on all map elements that have embedded struct bpf_timer before
freeing prog and map. Currently we don't have such constraint
with bpf_spin_lock, but it shouldn't be hard to add.
Similarly bpf_map_delete_elem() would need to do del_timer too.
bpf_map_update_elem can skip touching part of the value
that has struct bpf_timer. Again similar to bpf_spin_lock.
See copy_map_value.
Too really simplify the implementation we can restrict that
either bpf_spin_lock or bpf_timer can be inside the element.
(not both at the same time).

Of course there are other ways to design bpf_timer api.
I think the main advantage of connecting bpf_timer with
a map element is the control of timer life time and
additional data that timer_cb() will receive.
The 'void *callback_ctx' in the beginning of your email has the same
life time issue. It's difficult to make it part of bpf_timer_init().
Instead bpf prog can store additional data into map element.
It's not as flexible as bpf_for_each_map_elem that can
take a pointer to stack, but with timers it's not trivial
to guarantee that the stack is valid at the time callback is fired.
I think it should be flexible enough timer api.