Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> writes:

> From: Alexei Starovoitov <ast@xxxxxxxxxx>
>
> The first request to support timers in bpf was made in 2013 before sys_bpf syscall
> was added. That use case was periodic sampling. It was addressed by attaching
> bpf programs to perf_events. Then during XDP development the timers were requested
> to do garbage collection and health checks. They were worked around by implementing
> timers in user space and triggering progs with BPF_PROG_RUN command.
> The user space timers and perf_event+bpf timers are not armed by the bpf program.
> They're done asynchronously vs program execution. The XDP program cannot send a
> packet and arm the timer at the same time. The tracing prog cannot record an
> event and arm the timer right away. This large class of use cases remained
> unaddressed. The jiffy based and hrtimer based timers are essential part of the
> kernel development and with this patch set the hrtimer based timers will be
> available to bpf programs.
>
> TLDR: bpf timers is a wrapper of hrtimers with all the extra safety added
> to make sure bpf progs cannot crash the kernel.
>
> v2->v3:
> The v2 approach attempted to bump bpf_prog refcnt when bpf_timer_start is
> called to make sure callback code doesn't disappear when timer is active and
> drop refcnt when timer cb is done. That led to a ton of race conditions between
> callback running and concurrent bpf_timer_init/start/cancel on another cpu,
> and concurrent bpf_map_update/delete_elem, and map destroy.
>
> Then v2.5 approach skipped prog refcnt altogether. Instead it remembered all
> timers that bpf prog armed in a linked list and canceled them when prog refcnt
> went to zero. The race conditions disappeared, but timers in map-in-map could
> not be supported cleanly, since timers in inner maps have inner map's life time
> and don't match prog's life time.
>
> This v3 approach makes timers to be owned by maps. It allows timers in inner
> maps to be supported from the start. This approach relies on "user refcnt"
> scheme used in prog_array that stores bpf programs for bpf_tail_call. The
> bpf_timer_start() increments prog refcnt, but unlike 1st approach the timer
> callback does decrement the refcnt. The ops->map_release_uref is
> responsible for cancelling the timers and dropping prog refcnt when user space
> reference to a map is dropped. That addressed all the races and simplified
> locking.

Great to see this! I missed v2, but the "owned by map + uref" approach makes
sense.

For the series:

Acked-by: Toke Høiland-Jørgensen <toke@xxxxxxxxxx>