On Tue, Apr 27, 2021 at 11:34 AM Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Tue, Apr 27, 2021 at 9:36 AM Cong Wang <xiyou.wangcong@xxxxxxxxx> wrote:
> >
> > If we enforce this ownership, in case of conntrack the owner would be
> > the program which sees the connection first, which is pretty much
> > unpredictable. For example, if the ingress program sees a connection
> > first, it installs a timer for this connection, but the traffic is
> > bidirectional, hence egress program needs this connection and its
> > timer too, we should not remove this timer when the ingress program
> > is freed.
>
> Sure. That's trivially achieved with pinning.

If users forget to do so, their ebpf program would crash the kernel,
right? But ebpf programs should never crash the kernel, right?

> One can have an ingress prog that tailcalls into another prog
> that arms the timer with one of its subprogs.
> Egress prog can tailcall into the same prog as well.
> The ingress and egress progs can be replaced one by one
> or removed both together and middle prog can stay alive
> if it's pinned in bpffs or held alive by FD.

This looks unnecessarily complex. Look at the overhead of using a
timer properly here:

1. pin the timer callback program
2. a program to install the timer
3. a program array containing the above program
4. a tail call into the above program array

Why not design a simpler solution?

> > From another point of view: maps and programs are both first-class
> > resources in eBPF, a timer is stored in a map and associated with a
> > program, so it is naturally a first-class resource too.
>
> Not really. The timer abstraction is about data. It invokes the callback.
> That callback is a part of the program. The lifetime of the timer object
> and lifetime of the callback can be different.
> Obviously the timer logic need to make sure that callback text is alive
> when the timer is armed.

Only if the callback could reference struct bpf_prog...
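To make the complexity concrete, here is a rough, non-compilable sketch of
the arrangement being discussed (the names `timer_progs`, `timer_installer`
and `handle_ingress` are made up, and the timer-arming helper itself is
omitted, since that API is exactly what is under discussion):

```c
/* Sketch only: illustrative names, and the timer helper does not exist yet. */

struct {
        __uint(type, BPF_MAP_TYPE_PROG_ARRAY);   /* step 3: prog array map */
        __uint(max_entries, 1);
        __type(key, __u32);
        __type(value, __u32);
} timer_progs SEC(".maps");

SEC("tc")
int timer_installer(struct __sk_buff *skb)
{
        /* step 2: the program that would arm the timer for a connection.
         * step 1 is pinning this program in bpffs (or holding its FD)
         * so it outlives the ingress/egress programs that call into it. */
        return TC_ACT_OK;
}

SEC("tc")
int handle_ingress(struct __sk_buff *skb)
{
        bpf_tail_call(skb, &timer_progs, 0);      /* step 4: tail call */
        /* egress prog would tail-call into the same slot */
        return TC_ACT_OK;
}
```

Four moving parts, and a pinned object the user has to remember to manage,
just to keep one callback alive.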
And even if it could, what about users forgetting to do so? The ebpf
verifier has to reject such cases.

> Combining timer and callback concepts creates a messy abstraction.
> In the normal kernel code one can have a timer in any kernel data
> structure and callback in the kernel text or in the kernel module.
> The code needs to make sure that the module won't go away while
> the timer is armed. Same thing with bpf progs. The progs are safe
> kernel modules. The timers are independent objects.

A kernel module can take a reference count on itself very easily, and
there is no verifier for kernel modules. I don't understand why you
want to make ebpf programs as close to kernel modules as possible in
this case.

> >
> > >
> > > >
> > > > > Also if your colleagues have something to share they should be
> > > > > posting to the mailing list. Right now you're acting as a broken phone
> > > > > passing info back and forth and the knowledge gets lost.
> > > > > Please ask your colleagues to participate online.
> > > >
> > > > They are already in CC from the very beginning. And our use case is
> > > > public, it is Cilium conntrack:
> > > > https://github.com/cilium/cilium/blob/master/bpf/lib/conntrack.h
> > > >
> > > > The entries of the code are:
> > > > https://github.com/cilium/cilium/blob/master/bpf/bpf_lxc.c
> > > >
> > > > The maps for conntrack are:
> > > > https://github.com/cilium/cilium/blob/master/bpf/lib/conntrack_map.h
> > >
> > > If that's the only goal then kernel timers are not needed.
> > > cilium conntrack works well as-is.
> >
> > We don't go back to why user-space cleanup is inefficient again,
> > do we? ;)
>
> I remain unconvinced that cilium conntrack _needs_ timer apis.
> It works fine in production and I don't hear any complaints
> from cilium users. So 'user space cleanup inefficiencies' is
> very subjective and cannot be the reason to add timer apis.
I am pretty sure I showed the original report to you when I sent the
timeout hashmap patch; in case you forgot, here it is again:

https://github.com/cilium/cilium/issues/5048

and let me quote the original report here:

"The current implementation (as of v1.2) for managing the contents of
the datapath connection tracking map leaves something to be desired:
Once per minute, the userspace cilium-agent makes a series of calls to
the bpf() syscall to fetch all of the entries in the map to determine
whether they should be deleted. For each entry in the map, 2-3 calls
must be made: One to fetch the next key, one to fetch the value, and
perhaps one to delete the entry. The maximum size of the map is
1 million entries, and if the current count approaches this size then
the garbage collection goroutine may spend a significant number of CPU
cycles iterating and deleting elements from the conntrack map."

(Adding Joe in Cc too.)

Thanks.