On 7/11/23 8:51 PM, Andrii Nakryiko wrote:
On Mon, Jul 10, 2023 at 5:23 PM Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
On Mon, Jul 10, 2023 at 10:12:11PM +0200, Daniel Borkmann wrote:
+ *
+ *	struct bpf_mprog_entry *entry, *peer;
+ *	int ret;
+ *
+ *	// bpf_mprog user-side lock
+ *	// fetch active @entry from attach location
+ *	[...]
+ *	ret = bpf_mprog_attach(entry, [...]);
+ *	if (ret >= 0) {
+ *		peer = bpf_mprog_peer(entry);
+ *		if (bpf_mprog_swap_entries(ret))
+ *			// swap @entry to @peer at attach location
+ *		bpf_mprog_commit(entry);
+ *		ret = 0;
+ *	} else {
+ *		// error path, bail out, propagate @ret
+ *	}
+ *	// bpf_mprog user-side unlock
+ *
+ * Detach case:
+ *
+ *	struct bpf_mprog_entry *entry, *peer;
+ *	bool release;
+ *	int ret;
+ *
+ *	// bpf_mprog user-side lock
+ *	// fetch active @entry from attach location
+ *	[...]
+ *	ret = bpf_mprog_detach(entry, [...]);
+ *	if (ret >= 0) {
+ *		release = ret == BPF_MPROG_FREE;
+ *		peer = release ? NULL : bpf_mprog_peer(entry);
+ *		if (bpf_mprog_swap_entries(ret))
+ *			// swap @entry to @peer at attach location
+ *		bpf_mprog_commit(entry);
+ *		if (release)
+ *			// free bpf_mprog_bundle
+ *		ret = 0;
+ *	} else {
+ *		// error path, bail out, propagate @ret
+ *	}
+ *	// bpf_mprog user-side unlock
Thanks for the doc. It helped a lot.
And when it's contained like this, it's easier to discuss the API.
It seems bpf_mprog_swap_entries() is trying to abstract the error code
away, but BPF_MPROG_FREE leaks out, and tcx_entry_needs_release()
captures it with an extra miniq_active twist, which I don't understand yet.
bpf_mprog_peer() is also leaking a bit of implementation detail.
Can we abstract it further, like:
ret = bpf_mprog_detach(entry, [...], &new_entry);
if (ret >= 0) {
	if (entry != new_entry)
		// swap @entry to @new_entry at attach location
	bpf_mprog_commit(entry);
	if (!new_entry)
		// free bpf_mprog_bundle
}
and make bpf_mprog_peer() internal to mprog. It will also allow removing
the BPF_MPROG_FREE vs SWAP distinction, since the peer is hidden.
if (entry != new_entry)
	// update
will also be easier to read inside the tcx code without looking into mprog details.
+1, agree, and I implemented this suggestion in the v5.
I'm actually thinking if it's possible to simplify it even further.
For example, do we even need a separate bpf_mprog_{attach,detach} and
bpf_mprog_commit()? So far it seems like bpf_mprog_commit() is
inevitable in case of success of attach/detach, so we might as well
just do it as the last step of attach/detach operation.
It needs to be done after the pointers have been swapped by the mprog user.
The only problem seems to be due to bpf_mprog interface doing this
optimization of replacing stuff in place, if possible, and allowing
the caller to not do the swap. How important is it to avoid that swap
of a bpf_mprog_fp (pointer)? Seems pretty cheap (and relatively rare
operation), so I wouldn't bother optimizing this.
I would like to keep it: for example, when an application comes up, fetches
its links from bpffs, and updates all its programs in place, that is exactly
the replace situation.
So how about we just say that there is always a swap. Internally in
bpf_mprog_bundle current entry is determined based on revision&1. We
can have bpf_mprog_cur_entry() to return a proper pointer after
commit. Or bpf_mprog_attach() can return proper new entry as output
parameter, whichever is preferable.
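The revision&1 scheme above can be sketched as a small userspace model
(names and layout here are illustrative assumptions, not the actual
bpf_mprog implementation): the bundle holds two entries, the low bit of
a monotonically increasing revision selects the active one, and commit
flips it by bumping the revision.

```c
#include <assert.h>

/* Hypothetical sketch, not the kernel API: a bundle with two entries
 * whose active one is selected by the low bit of the revision. */
struct entry { int progs; };

struct bundle {
	struct entry a, b;
	unsigned int revision;
};

/* Active entry after the last commit: revision & 1 picks a or b. */
static struct entry *cur_entry(struct bundle *bundle)
{
	return (bundle->revision & 1) ? &bundle->b : &bundle->a;
}

/* Commit bumps the revision, which atomically flips the active entry
 * from the next cur_entry() caller's point of view. */
static void commit(struct bundle *bundle)
{
	bundle->revision++;
}
```

With this, bpf_mprog_cur_entry() (or an output parameter) always hands
the caller the correct side without exposing a peer concept.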
As for BPF_MPROG_FREE. That seems like an unnecessary complication as
well. Caller can just check bpf_mprog_total() quickly, and if it
dropped to zero assume FREE. Unless there is something more subtle
there?
Agree, some may want to keep an empty bpf_mprog, others may want to
free it. I implemented it this way. I removed all the BPF_MPROG_*
return codes.
With the above, the interface will be much simpler, IMO. You just do
bpf_mprog_attach/detach, and then swap pointer to new bpf_mprog_entry.
Then you can check bpf_mprog_total() for zero, and clean up further,
if necessary.
We assume the caller does proper locking, so all of the above should be non-racy.
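The simplified caller flow being proposed (detach returns the new active
entry, the caller swaps the pointer, then checks the total to decide on
cleanup) might look roughly like the following userspace model. All names
here are stand-ins; none of this is the real kernel API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the simplified flow: two fixed entries, a
 * caller-owned active pointer, and a total count per entry. */
struct entry { int total; };

static struct entry entries[2];
static struct entry *active = &entries[0];

/* Model of detach: on success the peer entry holds one fewer program
 * and is returned via @new_entry for the caller to swap in. */
static int detach(struct entry *old, struct entry **new_entry)
{
	struct entry *peer = (old == &entries[0]) ? &entries[1]
						  : &entries[0];

	if (old->total == 0)
		return -1; /* nothing to detach */
	peer->total = old->total - 1;
	*new_entry = peer;
	return 0;
}

/* Model of bpf_mprog_total(): caller checks for zero to clean up. */
static int total(const struct entry *e)
{
	return e->total;
}
```

The point of the sketch is the caller's contract: attach/detach, swap the
pointer unconditionally, then a plain total-is-zero check replaces the
BPF_MPROG_FREE return code.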
BTW, combining commit with attach allows us to avoid that relatively
big bpf_mprog_cp array on the stack as well, because we will be able
to update bundle->cp_items in-place.
The only (I believe :)) big assumption I'm making in all of the above
is that the commit is inevitable, and we won't have a situation where we
start an attach, update fp/cp, and then decide to abort instead of going
for the commit. Is this possible? Can we avoid it with careful checks
upfront, doing the attach as a last step that cannot be undone?
P.S. I guess one bit that I might have simplified is that
synchronize_rcu() + bpf_prog_put(), but I'm not sure exactly why we
put the prog after sync_rcu. But if it's really necessary (and I assume it is) [...]
It is because users can still be in-flight on the old mprog_entry, so the
ref drop for the delete case must come after the synchronize_rcu().
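The ordering requirement Daniel describes can be illustrated with a toy
userspace model (stubs only; fake_synchronize_rcu() and fake_prog_put()
are invented stand-ins, not the kernel RCU or BPF APIs): readers that
started traversing the old entry may still hold the prog, so the final
ref drop is only safe once the grace period has elapsed.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the sync_rcu-before-put ordering. In the kernel, the
 * grace period guarantees no reader is still traversing the old
 * mprog_entry; only then may the prog's reference be dropped. */
static bool grace_period_elapsed;
static int prog_refcnt = 1;

/* Stand-in for synchronize_rcu(): once it returns, no reader that
 * started before it can still be in-flight on the old entry. */
static void fake_synchronize_rcu(void)
{
	grace_period_elapsed = true;
}

/* Stand-in for bpf_prog_put(): dropping the last ref before the grace
 * period elapsed would free a prog an in-flight reader may still use,
 * so the model asserts the required ordering. */
static void fake_prog_put(void)
{
	assert(grace_period_elapsed); /* ordering requirement */
	prog_refcnt--;
}
```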
Thanks again,
Daniel