Martin Lau <kafai@xxxxxx> [Thu, 2019-12-12 10:18 -0800]: > On Tue, Dec 10, 2019 at 06:33:29PM -0800, Andrey Ignatov wrote: > > The common use-case in production is to have multiple cgroup-bpf > > programs per attach type that cover multiple use-cases. Such programs > > are attached with BPF_F_ALLOW_MULTI and can be maintained by different > > people. > > > > Order of programs usually matters, for example imagine two egress > > programs: the first one drops packets and the second one counts packets. > > If they're swapped the result of counting program will be different. > > > > It brings operational challenges with updating cgroup-bpf program(s) > > attached with BPF_F_ALLOW_MULTI since there is no way to replace a > > program: > > > > * One way to update is to detach all programs first and then attach the > > new version(s) again in the right order. This introduces an > > interruption in the work a program is doing and may not be acceptable > > (e.g. if it's egress firewall); > > > > * Another way is attach the new version of a program first and only then > > detach the old version. This introduces the time interval when two > > versions of same program are working, what may not be acceptable if a > > program is not idempotent. It also imposes additional burden on > > program developers to make sure that two versions of their program can > > co-exist. > > > > Solve the problem by introducing a "replace" mode in BPF_PROG_ATTACH > > command for cgroup-bpf programs being attached with BPF_F_ALLOW_MULTI > > flag. This mode is enabled by newly introduced BPF_F_REPLACE attach flag > > and bpf_attr.replace_bpf_fd attribute to pass fd of the old program to > > replace > > > > That way user can replace any program among those attached with > > BPF_F_ALLOW_MULTI flag without the problems described above. > > > > Details of the new API: > > > > * If BPF_F_REPLACE is set but replace_bpf_fd doesn't have valid > > descriptor of BPF program, BPF_PROG_ATTACH will return corresponding > > error (EINVAL or EBADF). > > > > * If replace_bpf_fd has valid descriptor of BPF program but such a > > program is not attached to specified cgroup, BPF_PROG_ATTACH will > > return ENOENT. > > > > BPF_F_REPLACE is introduced to make the user intend clear, since > > replace_bpf_fd alone can't be used for this (its default value, 0, is a > > valid fd). BPF_F_REPLACE also makes it possible to extend the API in the > > future (e.g. add BPF_F_BEFORE and BPF_F_AFTER if needed). > Thanks for the details explanation. > > [ ... ] > > > diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c > > index 283efe3ce052..45346c79613a 100644 > > --- a/kernel/bpf/cgroup.c > > +++ b/kernel/bpf/cgroup.c > > @@ -282,14 +282,17 @@ static int update_effective_progs(struct cgroup *cgrp, > > * propagate the change to descendants > > * @cgrp: The cgroup which descendants to traverse > > * @prog: A program to attach > > + * @replace_prog: Previously attached program to replace if BPF_F_REPLACE is set > > * @type: Type of attach operation > > * @flags: Option flags > > * > > * Must be called with cgroup_mutex held. > > */ > > int __cgroup_bpf_attach(struct cgroup *cgrp, struct bpf_prog *prog, > > + struct bpf_prog *replace_prog, > > enum bpf_attach_type type, u32 flags) > > { > > + u32 saved_flags = (flags & (BPF_F_ALLOW_OVERRIDE | BPF_F_ALLOW_MULTI)); > > struct list_head *progs = &cgrp->bpf.progs[type]; > > struct bpf_prog *old_prog = NULL; > > struct bpf_cgroup_storage *storage[MAX_BPF_CGROUP_STORAGE_TYPE], > > @@ -298,14 +301,15 @@ int __cgroup_bpf_attach(struct cgroup *cgrp, struct bpf_prog *prog, > > enum bpf_cgroup_storage_type stype; > > int err; > > > > - if ((flags & BPF_F_ALLOW_OVERRIDE) && (flags & BPF_F_ALLOW_MULTI)) > > + if (((flags & BPF_F_ALLOW_OVERRIDE) && (flags & BPF_F_ALLOW_MULTI)) || > > + ((flags & BPF_F_REPLACE) && !(flags & BPF_F_ALLOW_MULTI))) > > /* invalid combination */ > > return -EINVAL; > > > > if (!hierarchy_allows_attach(cgrp, type)) > > return -EPERM; > > > > - if (!list_empty(progs) && cgrp->bpf.flags[type] != flags) > > + if (!list_empty(progs) && cgrp->bpf.flags[type] != saved_flags) > > /* Disallow attaching non-overridable on top > > * of existing overridable in this cgroup. > > * Disallow attaching multi-prog if overridable or none > > @@ -320,7 +324,12 @@ int __cgroup_bpf_attach(struct cgroup *cgrp, struct bpf_prog *prog, > > if (pl->prog == prog) > > /* disallow attaching the same prog twice */ > > return -EINVAL; > > + if (pl->prog == replace_prog) > > + replace_pl = pl; > > } > > + if ((flags & BPF_F_REPLACE) && !replace_pl) > > + /* prog to replace not found for cgroup */ > > + return -ENOENT; > > } else if (!list_empty(progs)) { > > replace_pl = list_first_entry(progs, typeof(*pl), node); > > } > > @@ -356,7 +365,7 @@ int __cgroup_bpf_attach(struct cgroup *cgrp, struct bpf_prog *prog, > > for_each_cgroup_storage_type(stype) > > pl->storage[stype] = storage[stype]; > > > > - cgrp->bpf.flags[type] = flags; > > + cgrp->bpf.flags[type] = saved_flags; > > > > err = update_effective_progs(cgrp, type); > > if (err) > > @@ -522,6 +531,7 @@ int __cgroup_bpf_query(struct cgroup *cgrp, const union bpf_attr *attr, > > int cgroup_bpf_prog_attach(const union bpf_attr *attr, > > enum bpf_prog_type ptype, struct bpf_prog *prog) > > { > > + struct bpf_prog *replace_prog = NULL; > > struct cgroup *cgrp; > > int ret; > > > > @@ -529,8 +539,20 @@ int cgroup_bpf_prog_attach(const union bpf_attr *attr, > > if (IS_ERR(cgrp)) > > return PTR_ERR(cgrp); > > > > - ret = cgroup_bpf_attach(cgrp, prog, attr->attach_type, > > + if ((attr->attach_flags & BPF_F_ALLOW_MULTI) && > > + (attr->attach_flags & BPF_F_REPLACE)) { > The patch looks good. One optional nit for consideration, > > Since it is testing BPF_F_REPLACE here already, > how about moving the > "((flags & BPF_F_REPLACE) && !(flags & BPF_F_ALLOW_MULTI))" > test from __cgroup_bpf_attach() to this function here? > Clear the BPF_F_REPLACE bit before passing to cgroup_bpf_attach(). > > Then the "saved_flags" logic in cgroup_bpf_attach() can go away. > cgroup_bpf_attach() can work on the "replace_prog" alone. > > Acked-by: Martin KaFai Lau <kafai@xxxxxx> Thank you for review Martin! I considered doing exactly this and a few other options to split the logic between __cgroup_bpf_attach() and cgroup_bpf_prog_attach() since it's not super clear what belongs where, but decided to go with the current approach. A couple of reasons I split it this way: 1) To keep the whole logic and decisions in __cgroup_bpf_attach() and use cgroup_bpf_prog_attach() only to acquire cgroup-bpf specific resources that correspond to the user input, such as cgroup and program to replace. Unfortunately to acquire replace_prog I still need to check flags to avoid unnecessary work for the most common case when BPF_F_REPLACE is not set, but IMO it's better to keep the logic to verify flags combinations in one place, __cgroup_bpf_attach(). 2) Also I think saved_flags would be introduced sooner or later anyway if new flags are added since as it can be seen there is a clear separation between flags that control programs arrangement, like OVERRIDE and MULTI, and should be remembered for the whole life time of the program, and one-time-needed flags such as REPLACE that are needed only once to attach program and don't make sense in its future life time. > > + replace_prog = bpf_prog_get_type(attr->replace_bpf_fd, ptype); > > + if (IS_ERR(replace_prog)) { > > + cgroup_put(cgrp); > > + return PTR_ERR(replace_prog); > > + } > > + } > > + > > + ret = cgroup_bpf_attach(cgrp, prog, replace_prog, attr->attach_type, > > attr->attach_flags); > > + > > + if (replace_prog) > > + bpf_prog_put(replace_prog); > > cgroup_put(cgrp); > > return ret; > > } -- Andrey Ignatov