On Thu, Aug 17, 2023 at 11:30 PM Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote: > > On 8/17/23 9:09 AM, Yafang Shao wrote: > > On Thu, Aug 17, 2023 at 11:31 AM Alexei Starovoitov > > <alexei.starovoitov@xxxxxxxxx> wrote: > >> On Wed, Aug 16, 2023 at 7:31 PM Yafang Shao <laoar.shao@xxxxxxxxx> wrote: > >>> On Thu, Aug 17, 2023 at 9:54 AM Alexei Starovoitov > >>> <alexei.starovoitov@xxxxxxxxx> wrote: > >>>> On Mon, Aug 14, 2023 at 7:46 PM Yafang Shao <laoar.shao@xxxxxxxxx> wrote: > >>>>> On Tue, Aug 15, 2023 at 8:28 AM Yonghong Song <yonghong.song@xxxxxxxxx> wrote: > >>>>>> On 8/14/23 7:33 AM, Yafang Shao wrote: > >>>>>>> Add a new bpf_current_capable kfunc to check whether the current task > >>>>>>> has a specific capability. In our use case, we will use it in a lsm bpf > >>>>>>> program to help identify if the user operation is permitted. > >>>>>>> > >>>>>>> Signed-off-by: Yafang Shao <laoar.shao@xxxxxxxxx> > >>>>>>> --- > >>>>>>> kernel/bpf/helpers.c | 6 ++++++ > >>>>>>> 1 file changed, 6 insertions(+) > >>>>>>> > >>>>>>> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c > >>>>>>> index eb91cae..bbee7ea 100644 > >>>>>>> --- a/kernel/bpf/helpers.c > >>>>>>> +++ b/kernel/bpf/helpers.c > >>>>>>> @@ -2429,6 +2429,11 @@ __bpf_kfunc void bpf_rcu_read_unlock(void) > >>>>>>> rcu_read_unlock(); > >>>>>>> } > >>>>>>> > >>>>>>> +__bpf_kfunc bool bpf_current_capable(int cap) > >>>>>>> +{ > >>>>>>> + return has_capability(current, cap); > >>>>>>> +} > >>>>>> > >>>>>> Since you are testing against 'current' capabilities, I assume > >>>>>> that the context should be process. Otherwise, you are testing > >>>>>> against random task which does not make much sense. > >>>>> > >>>>> It is in the process context. > >>>>> > >>>>>> Since you are testing against 'current' cap, and if the capability > >>>>>> for that task is stable, you do not need this kfunc. > >>>>>> You can test cap in user space and pass it into the bpf program. > >>>>>> > >>>>>> But if the cap for your process may change in the middle of > >>>>>> run, then you indeed need bpf prog to test capability in real time. > >>>>>> Is this your use case and could you describe in in more detail? > >>>>> > >>>>> After we convert the capability of our networking bpf program from > >>>>> CAP_SYS_ADMIN to CAP_BPF+CAP_NET_ADMIN to enhance the security, we > >>>>> encountered the "pointer comparison prohibited" error, because > >>>>> allow_ptr_leaks is enabled only when CAP_PERFMON is set. However, if > >>>>> we enable the CAP_PERFMON for the networking bpf program, it can run > >>>>> tracing bpf prog, perf_event bpf prog and etc, that is not expected by > >>>>> us. > >>>>> > >>>>> Hence we are planning to use a lsm bpf program to disallow it from > >>>>> running other bpf programs. In our lsm bpf program we will check the > >>>>> capability of processes, if the process has cap_net_admin, cap_bpf and > >>>>> cap_perfmon but don't have cap_sys_admin we will refuse it to run > >>>>> tracing and perf_event bpf program. While if a process has cap_bpf > >>>>> and cap_perfmon but doesn't have cap_net_admin, that said it is a bpf > >>>>> program which wants to run trace bpf, we will allow it. > >>>>> > >>>>> We can't use lsm_cgroup because it is supported on cgroup2 only, while > >>>>> we're still using cgroup1. > >>>>> > >>>>> Another possible solution is enable allow_ptr_leaks for cap_net_admin > >>>>> as well, but after I checked the commit which introduces the cap_bpf > >>>>> and cap_perfmon [1], I think we wouldn't like to do it. > >>>> > >>>> Sorry. None of these options are acceptable. > >>>> > >>>> The idea of introducing a bpf_current_capable() kfunc just to work > >>>> around a deficiency in the verifier is not sound. > >>> > >>> So what should we do then? > >>> Just enabling the cap_perfmon for it? That does not sound as well ... > >> > >> Yonghong already pointed out upthread that > >> comparison of two packet pointers is not a pointer leak. > >> See this code: > >> } else if (!try_match_pkt_pointers(insn, dst_reg, ®s[insn->src_reg], > >> this_branch, other_branch) && > >> is_pointer_value(env, insn->dst_reg)) { > >> verbose(env, "R%d pointer comparison prohibited\n", > >> insn->dst_reg); > >> return -EACCES; > >> } > >> > >> It's not clear why it doesn't address your case. > > > > It can address the issue. > > It seems we should do the code change below. > > I presume in your actual program you also access the IP header (otherwise it > would be quite useless), and as you mentioned above, you would like this to be > accessible for just CAP_BPF + CAP_NET_ADMIN. The CAP_PERFMON restriction is > there for a reason, that is, to be on same cap level as tracing programs as you > could craft Spectre v1 style access. It's not a deficiency. Hi Daniel, We also hit an error caused by bpf_bypass_spec_v1 in our networking-bpf with cap_net_admin+cap_bpf. Here is a simple reproducer, #include <linux/pkt_cls.h> #include <linux/if_ether.h> #include <linux/ip.h> #include <linux/tcp.h> #include <linux/bpf.h> #include <bpf/bpf_helpers.h> struct flow_stats_v { __u32 egress_packets; __u32 ingress_packets; }; /* ELF map definition */ struct bpf_elf_map { __u32 type; __u32 size_key; __u32 size_value; __u32 max_elem; __u32 flags; __u32 id; __u32 pinning; __u32 inner_id; __u32 inner_idx; }; struct bpf_elf_map SEC("maps") flow_stats = { .type = BPF_MAP_TYPE_HASH, .size_key = sizeof(__u32), .size_value = sizeof(struct flow_stats_v), .max_elem = 32, .flags = BPF_F_NO_PREALLOC, .pinning = 0, }; struct bpf_elf_map SEC("maps") classid = { .type = BPF_MAP_TYPE_HASH, .size_key = sizeof(__be32), .size_value = sizeof(__be32), .max_elem = 32, .flags = BPF_F_NO_PREALLOC, .pinning = 0, }; struct flow { __be32 local_classid; __be32 local_addr; __u8 dir; }; static void flow_stat(struct __sk_buff *skb, struct flow *flow) { struct flow_stats_v *value; __u32 key = 0; // This value is not relavant value = bpf_map_lookup_elem(&flow_stats, &key); if (!value) return; if (flow->dir == 0) __sync_fetch_and_add(&value->egress_packets, 1); else __sync_fetch_and_add(&value->ingress_packets, 1); } int handle_packet(struct __sk_buff *skb, int dir) { void *data_end = (void *)(long)skb->data_end; void *data = (void *)(long)skb->data; struct flow flow = {}; __be32 *id; flow.dir = dir; flow.local_addr = 1; // This value is not relavant id = bpf_map_lookup_elem(&classid, &flow.local_addr); if (id && *id) flow.local_classid = *id; flow_stat(skb, &flow); return TC_ACT_OK; } SEC("cls-ingress") int ingress(struct __sk_buff *skb) { return handle_packet(skb, 1); } char _license[] SEC("license") = "GPL"; The error log of the bpf verifier as follows, 23: (71) r3 = *(u8 *)(r10 -8) ; R3_w=scalar(umax=255,var_off=(0x0; 0xff)) R10=fp0 fp-8=mmmm???m 24: (b7) r2 = 1 ; R2_w=1 25: (55) if r3 != 0x0 goto pc+1 27: R0=map_value(off=0,ks=4,vs=8,imm=0) R1_w=1 R2_w=1 R3_w=scalar(umax=255,var_off=(0x0; 0xff)) R10=fp0 fp-8=mmmm???m fp-16= 27: (67) r2 <<= 2 ; R2_w=4 28: (0f) r0 += r2 R0 tried to add from different maps, paths or scalars, pointer arithmetic with it prohibited for !root processed 31 insns (limit 1000000000) max_states_per_insn 0 total_states 3 peak_states 3 mark_read 1 It can be fixed by the code change below in the kernel, diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index b6b60cd..0e200a9 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -11750,7 +11750,7 @@ static int update_alu_sanitation_state(struct bpf_insn_aux_data *aux, /* If we arrived here from different branches with different * state or limits to sanitize, then this won't work. */ - if (aux->alu_state && + if (aux->alu_state && aux->alu_limit && (aux->alu_state != alu_state || aux->alu_limit != alu_limit)) return REASON_PATHS; But I'm not sure if it is a proper fix. We can workaround the issue by modifying our bpf program as follows, @@ -77,7 +77,21 @@ int handle_packet(struct __sk_buff *skb, int dir) if (id && *id) flow.local_classid = *id; - flow_stat(skb, &flow); + { + struct flow_stats_v *value; + __u32 key = 0; // This value is not relavant + + value = bpf_map_lookup_elem(&flow_stats, &key); + if (!value) + return TC_ACT_OK; + + if (flow.dir == 0) + __sync_fetch_and_add(&value->egress_packets, 1); + else + __sync_fetch_and_add(&value->ingress_packets, 1); + + } + // flow_stat(skb, &flow); return TC_ACT_OK; } But I don't have a clear idea why. It seems to me that the bpf verifier under cap_net_admin+cap_bpf is fragile. I'm wondering if it is possible to add a sysctl to bypass the perfmon check : --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -2091,22 +2091,22 @@ static inline void bpf_map_dec_elem_count(struct bpf_map *map) static inline bool bpf_allow_ptr_leaks(void) { - return perfmon_capable(); + return perfmon_capable() || syscal_bypass_perfmon; } static inline bool bpf_allow_uninit_stack(void) { - return perfmon_capable(); + return perfmon_capable() || syscal_bypass_perfmon; } static inline bool bpf_bypass_spec_v1(void) { - return perfmon_capable(); + return perfmon_capable() || syscal_bypass_perfmon; } static inline bool bpf_bypass_spec_v4(void) { - return perfmon_capable(); + return perfmon_capable() || syscal_bypass_perfmon; } int bpf_map_new_fd(struct bpf_map *map, int flags); Do you have any suggestions? -- Regards Yafang