Re: [RFC PATCH bpf-next 1/2] bpf: Add bpf_current_capable kfunc

Yafang Shao <laoar.shao@xxxxxxxxx> · Mon, 21 Aug 2023 13:56:47 +0800

On Thu, Aug 17, 2023 at 11:30 PM Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote:
>
> On 8/17/23 9:09 AM, Yafang Shao wrote:
> > On Thu, Aug 17, 2023 at 11:31 AM Alexei Starovoitov
> > <alexei.starovoitov@xxxxxxxxx> wrote:
> >> On Wed, Aug 16, 2023 at 7:31 PM Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
> >>> On Thu, Aug 17, 2023 at 9:54 AM Alexei Starovoitov
> >>> <alexei.starovoitov@xxxxxxxxx> wrote:
> >>>> On Mon, Aug 14, 2023 at 7:46 PM Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
> >>>>> On Tue, Aug 15, 2023 at 8:28 AM Yonghong Song <yonghong.song@xxxxxxxxx> wrote:
> >>>>>> On 8/14/23 7:33 AM, Yafang Shao wrote:
> >>>>>>> Add a new bpf_current_capable kfunc to check whether the current task
> >>>>>>> has a specific capability. In our use case, we will use it in a lsm bpf
> >>>>>>> program to help identify if the user operation is permitted.
> >>>>>>>
> >>>>>>> Signed-off-by: Yafang Shao <laoar.shao@xxxxxxxxx>
> >>>>>>> ---
> >>>>>>>    kernel/bpf/helpers.c | 6 ++++++
> >>>>>>>    1 file changed, 6 insertions(+)
> >>>>>>>
> >>>>>>> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> >>>>>>> index eb91cae..bbee7ea 100644
> >>>>>>> --- a/kernel/bpf/helpers.c
> >>>>>>> +++ b/kernel/bpf/helpers.c
> >>>>>>> @@ -2429,6 +2429,11 @@ __bpf_kfunc void bpf_rcu_read_unlock(void)
> >>>>>>>        rcu_read_unlock();
> >>>>>>>    }
> >>>>>>>
> >>>>>>> +__bpf_kfunc bool bpf_current_capable(int cap)
> >>>>>>> +{
> >>>>>>> +     return has_capability(current, cap);
> >>>>>>> +}
> >>>>>>
> >>>>>> Since you are testing against 'current' capabilities, I assume
> >>>>>> that the context should be process. Otherwise, you are testing
> >>>>>> against random task which does not make much sense.
> >>>>>
> >>>>> It is in the process context.
> >>>>>
> >>>>>> Since you are testing against 'current' cap, and if the capability
> >>>>>> for that task is stable, you do not need this kfunc.
> >>>>>> You can test cap in user space and pass it into the bpf program.
> >>>>>>
> >>>>>> But if the cap for your process may change in the middle of
> >>>>>> run, then you indeed need bpf prog to test capability in real time.
> >>>>>> Is this your use case and could you describe in in more detail?
> >>>>>
> >>>>> After we convert the capability of our networking bpf program from
> >>>>> CAP_SYS_ADMIN to CAP_BPF+CAP_NET_ADMIN to enhance the security, we
> >>>>> encountered the "pointer comparison prohibited" error, because
> >>>>> allow_ptr_leaks is enabled only when CAP_PERFMON is set. However, if
> >>>>> we enable the CAP_PERFMON for the networking bpf program, it can run
> >>>>> tracing bpf prog, perf_event bpf prog and etc, that is not expected by
> >>>>> us.
> >>>>>
> >>>>> Hence we are planning to use a lsm bpf program to disallow it from
> >>>>> running other bpf programs. In our lsm bpf program we will check the
> >>>>> capability of processes, if the process has cap_net_admin, cap_bpf and
> >>>>> cap_perfmon but don't have cap_sys_admin we will refuse it to run
> >>>>> tracing and perf_event bpf program. While if a process has  cap_bpf
> >>>>> and cap_perfmon but doesn't have cap_net_admin, that said it is a bpf
> >>>>> program which wants to run trace bpf, we will allow it.
> >>>>>
> >>>>> We can't use lsm_cgroup because it is supported on cgroup2 only, while
> >>>>> we're still using cgroup1.
> >>>>>
> >>>>> Another possible solution is enable allow_ptr_leaks for cap_net_admin
> >>>>> as well, but after I checked the commit which introduces the cap_bpf
> >>>>> and cap_perfmon [1], I think we wouldn't like to do it.
> >>>>
> >>>> Sorry. None of these options are acceptable.
> >>>>
> >>>> The idea of introducing a bpf_current_capable() kfunc just to work
> >>>> around a deficiency in the verifier is not sound.
> >>>
> >>> So what should we do then?
> >>> Just enabling the cap_perfmon for it? That does not sound as well ...
> >>
> >> Yonghong already pointed out upthread that
> >> comparison of two packet pointers is not a pointer leak.
> >> See this code:
> >>          } else if (!try_match_pkt_pointers(insn, dst_reg, &regs[insn->src_reg],
> >>                                             this_branch, other_branch) &&
> >>                     is_pointer_value(env, insn->dst_reg)) {
> >>                  verbose(env, "R%d pointer comparison prohibited\n",
> >>                          insn->dst_reg);
> >>                  return -EACCES;
> >>          }
> >>
> >> It's not clear why it doesn't address your case.
> >
> > It can address the issue.
> > It seems we should do the code change below.
>
> I presume in your actual program you also access the IP header (otherwise it
> would be quite useless), and as you mentioned above, you would like this to be
> accessible for just CAP_BPF + CAP_NET_ADMIN. The CAP_PERFMON restriction is
> there for a reason, that is, to be on same cap level as tracing programs as you
> could craft Spectre v1 style access. It's not a deficiency.

Hi Daniel,

We also hit an error caused by bpf_bypass_spec_v1 in our
networking-bpf with cap_net_admin+cap_bpf.

Here is a simple reproducer,

#include <linux/pkt_cls.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <linux/bpf.h>

#include <bpf/bpf_helpers.h>

struct flow_stats_v {
    __u32 egress_packets;
    __u32 ingress_packets;
};

/* ELF map definition */
struct bpf_elf_map {
    __u32 type;
    __u32 size_key;
    __u32 size_value;
    __u32 max_elem;
    __u32 flags;
    __u32 id;
    __u32 pinning;
    __u32 inner_id;
    __u32 inner_idx;
};

struct bpf_elf_map SEC("maps") flow_stats = {
    .type = BPF_MAP_TYPE_HASH,
    .size_key = sizeof(__u32),
    .size_value = sizeof(struct flow_stats_v),
    .max_elem = 32,
    .flags = BPF_F_NO_PREALLOC,
    .pinning = 0,
};

struct bpf_elf_map SEC("maps") classid = {
    .type = BPF_MAP_TYPE_HASH,
    .size_key = sizeof(__be32),
    .size_value = sizeof(__be32),
    .max_elem = 32,
    .flags = BPF_F_NO_PREALLOC,
    .pinning = 0,
};

struct flow {
    __be32 local_classid;
    __be32 local_addr;
    __u8 dir;
};

static void flow_stat(struct __sk_buff *skb, struct flow *flow)
{
        struct flow_stats_v *value;
        __u32 key = 0; // This value is not relavant

        value = bpf_map_lookup_elem(&flow_stats, &key);
        if (!value)
                return;

        if (flow->dir == 0)
                __sync_fetch_and_add(&value->egress_packets, 1);
        else
                __sync_fetch_and_add(&value->ingress_packets, 1);
}

int handle_packet(struct __sk_buff *skb, int dir)
{
        void *data_end = (void *)(long)skb->data_end;
        void *data = (void *)(long)skb->data;
        struct flow flow = {};
        __be32 *id;

        flow.dir = dir;
        flow.local_addr = 1; // This value is not relavant

        id = bpf_map_lookup_elem(&classid, &flow.local_addr);
        if (id && *id)
                flow.local_classid = *id;

        flow_stat(skb, &flow);
        return TC_ACT_OK;
}

SEC("cls-ingress")
int ingress(struct __sk_buff *skb)
{
        return handle_packet(skb, 1);
}

char _license[] SEC("license") = "GPL";


The error log of the bpf verifier as follows,

23: (71) r3 = *(u8 *)(r10 -8)         ;
R3_w=scalar(umax=255,var_off=(0x0; 0xff)) R10=fp0 fp-8=mmmm???m
24: (b7) r2 = 1                       ; R2_w=1
25: (55) if r3 != 0x0 goto pc+1 27:
R0=map_value(off=0,ks=4,vs=8,imm=0) R1_w=1 R2_w=1
R3_w=scalar(umax=255,var_off=(0x0; 0xff)) R10=fp0 fp-8=mmmm???m fp-16=
27: (67) r2 <<= 2                     ; R2_w=4
28: (0f) r0 += r2
R0 tried to add from different maps, paths or scalars, pointer
arithmetic with it prohibited for !root
processed 31 insns (limit 1000000000) max_states_per_insn 0
total_states 3 peak_states 3 mark_read 1

It can be fixed by the code change below in the kernel,

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index b6b60cd..0e200a9 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -11750,7 +11750,7 @@ static int update_alu_sanitation_state(struct
bpf_insn_aux_data *aux,
        /* If we arrived here from different branches with different
         * state or limits to sanitize, then this won't work.
         */
-       if (aux->alu_state &&
+       if (aux->alu_state && aux->alu_limit &&
            (aux->alu_state != alu_state ||
             aux->alu_limit != alu_limit))
                return REASON_PATHS;

But I'm not sure if it is a proper fix.

We can workaround the issue by modifying our bpf program as follows,

@@ -77,7 +77,21 @@ int handle_packet(struct __sk_buff *skb, int dir)
        if (id && *id)
                flow.local_classid = *id;

-       flow_stat(skb, &flow);
+       {
+       struct flow_stats_v *value;
+       __u32 key = 0; // This value is not relavant
+
+       value = bpf_map_lookup_elem(&flow_stats, &key);
+       if (!value)
+               return TC_ACT_OK;
+
+       if (flow.dir == 0)
+               __sync_fetch_and_add(&value->egress_packets, 1);
+       else
+               __sync_fetch_and_add(&value->ingress_packets, 1);
+
+       }
+       // flow_stat(skb, &flow);
        return TC_ACT_OK;
 }

But I don't have a clear idea why.
It seems to me that the bpf verifier under cap_net_admin+cap_bpf is
fragile.  I'm wondering if it is possible to add a sysctl to bypass
the perfmon check :

--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -2091,22 +2091,22 @@ static inline void
bpf_map_dec_elem_count(struct bpf_map *map)

 static inline bool bpf_allow_ptr_leaks(void)
 {
-       return perfmon_capable();
+       return perfmon_capable() || syscal_bypass_perfmon;
 }

 static inline bool bpf_allow_uninit_stack(void)
 {
-       return perfmon_capable();
+       return perfmon_capable() || syscal_bypass_perfmon;
 }

 static inline bool bpf_bypass_spec_v1(void)
 {
-       return perfmon_capable();
+       return perfmon_capable() || syscal_bypass_perfmon;
 }

 static inline bool bpf_bypass_spec_v4(void)
 {
-       return perfmon_capable();
+       return perfmon_capable() || syscal_bypass_perfmon;
 }

 int bpf_map_new_fd(struct bpf_map *map, int flags);

Do you have any suggestions?

-- 
Regards
Yafang