Re: [PATCH v3 bpf-next 4/5] net: netfilter: add kfunc helper to add a new ct entry

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 





On 5/19/22 4:23 AM, Kumar Kartikeya Dwivedi wrote:
On Thu, May 19, 2022 at 04:15:51PM IST, Toke Høiland-Jørgensen wrote:
Kumar Kartikeya Dwivedi <memxor@xxxxxxxxx> writes:

On Thu, May 19, 2022 at 03:44:58AM IST, Alexei Starovoitov wrote:
On Wed, May 18, 2022 at 2:09 PM Lorenzo Bianconi
<lorenzo.bianconi@xxxxxxxxxx> wrote:

Lorenzo Bianconi <lorenzo@xxxxxxxxxx> writes:

Introduce bpf_xdp_ct_add and bpf_skb_ct_add kfunc helpers in order to
add a new entry to ct map from an ebpf program.
Introduce bpf_nf_ct_tuple_parse utility routine.

Signed-off-by: Lorenzo Bianconi <lorenzo@xxxxxxxxxx>
---
  net/netfilter/nf_conntrack_bpf.c | 212 +++++++++++++++++++++++++++----
  1 file changed, 189 insertions(+), 23 deletions(-)

diff --git a/net/netfilter/nf_conntrack_bpf.c b/net/netfilter/nf_conntrack_bpf.c
index a9271418db88..3d31b602fdf1 100644
--- a/net/netfilter/nf_conntrack_bpf.c
+++ b/net/netfilter/nf_conntrack_bpf.c
@@ -55,41 +55,114 @@ enum {
     NF_BPF_CT_OPTS_SZ = 12,
  };

-static struct nf_conn *__bpf_nf_ct_lookup(struct net *net,
-                                     struct bpf_sock_tuple *bpf_tuple,
-                                     u32 tuple_len, u8 protonum,
-                                     s32 netns_id, u8 *dir)
+static int bpf_nf_ct_tuple_parse(struct bpf_sock_tuple *bpf_tuple,
+                            u32 tuple_len, u8 protonum, u8 dir,
+                            struct nf_conntrack_tuple *tuple)
  {
-   struct nf_conntrack_tuple_hash *hash;
-   struct nf_conntrack_tuple tuple;
-   struct nf_conn *ct;
+   union nf_inet_addr *src = dir ? &tuple->dst.u3 : &tuple->src.u3;
+   union nf_inet_addr *dst = dir ? &tuple->src.u3 : &tuple->dst.u3;
+   union nf_conntrack_man_proto *sport = dir ? (void *)&tuple->dst.u
+                                             : &tuple->src.u;
+   union nf_conntrack_man_proto *dport = dir ? &tuple->src.u
+                                             : (void *)&tuple->dst.u;

     if (unlikely(protonum != IPPROTO_TCP && protonum != IPPROTO_UDP))
-           return ERR_PTR(-EPROTO);
-   if (unlikely(netns_id < BPF_F_CURRENT_NETNS))
-           return ERR_PTR(-EINVAL);
+           return -EPROTO;
+
+   memset(tuple, 0, sizeof(*tuple));

-   memset(&tuple, 0, sizeof(tuple));
     switch (tuple_len) {
     case sizeof(bpf_tuple->ipv4):
-           tuple.src.l3num = AF_INET;
-           tuple.src.u3.ip = bpf_tuple->ipv4.saddr;
-           tuple.src.u.tcp.port = bpf_tuple->ipv4.sport;
-           tuple.dst.u3.ip = bpf_tuple->ipv4.daddr;
-           tuple.dst.u.tcp.port = bpf_tuple->ipv4.dport;
+           tuple->src.l3num = AF_INET;
+           src->ip = bpf_tuple->ipv4.saddr;
+           sport->tcp.port = bpf_tuple->ipv4.sport;
+           dst->ip = bpf_tuple->ipv4.daddr;
+           dport->tcp.port = bpf_tuple->ipv4.dport;
             break;
     case sizeof(bpf_tuple->ipv6):
-           tuple.src.l3num = AF_INET6;
-           memcpy(tuple.src.u3.ip6, bpf_tuple->ipv6.saddr, sizeof(bpf_tuple->ipv6.saddr));
-           tuple.src.u.tcp.port = bpf_tuple->ipv6.sport;
-           memcpy(tuple.dst.u3.ip6, bpf_tuple->ipv6.daddr, sizeof(bpf_tuple->ipv6.daddr));
-           tuple.dst.u.tcp.port = bpf_tuple->ipv6.dport;
+           tuple->src.l3num = AF_INET6;
+           memcpy(src->ip6, bpf_tuple->ipv6.saddr, sizeof(bpf_tuple->ipv6.saddr));
+           sport->tcp.port = bpf_tuple->ipv6.sport;
+           memcpy(dst->ip6, bpf_tuple->ipv6.daddr, sizeof(bpf_tuple->ipv6.daddr));
+           dport->tcp.port = bpf_tuple->ipv6.dport;
             break;
     default:
-           return ERR_PTR(-EAFNOSUPPORT);
+           return -EAFNOSUPPORT;
     }
+   tuple->dst.protonum = protonum;
+   tuple->dst.dir = dir;
+
+   return 0;
+}

-   tuple.dst.protonum = protonum;
+struct nf_conn *
+__bpf_nf_ct_alloc_entry(struct net *net, struct bpf_sock_tuple *bpf_tuple,
+                   u32 tuple_len, u8 protonum, s32 netns_id, u32 timeout)
+{
+   struct nf_conntrack_tuple otuple, rtuple;
+   struct nf_conn *ct;
+   int err;
+
+   if (unlikely(netns_id < BPF_F_CURRENT_NETNS))
+           return ERR_PTR(-EINVAL);
+
+   err = bpf_nf_ct_tuple_parse(bpf_tuple, tuple_len, protonum,
+                               IP_CT_DIR_ORIGINAL, &otuple);
+   if (err < 0)
+           return ERR_PTR(err);
+
+   err = bpf_nf_ct_tuple_parse(bpf_tuple, tuple_len, protonum,
+                               IP_CT_DIR_REPLY, &rtuple);
+   if (err < 0)
+           return ERR_PTR(err);
+
+   if (netns_id >= 0) {
+           net = get_net_ns_by_id(net, netns_id);
+           if (unlikely(!net))
+                   return ERR_PTR(-ENONET);
+   }
+
+   ct = nf_conntrack_alloc(net, &nf_ct_zone_dflt, &otuple, &rtuple,
+                           GFP_ATOMIC);
+   if (IS_ERR(ct))
+           goto out;
+
+   ct->timeout = timeout * HZ + jiffies;
+   ct->status |= IPS_CONFIRMED;
+
+   memset(&ct->proto, 0, sizeof(ct->proto));
+   if (protonum == IPPROTO_TCP)
+           ct->proto.tcp.state = TCP_CONNTRACK_ESTABLISHED;

Hmm, isn't it a bit limiting to hard-code this to ESTABLISHED
connections? Presumably for TCP you'd want to use this when you see a
SYN and then rely on conntrack to help with the subsequent state
tracking for when the SYN-ACK comes back? What's the usecase for
creating an entry in ESTABLISHED state, exactly?

I guess we can even add a parameter and pass the state from the caller.
I was not sure if it is mandatory.

It's probably cleaner and more flexible to split
_alloc and _insert into two kfuncs and let bpf
prog populate ct directly.

Right, so we can just whitelist a few fields and allow assignments into those.
One small problem is that we should probably only permit this for nf_conn
PTR_TO_BTF_ID obtained from _alloc, and make it rdonly on _insert.

We can do the rw->ro conversion by taking in ref from alloc, and releasing on
_insert, then returning ref from _insert.

Sounds reasonable enough; I guess _insert would also need to
sanity-check some of the values to prevent injecting invalid state into
the conntrack table.

For the other part, either return a different shadow PTR_TO_BTF_ID
with only the fields that can be set, convert insns for it, and then
on insert return the rdonly PTR_TO_BTF_ID of struct nf_conn, or
otherwise store the source func in the per-register state and use that
to deny BPF_WRITE for normal nf_conn. Thoughts?

Hmm, if they're different BTF IDs wouldn't the BPF program have to be
aware of this and use two different structs for the pointer storage?
That seems a bit awkward from an API PoV?


You only need to use a different pointer after _alloc and pass it into _insert.

Like:
	struct nf_conn_alloc *nfa = nf_alloc(...);
	if (!nfa) { ... }
	nfa->status = ...; // gets converted to nf_conn access
	nfa->tcp_status = ...; // ditto
	struct nf_conn *nf = nf_insert(nfa, ...); // nfa released, nf acquired

The problem is that if I whitelist it for nf_conn as a whole so that we can
assign after _alloc, there is no way to prevent BPF_WRITE for nf_conn obtained
from other functions. We can fix it though by remembering which function a
pointer came from, then you wouldn't need a different struct. I was just
soliciting opinions for different options. I am leaning towards not having to
use a separate struct as well.

Is it possible that we define the signature of nf_insert() as
  const struct nf_conn *nf_insert(...)
so for
  const struct nf_conn *nf = nf_insert(nfa, ...);

if there are any nf->status = ..., the compiler will emit a warning.

Also verifier can know the return value of nf_insert() is read-only
and can prevent value overwrite.

Maybe I missed some context, but the above is based on what
I understood so far.


Also, what about updating? For this to be useful with TCP, you'd really
want to be able to update the CT state as the connection is going
through the handshake state transitions...


I think updates should be done using dedicated functions, like the timeout
helper. Whatever synchronization is needed to update the CT can go into that
function, instead of allowing direct writes after _insert.

-Toke


--
Kartikeya



[Index of Archives]     [Linux Samsung SoC]     [Linux Rockchip SoC]     [Linux Actions SoC]     [Linux for Synopsys ARC Processors]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]


  Powered by Linux