Re: [PATCH bpf-next v3 02/16] bpf: Introduce SK_LOOKUP program type with a dedicated attach point

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, Jul 9, 2020 at 6:25 AM Jakub Sitnicki <jakub@xxxxxxxxxxxxxx> wrote:
>
> On Thu, Jul 09, 2020 at 06:08 AM CEST, Andrii Nakryiko wrote:
> > On Thu, Jul 2, 2020 at 2:25 AM Jakub Sitnicki <jakub@xxxxxxxxxxxxxx> wrote:
> >>
> >> Add a new program type BPF_PROG_TYPE_SK_LOOKUP with a dedicated attach type
> >> BPF_SK_LOOKUP. The new program kind is to be invoked by the transport layer
> >> when looking up a listening socket for a new connection request for
> >> connection oriented protocols, or when looking up an unconnected socket for
> >> a packet for connection-less protocols.
> >>
> >> When called, SK_LOOKUP BPF program can select a socket that will receive
> >> the packet. This serves as a mechanism to overcome the limits of what
> >> bind() API allows to express. Two use-cases driving this work are:
> >>
> >>  (1) steer packets destined to an IP range, on fixed port to a socket
> >>
> >>      192.0.2.0/24, port 80 -> NGINX socket
> >>
> >>  (2) steer packets destined to an IP address, on any port to a socket
> >>
> >>      198.51.100.1, any port -> L7 proxy socket
> >>
> >> In its run-time context program receives information about the packet that
> >> triggered the socket lookup. Namely IP version, L4 protocol identifier, and
> >> address 4-tuple. Context can be further extended to include ingress
> >> interface identifier.
> >>
> >> To select a socket BPF program fetches it from a map holding socket
> >> references, like SOCKMAP or SOCKHASH, and calls bpf_sk_assign(ctx, sk, ...)
> >> helper to record the selection. Transport layer then uses the selected
> >> socket as a result of socket lookup.
> >>
> >> This patch only enables the user to attach an SK_LOOKUP program to a
> >> network namespace. Subsequent patches hook it up to run on local delivery
> >> path in ipv4 and ipv6 stacks.
> >>
> >> Suggested-by: Marek Majkowski <marek@xxxxxxxxxxxxxx>
> >> Signed-off-by: Jakub Sitnicki <jakub@xxxxxxxxxxxxxx>
> >> ---
> >>
> >> Notes:
> >>     v3:
> >>     - Allow bpf_sk_assign helper to replace previously selected socket only
> >>       when BPF_SK_LOOKUP_F_REPLACE flag is set, as a precaution for multiple
> >>       programs running in series to accidentally override each other's verdict.
> >>     - Let BPF program decide that load-balancing within a reuseport socket group
> >>       should be skipped for the socket selected with bpf_sk_assign() by passing
> >>       BPF_SK_LOOKUP_F_NO_REUSEPORT flag. (Martin)
> >>     - Extend struct bpf_sk_lookup program context with an 'sk' field containing
> >>       the selected socket with an intention for multiple attached program
> >>       running in series to see each other's choices. However, currently the
> >>       verifier doesn't allow checking if pointer is set.
> >>     - Use bpf-netns infra for link-based multi-program attachment. (Alexei)
> >>     - Get rid of macros in convert_ctx_access to make it easier to read.
> >>     - Disallow 1-,2-byte access to context fields containing IP addresses.
> >>
> >>     v2:
> >>     - Make bpf_sk_assign reject sockets that don't use RCU freeing.
> >>       Update bpf_sk_assign docs accordingly. (Martin)
> >>     - Change bpf_sk_assign proto to take PTR_TO_SOCKET as argument. (Martin)
> >>     - Fix broken build when CONFIG_INET is not selected. (Martin)
> >>     - Rename bpf_sk_lookup{} src_/dst_* fields remote_/local_*. (Martin)
> >>     - Enforce BPF_SK_LOOKUP attach point on load & attach. (Martin)
> >>
> >>  include/linux/bpf-netns.h  |   3 +
> >>  include/linux/bpf_types.h  |   2 +
> >>  include/linux/filter.h     |  19 ++++
> >>  include/uapi/linux/bpf.h   |  74 +++++++++++++++
> >>  kernel/bpf/net_namespace.c |   5 +
> >>  kernel/bpf/syscall.c       |   9 ++
> >>  net/core/filter.c          | 186 +++++++++++++++++++++++++++++++++++++
> >>  scripts/bpf_helpers_doc.py |   9 +-
> >>  8 files changed, 306 insertions(+), 1 deletion(-)
> >>
>
> [...]
>
> >> +
> >> +static u32 sk_lookup_convert_ctx_access(enum bpf_access_type type,
> >> +                                       const struct bpf_insn *si,
> >> +                                       struct bpf_insn *insn_buf,
> >> +                                       struct bpf_prog *prog,
> >> +                                       u32 *target_size)
> >
> > Would it be too extreme to rely on BTF and direct memory access
> > (similar to tp_raw, fentry/fexit, etc) for accessing context fields,
> > instead of all this assembly rewrites? So instead of having
> > bpf_sk_lookup and bpf_sk_lookup_kern, it will always be a full variant
> > (bpf_sk_lookup_kern, or however we'd want to name it then) and
> > verifier will just ensure that direct memory reads go to the right
> > field boundaries?
>
> Sounds like a decision related to long-term vision. I'd appreciate input
> from maintainers if this is the direction we want to go in.
>
> From implementation PoV - hard for me to say what would be needed to get
> it working, I'm not familiar how BPF_TRACE_* attach types provide access
> to context, so I'd need to look around and prototype it
> first. (Actually, I'm not sure if you're asking if it is doable or you
> already know?)

I'm pretty sure it's doable with what we have in verifier, but I'm not
sure about all the details and amount of work. So consider this an
initiation of a medium-term discussion. I was also curious to hear an
opinion from Alexei and Daniel whether that's would be the right way
to do this moving forward (not necessarily with your changes, though).

>
> Off the top of my head, I have one concern, I'm exposing the selected
> socket in the context. This is for the benefit of one program being
> aware of other program's selection, if multiple programs are attached.
>
> I understand that any piece of data reachable from struct sock *, would
> be readable by SK_LOOKUP prog (writes can be blocked in
> is_valid_access). And that this is a desired property for tracing. Not
> sure how to limit it for a network program that doesn't need all that
> info.
>
> >
> >> +{
> >> +       struct bpf_insn *insn = insn_buf;
> >> +#if IS_ENABLED(CONFIG_IPV6)
> >> +       int off;
> >> +#endif
> >> +
> >
> > [...]



[Index of Archives]     [Linux Samsung SoC]     [Linux Rockchip SoC]     [Linux Actions SoC]     [Linux for Synopsys ARC Processors]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]


  Powered by Linux