Re: How to limit TCP packet lengths given to TC egress EBPF programs?

Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> · Thu, 15 Jul 2021 17:14:45 -0700

On Thu, Jul 15, 2021 at 4:26 PM Sandesh Dhawaskar Sathyanarayana
<Sandesh.DhawaskarSathyanarayana@xxxxxxxxxxxx> wrote:
>
> Hi,

Please do not top post and do not use html in your replies.

> I tested carrying the INT headers in the new experimental TCP header options, but this does not work.
> The programmable switch looks for the INT header immediately after the 20-byte TCP header. If it finds one, it parses the INT field in TCP and appends its own INT data; otherwise it inserts its own INT header and data after the 20 bytes and, if any TCP option is present, places that option after the INT data.
> If the end host uses the TCP options field to carry the INT fields, the switch treats the TCP header options as INT and appends only its data. Because the switch has consumed the TCP option as INT data, it finds no TCP options to place after its own INT data, and as a result the packets are dropped in the switch.
>
> Hence we need a new way to create INT header space in the TCP kernel stack itself.
>
>
> Here is what I did:
>
> 1. Reserved the space in the TCP header option using BPF_SOCK_OPS_HDR_OPT_LEN_CB.
> 2. Used a TC eBPF program at egress to write the INT header into this field.

Hard to guess without looking at the actual code,
but it sounds like you called bpf_reserve_hdr_opt() in a sockops program,
but didn't call bpf_store_hdr_opt() in BPF_SOCK_OPS_WRITE_HDR_OPT_CB,
and instead tried to write the option in the TC layer?
That won't work, of course.
Please see the progs/test_tcp_hdr_options.c example.

cc-ing Martin for further questions.
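
As a rough illustration of the length accounting involved (plain userspace C
rather than BPF, and only a sketch): a TCP option has to be counted in the
header's data offset field, which is expressed in 32-bit words, so the option
is padded to a multiple of 4 bytes and all options together cannot exceed
40 bytes. The helper names and the 0x1234 magic below are made up for the
example; kind 254 is one of the experimental option kinds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TCP_BASE_HDR_LEN   20   /* TCP header without options */
#define TCP_MAX_OPT_LEN    40   /* data offset is 4 bits: 15 * 4 - 20 */
#define TCPOPT_NOP          1   /* one-byte padding option */
#define TCPOPT_EXP        254   /* experimental option kind */
#define INT_OPT_MAGIC  0x1234   /* hypothetical magic, not a registered value */

/*
 * Write kind, length, 16-bit magic and payload into buf, then pad with NOPs
 * to a 4-byte boundary. Returns the padded length, or 0 if it does not fit.
 */
static size_t build_exp_opt(uint8_t *buf, size_t buf_len,
                            const uint8_t *payload, size_t payload_len)
{
    size_t opt_len, padded;

    if (payload_len > TCP_MAX_OPT_LEN - 4)
        return 0;

    opt_len = 4 + payload_len;              /* kind + len + 2-byte magic */
    padded = (opt_len + 3) & ~(size_t)3;    /* round up to 4 bytes */
    if (padded > buf_len)
        return 0;

    buf[0] = TCPOPT_EXP;
    buf[1] = (uint8_t)opt_len;              /* length excludes the padding */
    buf[2] = INT_OPT_MAGIC >> 8;
    buf[3] = INT_OPT_MAGIC & 0xff;
    memcpy(buf + 4, payload, payload_len);
    memset(buf + opt_len, TCPOPT_NOP, padded - opt_len);
    return padded;
}

/* Data offset field: TCP header length in 32-bit words, options included. */
static unsigned int tcp_data_offset(size_t opt_bytes)
{
    assert(opt_bytes % 4 == 0 && opt_bytes <= TCP_MAX_OPT_LEN);
    return (unsigned int)((TCP_BASE_HDR_LEN + opt_bytes) / 4);
}
```

For a 6-byte payload this gives a 10-byte option padded to 12 bytes and a
data offset of 8 words, i.e. 32 bytes of TCP header.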

> But these packets get dropped at the switch as the TCP length doesn't match.
>
> Another challenge with appending the INT data in the end host at TC-eBPF (currently no support for TCP) is that it breaks the TCP SYN and ACK mechanism, resulting in spurious retransmissions, because the kernel is not aware of the data appended by TC-eBPF at egress.
>
> If anyone has suggestions please do let me know. Currently I am just thinking of creating the space in the kernel TCP stack itself when sk_buff is allocated.
>
> Thanks
> Sandesh
>
> On Tue, Jul 13, 2021 at 5:52 PM Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote:
>>
>> On Fri, Jul 9, 2021 at 11:40 AM Fingerhut, John Andy
>> <john.andy.fingerhut@xxxxxxxxx> wrote:
>> >
>> > Greetings:
>> >
>> > I am working on a project that runs an EBPF program on the Linux
>> > Traffic Control egress hook, which modifies selected packets to add
>> > headers to them that we use for some network telemetry.
>> >
>> > I know that this is _not_ what one wants to do to get maximum TCP
>> > performance, but at least for development purposes I was hoping to
>> > find a way to limit the length of all TCP packets that are processed
>> > by this EBPF program to be at most one MTU.
>> >
>> > Towards that goal, we have tried several things, but regardless of
>> > which subset of the following things we have tried, there are some
>> > packets processed by our EBPF program that have IPv4 Total Length
>> > field that is some multiple of the MSS size, sometimes nearly 64
>> > KBytes.  If it makes a difference in configuration options available,
>> > we have primarily been testing with Ubuntu 20.04 Linux running the
>> > Linux kernel versions near 5.8.0-50-generic distributed by Canonical.
>> >
>> > Disable TSO and GSO on the network interface:
>> >
>> >     ethtool -K enp0s8 tso off gso off
>> >
>> > Configuring TCP MSS using 'ip route' command:
>> >
>> >     ip route change 10.0.3.0/24 dev enp0s8 advmss 1424
>> >
>> > The last command _does_ have some effect, in that many packets
>> > processed by our EBPF program have a length affected by that advmss
>> > value, but we still see many packets that are about twice as large,
>> > about three times as large, etc., which fit into that MSS after being
>> > segmented, I believe in the kernel GSO code.
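
The sizes described above fit a buffer that carries several segments' worth
of payload and is split later. As a sketch of that arithmetic only (plain
userspace C, not BPF; the function name and the header sizes passed in are
assumptions for the example), the number of MSS-sized segments in one such
packet can be estimated from the IPv4 Total Length:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of MSS-sized segments needed to carry the TCP payload of a packet
 * with the given IPv4 Total Length. Returns -1 if the lengths are
 * inconsistent, 0 for a packet with no payload.
 */
static int tcp_segments(uint16_t ip_total_len, unsigned int ip_hdr_len,
                        unsigned int tcp_hdr_len, unsigned int mss)
{
    unsigned int hdrs = ip_hdr_len + tcp_hdr_len;
    unsigned int payload;

    if (mss == 0 || ip_total_len < hdrs)
        return -1;

    payload = ip_total_len - hdrs;
    return (int)((payload + mss - 1) / mss);    /* round up */
}
```

With 20-byte IPv4 and 32-byte TCP headers and an MSS of 1424, a Total Length
of 4324 corresponds to three segments. Inside a TC program, recent kernels
should also expose the gso_segs and gso_size fields of struct __sk_buff,
which may be a more direct way to spot such packets.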
>> >
>> > Is there some other configuration option we can change that can
>> > guarantee that when a TCP packet is given to a TC egress EBPF program,
>> > it will always be at most a specified length?
>> >
>> >
>> > Background:
>> >
>> > Intel is developing and releasing some open source EBPF programs and
>> > associated user space programs that modify packets to add INT (Inband
>> > Network Telemetry) headers, which can be used for some kinds of
>> > performance debugging reasons, e.g. triggering events when packet
>> > losses are detected, or significant changes in one-way packet latency
>> > between two hosts configured to run this Host INT code.  See the
>> > project home page for more details if you are interested:
>> >
>> > https://github.com/intel/host-int
>>
>> I suspect MTU/MSS issue is only the tip of the iceberg.
>>
>> https://github.com/intel/host-int/blob/main/docs/Host_INT_fmt.md
>> That's an interesting design !
>> A few things should probably be addressed sooner rather than later:
>> "Host INT currently only supports adding INT headers to IPv4 packets."
>> To consider such a feature of Tofino switches, IPv6 has to be supported.
>> That shouldn't be hard to do, right?
>>
>> https://github.com/intel/host-int/blob/main/docs/host-int-project.pptx
>> That's a lot of bpf programs :)
>> Looks like in the bridge case (last slide) every incoming packet will
>> be processed by two XDP programs.
>> XDP is certainly fast, but it still adds overhead.
>> Not every packet will have such INT header so most of the packets will be
>> passing through XDP prog into the stack or from stack through TC egress program.
>> Such XDP ingress and TC egress progs will add overhead that might be
>> unacceptable in production deployment.
>> Have you considered using the new TCP header option instead?
>> https://lore.kernel.org/bpf/CAADnVQJ21Tt2HaJ5P4wbxBLVo1YT-PwN3bOHBQK+17reK5HxOg@xxxxxxxxxxxxxx/
>> A BPF prog can conditionally add it for a few packets/flows and another BPF prog
>> on the receive side will process such a header option.
>> The Tofino switch will then find packets with a special TCP header and fill them in
>> with telemetry data.
>> "INT report packets are sent as UDP datagrams" part of the design can stay.
>> Looks like you're reserving a UDP port for such a purpose, so no need
>> for the receive side to have an XDP program to process every packet.
>>
>> With the TCP header option approach the MTU issue will go away as well.
>>
>> > Note: The code published now is an alpha release.  We know there are
>> > bugs.  We know our development team is not what you would call EBPF
>> > experts (at least not yet), so feel free to point out bugs and/or
>> > anything that code is doing that might be a bad idea.
>>
>> Thank you for reaching out. We're here to help with your BPF/XDP needs :)
>>
>> > Thanks,
>> > Andy Fingerhut
>> > Principal Engineer
>> > Intel Corporation