On Tue, Dec 5, 2023 at 3:11 AM Xuan Zhuo <xuanzhuo@xxxxxxxxxxxxxxxxx> wrote:
>
> On Mon, 4 Dec 2023 13:28:21 +0100, Eric Dumazet <edumazet@xxxxxxxxxx> wrote:
> > On Mon, Dec 4, 2023 at 12:43 PM Philo Lu <lulie@xxxxxxxxxxxxxxxxx> wrote:
> > >
> > > Add 3 tracepoints, namely tcp_data_send/tcp_data_recv/tcp_data_acked,
> > > which will be called every time a tcp data packet is sent, received,
> > > and acked.
> > > tcp_data_send: called after a data packet is sent.
> > > tcp_data_recv: called after a data packet is received.
> > > tcp_data_acked: called after a valid ack packet is processed (some
> > > sent data are acknowledged).
> > >
> > > We use these callbacks for fine-grained tcp monitoring, which collects
> > > and analyzes information about every tcp request/response event. The
> > > whole system has been described in SIGMOD'18 (see
> > > https://dl.acm.org/doi/pdf/10.1145/3183713.3190659 for details). To
> > > achieve this with bpf, we require hooks for data events that call a
> > > bpf prog (1) when any data packet is sent/received/acked, and (2)
> > > after critical tcp state variables have been updated (e.g., snd_una,
> > > snd_nxt, rcv_nxt). However, existing bpf hooks cannot meet our
> > > requirements. Besides, these tracepoints help to debug tcp when data
> > > is sent/received/acked.
> >
> > This I do not understand.
> >
> > > Though kretprobe/fexit can also be used to collect this information,
> > > they will not work if the kernel functions get inlined. Considering
> > > stability, we prefer tracepoints as the solution.
> >
> > I dunno, this seems quite weak to me. I see many patches coming to add
> > tracing in the stack, but no patches fixing any issues.
>
> We have implemented a mechanism to split the requests and responses from
> a TCP connection using these hooks, which can handle various protocols
> such as HTTP, HTTPS, Redis, and MySQL.
> This mechanism allows us to record important information about each
> request and response, including the amount of data uploaded, the time
> taken by the server to handle the request, and the time taken for the
> client to receive the response. This mechanism has been running
> internally for many years and has proven to be very useful.
>
> One of the main benefits of this mechanism is that it helps in locating
> the source of any issues or problems that may arise. For example, if
> there is a problem with the network, the application, or the machine, we
> can use this mechanism to identify and isolate the issue.
>
> TCP has long been a challenge when it comes to tracking the transmission
> of data on the network. The application can only confirm that it has
> sent a certain amount of data to the kernel, but it has limited
> visibility into whether the client has actually received this data. Our
> mechanism addresses this issue by providing insights into the amount of
> data received by the client and the time it was received. Furthermore,
> we can also detect any packet loss or delays caused by the server.
>
> https://help-static-aliyun-doc.aliyuncs.com/assets/img/zh-CN/7912288961/9732df025beny.svg
>
> So, we do not want to add some tracepoints for some unknown debugging.
> We have a clear goal; debugging is just an incidental capability.

We have powerful mechanisms in the stack already that ordinary (no
privilege required) applications can readily use. We have been using them
for a while.

If existing mechanisms are missing something you need, please expand
them. For reference, start looking at tcp_get_timestamping_opt_stats()
history.

Sender side can for instance get precise timestamps.
Combinations of these timestamps reveal different parts of the overall
network latency:

T0: sendmsg() enters TCP
T1: first byte enters qdisc
T2: first byte sent to the NIC
T3: first byte ACKed in TCP
T4: last byte sent to the NIC
T5: last byte ACKed

T1 - T0: how long the first byte was blocked in the TCP layer ("Head of
Line Blocking" latency).
T2 - T1: how long the first byte was blocked in the Linux traffic shaping
layer (known as QDisc).
T3 - T2: the network 'distance' (propagation delay + current queuing
delay along the network path and at the receiver).
T5 - T2: how fast the sent chunk was delivered.
Message Size / (T5 - T0): goodput (from the application's perspective).