Re: [PATCH net-next v3 10/14] net-timestamp: add basic support with tskey offset

Jason Xing <kerneljasonxing@xxxxxxxxx> · Tue, 29 Oct 2024 10:41:10 +0800

On Tue, Oct 29, 2024 at 9:24 AM Willem de Bruijn
<willemdebruijn.kernel@xxxxxxxxx> wrote:
>
> Jason Xing wrote:
> > From: Jason Xing <kernelxing@xxxxxxxxxxx>
> >
> > Use the offset to record the delta value between current socket key
> > and bpf socket key.
> >
> > 1. If there is only bpf feature running, the socket key is bpf socket
> > key and the offset is zero;
> > 2. If there is only traditional feature running, and then bpf feature
> > is turned on, the socket key is still used by the former while the offset
> > is the delta between them;
> > 3. if there is only bpf feature running, and then application uses it,
> > the socket key would be re-init for application and the offset is the
> > delta.
>
> We need to also figure out the rare conflict when one user sets
> OPT_ID | OPT_ID_TCP while the other only uses OPT_ID.

I think the current patch handles the case because:
1. sock_calculate_tskey_offset() gets the final key first whether the
OPT_ID_TCP is set or not.
2. we will use that tskey to calculate the delta.

>
> It is so obscure, that perhaps we can punt and say that the BPF
> program just has to follow the application preference and be aware of
> the subtle difference.

Right.

>
> > Signed-off-by: Jason Xing <kernelxing@xxxxxxxxxxx>
> > ---
> >  include/net/sock.h |  1 +
> >  net/core/skbuff.c  | 15 ++++++++---
> >  net/core/sock.c    | 66 ++++++++++++++++++++++++++++++++++++++--------
> >  3 files changed, 68 insertions(+), 14 deletions(-)
> >
> > diff --git a/include/net/sock.h b/include/net/sock.h
> > index 91398b20a4a3..41c6c6f78e55 100644
> > --- a/include/net/sock.h
> > +++ b/include/net/sock.h
> > @@ -469,6 +469,7 @@ struct sock {
> >       unsigned long           sk_pacing_rate; /* bytes per second */
> >       atomic_t                sk_zckey;
> >       atomic_t                sk_tskey;
> > +     u32                     sk_tskey_bpf_offset;
> >       __cacheline_group_end(sock_write_tx);
> >
> >       __cacheline_group_begin(sock_read_tx);
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index 0b571306f7ea..d1739317b97d 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -5641,9 +5641,10 @@ void timestamp_call_bpf(struct sock *sk, int op, u32 nargs, u32 *args)
> >  }
> >
> >  static void skb_tstamp_tx_output_bpf(struct sock *sk, int tstype,
> > +                                  struct sk_buff *skb,
> >                                    struct skb_shared_hwtstamps *hwtstamps)
> >  {
> > -     u32 args[2] = {0, 0};
> > +     u32 args[3] = {0, 0, 0};
> >       u32 tsflags, cb_flag;
> >
> >       tsflags = READ_ONCE(sk->sk_tsflags_bpf);
> > @@ -5672,7 +5673,15 @@ static void skb_tstamp_tx_output_bpf(struct sock *sk, int tstype,
> >               args[1] = ts.tv_nsec;
> >       }
> >
> > -     timestamp_call_bpf(sk, cb_flag, 2, args);
> > +     if (tsflags & SOF_TIMESTAMPING_OPT_ID) {
> > +             args[2] = skb_shinfo(skb)->tskey;
> > +             if (sk_is_tcp(sk))
> > +                     args[2] -= atomic_read(&sk->sk_tskey);
> > +             if (sk->sk_tskey_bpf_offset)
> > +                     args[2] += sk->sk_tskey_bpf_offset;
> > +     }
> > +
> > +     timestamp_call_bpf(sk, cb_flag, 3, args);
>
>
> So the BPF interface is effectively OPT_TSONLY: the packet data is
> never shared.
>
> Then OPT_ID should be mandatory, because it without it the data is
> not actionable: which byte in the bytestream or packet in the case
> of datagram sockets does a callback refer to.

It does make sense, I think I will implement it when bpf_setsockopt() is called.

>
> > +/* Used to track the tskey for bpf extension
> > + *
> > + * @sk_tskey: bpf extension can use it only when no application uses.
> > + *            Application can use it directly regardless of bpf extension.
> > + *
> > + * There are three strategies:
> > + * 1) If we've already set through setsockopt() and here we're going to set
> > + *    OPT_ID for bpf use, we will not re-initialize the @sk_tskey and will
> > + *    keep the record of delta between the current "key" and previous key.
> > + * 2) If we've already set through bpf_setsockopt() and here we're going to
> > + *    set for application use, we will record the delta first and then
> > + *    override/initialize the @sk_tskey.
> > + * 3) other cases, which means only either of them takes effect, so initialize
> > + *    everything simplely.
> > + */
>
> Please explain in the commit message that these gymnastics are needed
> because there can only be one tskey in skb_shared_info.

No problem.

>
> > +static long int sock_calculate_tskey_offset(struct sock *sk, int val, int bpf_type)
> > +{
> > +     u32 tskey;
> > +
> > +     if (sk_is_tcp(sk)) {
> > +             if ((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_LISTEN))
> > +                     return -EINVAL;
> > +
> > +             if (val & SOF_TIMESTAMPING_OPT_ID_TCP)
> > +                     tskey = tcp_sk(sk)->write_seq;
> > +             else
> > +                     tskey = tcp_sk(sk)->snd_una;
> > +     } else {
> > +             tskey = 0;
> > +     }
> > +
> > +     if (bpf_type && (sk->sk_tsflags & SOF_TIMESTAMPING_OPT_ID)) {
> > +             sk->sk_tskey_bpf_offset = tskey - atomic_read(&sk->sk_tskey);
> > +             return 0;
> > +     } else if (!bpf_type && (sk->sk_tsflags_bpf & SOF_TIMESTAMPING_OPT_ID)) {
> > +             sk->sk_tskey_bpf_offset = atomic_read(&sk->sk_tskey) - tskey;
> > +     } else {
> > +             sk->sk_tskey_bpf_offset = 0;
> > +     }
> > +
> > +     return tskey;
> > +}
> > +
> >  int sock_set_tskey(struct sock *sk, int val, int bpf_type)
> >  {
> >       u32 tsflags = bpf_type ? sk->sk_tsflags_bpf : sk->sk_tsflags;
> > @@ -901,17 +944,13 @@ int sock_set_tskey(struct sock *sk, int val, int bpf_type)
> >
> >       if (val & SOF_TIMESTAMPING_OPT_ID &&
> >           !(tsflags & SOF_TIMESTAMPING_OPT_ID)) {
> > -             if (sk_is_tcp(sk)) {
> > -                     if ((1 << sk->sk_state) &
> > -                         (TCPF_CLOSE | TCPF_LISTEN))
> > -                             return -EINVAL;
> > -                     if (val & SOF_TIMESTAMPING_OPT_ID_TCP)
> > -                             atomic_set(&sk->sk_tskey, tcp_sk(sk)->write_seq);
> > -                     else
> > -                             atomic_set(&sk->sk_tskey, tcp_sk(sk)->snd_una);
> > -             } else {
> > -                     atomic_set(&sk->sk_tskey, 0);
> > -             }
> > +             long int ret;
> > +
> > +             ret = sock_calculate_tskey_offset(sk, val, bpf_type);
> > +             if (ret <= 0)
> > +                     return ret;
> > +
> > +             atomic_set(&sk->sk_tskey, ret);
> >       }
> >
> >       return 0;
> > @@ -956,10 +995,15 @@ static int sock_set_timestamping_bpf(struct sock *sk,
> >                                    struct so_timestamping timestamping)
> >  {
> >       u32 flags = timestamping.flags;
> > +     int ret;
> >
> >       if (flags & ~SOF_TIMESTAMPING_BPF_SUPPPORTED_MASK)
> >               return -EINVAL;
> >
> > +     ret = sock_set_tskey(sk, flags, 1);
> > +     if (ret)
> > +             return ret;
> > +
> >       WRITE_ONCE(sk->sk_tsflags_bpf, flags);
> >
> >       return 0;
>
> I'm a bit hazy on when this can be called. We can assume that this new
> BPF operation cannot race with the existing setsockopt nor with the
> datapath that might touch the atomic fields, right?

It surely can race with the existing setsockopt.

1)
if (only existing setsockopt works) {
        then sk->sk_tskey is set through setsockopt, sk_tskey_bpf_offset is 0.
}

2)
if (only bpf setsockopt works) {
        then sk->sk_tskey is set through bpf_setsockopt,
sk_tskey_bpf_offset is 0.
}

3)
if (existing setsockopt already started, here we enable the bpf feature) {
        then sk->sk_tskey will not change, but the sk_tskey_bpf_offset
will be calculated.
}

4)
if (bpf setsockopt already started, here we enable the application feature) {
        then sk->sk_tskey will re-initialized/overridden by
setsockopt, and the sk_tskey_bpf_offset will be calculated.
}

Then the skb tskey will use the sk->sk_tskey like before.

At last, when we are about to print in bpf extension if we're allowed
(by testing the sk_tsflags_bpf), we only need to check if
sk_tskey_bpf_offset is zero or not. If the value is zero, it means
only the bpf program runs; if not, it means the sk->sk_tskey servers
for application feature, we need to compute the real bpf tskey. Please
see skb_tstamp_tx_output_bpf().

Above makes sure that two features can work parallelly. It's honestly
a little bit complicated. Before writing this part, I drew a few
pictures to help me understand how it works.

Thanks,
Jason