On Thu, Mar 03, 2022 at 02:00:37PM +0100, Daniel Borkmann wrote:
> On 3/2/22 8:56 PM, Martin KaFai Lau wrote:
> > If the tc-bpf@egress writes 0 to skb->tstamp, the skb->mono_delivery_time
> > has to be cleared also. It could be done together during
> > convert_ctx_access(). However, the latter patch will also expose
> > the skb->mono_delivery_time bit as __sk_buff->delivery_time_type.
> > Changing the delivery_time_type in the background may surprise
> > the user, e.g. the 2nd read on __sk_buff->delivery_time_type
> > may need a READ_ONCE() to avoid compiler optimization. Thus,
> > in expecting the needs in the latter patch, this patch does a
> > check on !skb->tstamp after running the tc-bpf and clears the
> > skb->mono_delivery_time bit if needed. The earlier discussion
> > on v4 [0].

[ ... ]

> > @@ -1047,10 +1047,16 @@ struct sk_buff {
> >  /* if you move pkt_vlan_present around you also must adapt these constants */
> >  #ifdef __BIG_ENDIAN_BITFIELD
> >  #define PKT_VLAN_PRESENT_BIT	7
> > +#define TC_AT_INGRESS_MASK		(1 << 0)
> > +#define SKB_MONO_DELIVERY_TIME_MASK	(1 << 2)
> >  #else
> >  #define PKT_VLAN_PRESENT_BIT	0
> > +#define TC_AT_INGRESS_MASK		(1 << 7)
> > +#define SKB_MONO_DELIVERY_TIME_MASK	(1 << 5)
> >  #endif
> >  #define PKT_VLAN_PRESENT_OFFSET	offsetof(struct sk_buff, __pkt_vlan_present_offset)
> > +#define TC_AT_INGRESS_OFFSET offsetof(struct sk_buff, __pkt_vlan_present_offset)
> > +#define SKB_MONO_DELIVERY_TIME_OFFSET offsetof(struct sk_buff, __pkt_vlan_present_offset)
>
> Just nit, but given PKT_VLAN_PRESENT_OFFSET, TC_AT_INGRESS_OFFSET and SKB_MONO_DELIVERY_TIME_OFFSET
> are all the same offsetof(struct sk_buff, __pkt_vlan_present_offset), maybe lets use just one single
> define? If anyone moves them out, they would have to adopt as per comment.

Makes sense. I will update the comment, remove these two defines
and reuse PKT_VLAN_PRESENT_OFFSET.
Considering it is more bpf insn rewrite specific, I will do it as
a follow-up in filter.c and skbuff.h at bpf-next.

> >  #ifdef __KERNEL__
> >  /*
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index cfcf9b4d1ec2..5072733743e9 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -8859,6 +8859,65 @@ static struct bpf_insn *bpf_convert_shinfo_access(const struct bpf_insn *si,
> >  	return insn;
> >  }
> > +static struct bpf_insn *bpf_convert_tstamp_read(const struct bpf_insn *si,
> > +						struct bpf_insn *insn)
> > +{
> > +	__u8 value_reg = si->dst_reg;
> > +	__u8 skb_reg = si->src_reg;
> > +
> > +#ifdef CONFIG_NET_CLS_ACT
> > +	__u8 tmp_reg = BPF_REG_AX;
> > +
> > +	*insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg, TC_AT_INGRESS_OFFSET);
> > +	*insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg, TC_AT_INGRESS_MASK);
>
> nit: As far as I can see, can't si->dst_reg be used instead of AX?

Ah. This one got me also when using dst_reg as a tmp.
dst_reg and src_reg can be the same register, so the skb pointer
would be overwritten before the later loads:

;	skb->tstamp == EGRESS_FWDNS_MAGIC)
169:	r1 = *(u64 *)(r1 + 152)

>
> > +	*insn++ = BPF_JMP32_IMM(BPF_JEQ, tmp_reg, 0, 5);
> > +	/* @ingress, read __sk_buff->tstamp as the (rcv) timestamp,
> > +	 * so check the skb->mono_delivery_time.
> > +	 */
> > +	*insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg,
> > +			      SKB_MONO_DELIVERY_TIME_OFFSET);
> > +	*insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg,
> > +				SKB_MONO_DELIVERY_TIME_MASK);
> > +	*insn++ = BPF_JMP32_IMM(BPF_JEQ, tmp_reg, 0, 2);
> > +	/* skb->mono_delivery_time is set, read 0 as the (rcv) timestamp. */
> > +	*insn++ = BPF_MOV64_IMM(value_reg, 0);
> > +	*insn++ = BPF_JMP_A(1);
> > +#endif
> > +
> > +	*insn++ = BPF_LDX_MEM(BPF_DW, value_reg, skb_reg,
> > +			      offsetof(struct sk_buff, tstamp));
> > +	return insn;
> > +}
> > +
> > +static struct bpf_insn *bpf_convert_tstamp_write(const struct bpf_insn *si,
> > +						 struct bpf_insn *insn)
> > +{
> > +	__u8 value_reg = si->src_reg;
> > +	__u8 skb_reg = si->dst_reg;
> > +
> > +#ifdef CONFIG_NET_CLS_ACT
> > +	__u8 tmp_reg = BPF_REG_AX;
> > +
> > +	*insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg, TC_AT_INGRESS_OFFSET);
> > +	*insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg, TC_AT_INGRESS_MASK);
>
> Can't we get rid of tcf_bpf_act() and cls_bpf_classify() changes altogether by just doing:
>
>   /* BPF_WRITE: __sk_buff->tstamp = a */
>   skb->mono_delivery_time = !skb->tc_at_ingress && a;
>   skb->tstamp = a;

It will then assume the bpf prog is writing a mono time.
Although mono should always be the case now, this assumption will be
an issue in the future if we need to support non-mono.

>
> (Untested) pseudo code:
>
>   // or see comment on common SKB_FLAGS_OFFSET define or such
>   BUILD_BUG_ON(TC_AT_INGRESS_OFFSET != SKB_MONO_DELIVERY_TIME_OFFSET)
>
>   BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg, SKB_MONO_DELIVERY_TIME_OFFSET)
>   BPF_ALU32_IMM(BPF_OR, tmp_reg, SKB_MONO_DELIVERY_TIME_MASK)
>   BPF_JMP32_IMM(BPF_JSET, tmp_reg, TC_AT_INGRESS_MASK, <clear>)

This can save a BPF_ALU32_IMM(BPF_AND). I will do that
together in the follow-up. Thanks for the idea!

>   BPF_JMP32_REG(BPF_JGE, value_reg, tmp_reg, <store>)
> <clear>:
>   BPF_ALU32_IMM(BPF_AND, tmp_reg, ~SKB_MONO_DELIVERY_TIME_MASK)
> <store>:
>   BPF_STX_MEM(BPF_B, skb_reg, tmp_reg, SKB_MONO_DELIVERY_TIME_OFFSET)
>   BPF_STX_MEM(BPF_DW, skb_reg, value_reg, offsetof(struct sk_buff, tstamp))
>
> (There's a small hack with the BPF_JGE for tmp_reg, so constant blinding for AX doesn't
> get into our way.)
>
> > +	*insn++ = BPF_JMP32_IMM(BPF_JEQ, tmp_reg, 0, 3);
> > +	/* Writing __sk_buff->tstamp at ingress as the (rcv) timestamp.
> > +	 * Clear the skb->mono_delivery_time.
> > +	 */
> > +	*insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg,
> > +			      SKB_MONO_DELIVERY_TIME_OFFSET);
> > +	*insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg,
> > +				~SKB_MONO_DELIVERY_TIME_MASK);
> > +	*insn++ = BPF_STX_MEM(BPF_B, skb_reg, tmp_reg,
> > +			      SKB_MONO_DELIVERY_TIME_OFFSET);
> > +#endif
> > +
> > +	/* skb->tstamp = tstamp */
> > +	*insn++ = BPF_STX_MEM(BPF_DW, skb_reg, value_reg,
> > +			      offsetof(struct sk_buff, tstamp));
> > +	return insn;
> > +}
> > +
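
To make the non-mono concern concrete, here is a rough user-space sketch of the two write semantics discussed in this thread. It is not kernel code: struct model_skb and all the model_* helpers are names made up for illustration, and the bitfield and insn rewrite details are reduced to plain C on three fields. It assumes the !skb->tstamp check runs after the prog regardless of direction, which is how I read the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the three sk_buff fields discussed above. */
struct model_skb {
	uint64_t tstamp;
	bool tc_at_ingress;
	bool mono_delivery_time;
};

/* Models the rewritten load: at ingress a mono delivery time reads as 0. */
static uint64_t model_tstamp_read(const struct model_skb *skb)
{
	if (skb->tc_at_ingress && skb->mono_delivery_time)
		return 0;
	return skb->tstamp;
}

/* Models the rewritten store in this patch: only ingress clears the bit. */
static void model_tstamp_write(struct model_skb *skb, uint64_t a)
{
	if (skb->tc_at_ingress)
		skb->mono_delivery_time = false;
	skb->tstamp = a;
}

/* Models the !skb->tstamp check done after the tc-bpf prog has run. */
static void model_post_run_fixup(struct model_skb *skb)
{
	if (!skb->tstamp)
		skb->mono_delivery_time = false;
}

/* Models the suggested alternative: the store itself derives the bit. */
static void model_tstamp_write_alt(struct model_skb *skb, uint64_t a)
{
	skb->mono_delivery_time = !skb->tc_at_ingress && a;
	skb->tstamp = a;
}

/* Egress skb whose tstamp is not mono; the prog stores a non-zero value. */
static void model_demo(void)
{
	struct model_skb p = { .tstamp = 5, .tc_at_ingress = false,
			       .mono_delivery_time = false };
	struct model_skb q = p;

	model_tstamp_write(&p, 1000);
	model_post_run_fixup(&p);
	model_tstamp_write_alt(&q, 1000);

	assert(!p.mono_delivery_time);	/* patch: bit left as it was */
	assert(q.mono_delivery_time);	/* alternative: bit now claims mono */
	assert(model_tstamp_read(&p) == 1000);
	assert(model_tstamp_read(&q) == 1000);
}
```

Calling model_demo() from a main() is enough to run it. Both variants store the same tstamp at egress, but only the alternative flips the bit to "mono" on a non-zero write, which is the assumption that would get in the way if a non-mono delivery time is ever supported.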