On Wed, Feb 16, 2022 at 05:50:43PM -0800, Martin KaFai Lau wrote: > On Thu, Feb 17, 2022 at 01:03:21AM +0100, Daniel Borkmann wrote: > > On 2/16/22 6:51 AM, Martin KaFai Lau wrote: > > > On Wed, Feb 16, 2022 at 12:30:53AM +0100, Daniel Borkmann wrote: > > > > On 2/11/22 8:13 AM, Martin KaFai Lau wrote: > > > > > The current tc-bpf@ingress reads and writes the __sk_buff->tstamp > > > > > as a (rcv) timestamp. This patch is to backward compatible with the > > > > > (rcv) timestamp expectation when the skb->tstamp has a mono delivery_time. > > > > > > > > > > If needed, the patch first saves the mono delivery_time. Depending on > > > > > the static key "netstamp_needed_key", it then resets the skb->tstamp to > > > > > either 0 or ktime_get_real() before running the tc-bpf@ingress. After > > > > > the tc-bpf prog returns, if the (rcv) timestamp in skb->tstamp has not > > > > > been changed, it will restore the earlier saved mono delivery_time. > > > > > > > > > > The current logic to run tc-bpf@ingress is refactored to a new > > > > > bpf_prog_run_at_ingress() function and shared between cls_bpf and act_bpf. > > > > > The above new delivery_time save/restore logic is also done together in > > > > > this function. > > > > > > > > > > Signed-off-by: Martin KaFai Lau <kafai@xxxxxx> > > > > > --- > > > > > include/linux/filter.h | 28 ++++++++++++++++++++++++++++ > > > > > net/sched/act_bpf.c | 5 +---- > > > > > net/sched/cls_bpf.c | 6 +----- > > > > > 3 files changed, 30 insertions(+), 9 deletions(-) > > > > > > > > > > diff --git a/include/linux/filter.h b/include/linux/filter.h > > > > > index d23e999dc032..e43e1701a80e 100644 > > > > > --- a/include/linux/filter.h > > > > > +++ b/include/linux/filter.h > > > > > @@ -699,6 +699,34 @@ static inline void bpf_compute_data_pointers(struct sk_buff *skb) > > > > > cb->data_end = skb->data + skb_headlen(skb); > > > > > } > > > > > +static __always_inline u32 bpf_prog_run_at_ingress(const struct bpf_prog *prog, > > > > > + struct sk_buff *skb) > > > > > +{ > > > > > + ktime_t tstamp, saved_mono_dtime = 0; > > > > > + int filter_res; > > > > > + > > > > > + if (unlikely(skb->mono_delivery_time)) { > > > > > + saved_mono_dtime = skb->tstamp; > > > > > + skb->mono_delivery_time = 0; > > > > > + if (static_branch_unlikely(&netstamp_needed_key)) > > > > > + skb->tstamp = tstamp = ktime_get_real(); > > > > > + else > > > > > + skb->tstamp = tstamp = 0; > > > > > + } > > > > > + > > > > > + /* It is safe to push/pull even if skb_shared() */ > > > > > + __skb_push(skb, skb->mac_len); > > > > > + bpf_compute_data_pointers(skb); > > > > > + filter_res = bpf_prog_run(prog, skb); > > > > > + __skb_pull(skb, skb->mac_len); > > > > > + > > > > > + /* __sk_buff->tstamp was not changed, restore the delivery_time */ > > > > > + if (unlikely(saved_mono_dtime) && skb_tstamp(skb) == tstamp) > > > > > + skb_set_delivery_time(skb, saved_mono_dtime, true); > > > > > > > > So above detour is for skb->tstamp backwards compatibility so users will see real time. > > > > I don't see why we special case {cls,act}_bpf-only, given this will also be the case > > > > for other subsystems (e.g. netfilter) when they read access plain skb->tstamp and get > > > > the egress one instead of ktime_get_real() upon deferred skb_clear_delivery_time(). > > > > > > > > If we would generally ignore it, then the above bpf_prog_run_at_ingress() save/restore > > > > detour is not needed (so patch 5/6 should be dropped). (Meaning, if we need to special > > > > case {cls,act}_bpf only, we could also have gone for simpler bpf-only solution..) > > > The limitation here is there is only one skb->tstamp field. I don't see > > > a bpf-only solution or not will make a difference here. > > > > A BPF-only solution would probably just treat the skb->tstamp as (semi-)opaque, > > meaning, there're no further bits on clock type needed in skb, but given the > > environment is controlled by an orchestrator it can decide which tstamps to > > retain or which to reset (e.g. by looking at skb->sk). (The other approach is > > exposing info on clock base as done here to some degree for mono/real.) > hmm... I think we may be talking about different things. > > Using a bit or not still does not change the fact that > there is only one skb->tstamp field which may have a delivery > time or rcv tstamp. If the delivery time is reset before > forwarding to ingress or the delivery time was never there, then > it will be stamped with the rcv timestamp at ingress. > The bpf needs a way to distinguish between them. > skb->sk can at most tell the clock base if skb->tstamp > does indeed have the delivery_time. > > > > > > Regarding the netfilter (good point!), I only see it is used in nfnetlink_log.c > > > and nfnetlink_queue.c. Like the tapping cases (earlier than the bpf run-point) > > > and in general other ingress cases, it cannot assume the rcv timestamp is > > > always there, so they can be changed like af_packet in patch 3 > > > which is a straight forward change. I can make the change in v5. > > > > > > Going back to the cls_bpf at ingress. If the concern is on code cleanliness, > > > how about removing this dance for now while the current rcv tstamp usage is > > > unclear at ingress. Meaning keep the delivery_time (if any) in skb->tstamp. > > > This dance could be brought in later when there was breakage and legit usecase > > > reported. The new bpf prog will have to use the __sk_buff->delivery_time_type > > > regardless if it wants to use skb->tstamp as the delivery_time, so they won't > > > assume delivery_time is always in skb->tstamp and it will be fine even this > > > dance would be brought back in later. > > > > Yes, imho, this is still better than the bpf_prog_run_at_ingress() workaround. > ic. so it is ok to remove the mono dtime save/restore logic here and only brought > back in if there was legit breakage reported? Another idea on this which I think is a better middle ground solution to remove the dance here. When reading __sk_buff->tstamp, the verifier can do a rewrite if needed. The rewrite will depend on whether the __sk_buff->delivery_time_type has been read or not. If delivery_time_type is not read, it will rewrite to this: /* BPF_READ: __u64 a = __sk_buff->tstamp; */ if (!skb->tc_at_ingress || !skb->mono_delivery_time) a = skb->tstamp; else a = 0; That will be consistent with other kernel ingress path expectation (either 0 or rcv tstamp). If __sk_buff->delivery_time_type is read, no rewrite is needed and skb->tstamp will be read as is. > btw, did you look at patch 7 which added the __sk_buff->delivery_time_type > and bpf_set_delivery_time()? > > > Ideally, we know when we call helpers like ktime_get_ns() that the clock will > > be mono. We could track that on verifier side in the register type, and when we > > end up writing to skb->tstamp, we could implicitly also set the clock base bits > > in skb for the ctx rewrite telling that it's of type 'mono'. Same for reading, > > we could add __sk_buff->tstamp_type which program can access (imo tstamp_type > > is more generic than a __sk_buff->delivery_time_type). If someone needs > > ktime_get_clocktai_ns() for sch_etf in future, it could be similar tracking > > mechanism. Also setting skb->tstamp to 0 ... > hmm... I think it is talking about a way to automatically > update the __sk_buff->delivery_time_type (mono_delivery_time bit) and > also avoid adding the new bpf_set_delivery_time() helper in patch 7? > > It may have case that time is not always from helper ktime_get_ns() and > cannot be decided statically. e.g. what if we want to set the current > skb->tstamp based on when the previous skb was sent in a cgroup. There > will be cases coming up that require runtime decision. > > Also, imo, it may be a surprise behavior for the user who only > changed __skb_buff->tstamp but then figured out > __sk_buff->delivery_time_type is also changed in > the background. > > Beside, not sure if the compiler will optimize the 2nd read on > __sk_buff->delivery_time_type. The bpf may need a > READ_ONCE(__sk_buff->delivery_time_type) after writing __skb_buff->tstamp. > We can add volatile to the delivery_time_type in the UAPI but I think > it is overkill. > > It is better to treat tstamp and delivery_time_type separately > for direct access, and have the skb_set_delivery_time() to change > both of them together. Also, more checks can be done in > skb_set_delivery_time() which is more flexible to return > errors. > > For TCP, it will be already in mono, so skb_set_delivery_time() > is usually not needed. > > Regarding the name delivery_time_type vs tstamp_type, I thought about > that and finally picked the delivery_time_type because I want > a clear separation from rcv tstamp for now until there is more > clarity on how rcv tstamp is used in tc-bpf. > > > > Regarding patch 6, it is unrelated. It needs to clear the > > > mono_delivery_time bit if the bpf writes 0 to the skb->tstamp. > > > > ... doesn't need to be done as code after bpf_prog_run(), but should be brought > > closer to when we write to the ctx where verifier generates the relevant insns. > > Imo, that's better than having this outside in bpf_prog_run() which is then > > checked no matter what program was doing or even accessing tstamp. > This will also change the mono_delivery_time bit in the background > and will be similar to the above points.