On 11/16/22 2:08 AM, Toke Høiland-Jørgensen wrote:
Martin KaFai Lau <martin.lau@xxxxxxxxx> writes:
On 11/15/22 10:38 PM, John Fastabend wrote:
+static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id,
+			      struct bpf_patch *patch)
+{
+	if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) {
+		/* return true; */
+		bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1));
+	} else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) {
+		/* return ktime_get_mono_fast_ns(); */
+		bpf_patch_append(patch, BPF_EMIT_CALL(ktime_get_mono_fast_ns));
+	}
+}
So these look reasonable enough, but would be good to see some examples
of kfunc implementations that don't just BPF_CALL to a kernel function
(with those helper wrappers we were discussing before).
Let's maybe add them if/when needed as we add more metadata support?
xdp_metadata_export_to_skb has an example, and rfc 1/2 have more
examples, so it shouldn't be a problem to resurrect them at some
point?
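For reference, an unroll that emits insns directly instead of a BPF_EMIT_CALL
could look roughly like the sketch below (untested; it assumes R1 still holds
the ctx pointer at the kfunc call site and that the driver keeps a hypothetical
hw_tstamp field reachable at a fixed offset from it; the struct/field names are
made up for illustration):

	} else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) {
		/* return ((struct veth_xdp_buff *)ctx)->hw_tstamp; */
		bpf_patch_append(patch, BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_1,
						    offsetof(struct veth_xdp_buff, hw_tstamp)));
	}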
Well, the reason I asked for them is that I think having to maintain the
BPF code generation in the drivers is probably the biggest drawback of
the kfunc approach, so it would be good to be relatively sure that we
can manage that complexity (via helpers) before we commit to this :)
Right, and I've added a bunch of examples in v2 rfc so we can judge
whether that complexity is manageable or not :-)
Do you want me to add back those wrappers you suggested, even without any real
users? Because I had to remove my veth tstamp accessors due to John's/Jesper's
objections; I can maybe bring some of this back, gated by a static_branch to
avoid the fastpath cost?
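Roughly, the static_branch gating would look something like this (sketch only;
hypothetical key name, and only the relevant slice of the xmit path is shown):

	/* needs <linux/jump_label.h> */
	DEFINE_STATIC_KEY_FALSE(veth_rx_tstamp_requested);

	static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
	{
		/* Only pay for the timestamp when an attached prog asked for it. */
		if (static_branch_unlikely(&veth_rx_tstamp_requested))
			__net_timestamp(skb);

		/* ... rest of the existing veth_xmit() path unchanged ... */
		return NETDEV_TX_OK;
	}

The attach/detach path would then static_branch_inc()/static_branch_dec() the
key when a prog that uses the timestamp kfunc comes and goes, so the default
fast path stays a single static jump.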
I missed the context a bit; what did you mean by "would be good to see some
examples of kfunc implementations that don't just BPF_CALL to a kernel
function"? In this case do you mean emitting BPF code directly, without the call?
Early on I thought we should just expose the rx_descriptor, which would be
roughly the same, right? (The difference being code embedded in the driver vs.
a lib.) The trouble I ran into is driver code using seqlock_t and mutexes,
which wasn't as straightforward as simply reading a value from the descriptor.
For example, in the mlx4 driver, getting the ts would be easy from BPF with
the mlx4_cqe struct exposed:
u64 mlx4_en_get_cqe_ts(struct mlx4_cqe *cqe)
{
	u64 hi, lo;
	struct mlx4_ts_cqe *ts_cqe = (struct mlx4_ts_cqe *)cqe;

	lo = (u64)be16_to_cpu(ts_cqe->timestamp_lo);
	hi = ((u64)be32_to_cpu(ts_cqe->timestamp_hi) + !lo) << 16;

	return hi | lo;
}
but converting that to nsec is a bit annoying,
void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev,
			    struct skb_shared_hwtstamps *hwts,
			    u64 timestamp)
{
	unsigned int seq;
	u64 nsec;

	do {
		seq = read_seqbegin(&mdev->clock_lock);
		nsec = timecounter_cyc2time(&mdev->clock, timestamp);
	} while (read_seqretry(&mdev->clock_lock, seq));

	memset(hwts, 0, sizeof(struct skb_shared_hwtstamps));
	hwts->hwtstamp = ns_to_ktime(nsec);
}
I think the nsec is what you really want.
With all the drivers doing slightly different ops, we would have to create BPF
equivalents of read_seqbegin, read_seqretry, mutex_lock, ...; to cover at least
the mlx and ice drivers it looks like we would need some more BPF
primitives/helpers. It also looks like some more work is needed on the ice
driver to get rx tstamps on all packets.
Anyways this convinced me real devices will probably use BPF_CALL
and not BPF insns directly.
Some of the mlx5 path looks like this:
#define REAL_TIME_TO_NS(hi, low) (((u64)hi) * NSEC_PER_SEC + ((u64)low))

static inline ktime_t mlx5_real_time_cyc2time(struct mlx5_clock *clock,
					      u64 timestamp)
{
	u64 time = REAL_TIME_TO_NS(timestamp >> 32, timestamp & 0xFFFFFFFF);

	return ns_to_ktime(time);
}
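FWIW, that real-time format conversion is lockless, so if a driver did want to
unroll it, it is only a handful of ALU insns. A rough sketch (assuming a
preceding insn has already loaded the raw 64-bit descriptor timestamp into R0):

	/* r1 = upper 32 bits (hi) */
	bpf_patch_append(patch, BPF_MOV64_REG(BPF_REG_1, BPF_REG_0));
	bpf_patch_append(patch, BPF_ALU64_IMM(BPF_RSH, BPF_REG_1, 32));
	/* r1 = hi * NSEC_PER_SEC */
	bpf_patch_append(patch, BPF_ALU64_IMM(BPF_MUL, BPF_REG_1, NSEC_PER_SEC));
	/* r0 = lower 32 bits (low); the 32-bit mov zero-extends */
	bpf_patch_append(patch, BPF_MOV32_REG(BPF_REG_0, BPF_REG_0));
	/* r0 = hi * NSEC_PER_SEC + low */
	bpf_patch_append(patch, BPF_ALU64_REG(BPF_ADD, BPF_REG_0, BPF_REG_1));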
If some hints are harder to get, then just doing a kfunc call is better.
Sure, but if we end up having a full function call for every field in
the metadata, that will end up having a significant performance impact
on the XDP data path (thinking mostly about the skb metadata case here,
which will collect several bits of metadata).
csum may have a better chance to inline?
It will be useful if the skb_metadata can get at least one more field like csum
or vlan.
Yup, I agree. Including that also makes it possible to benchmark this
series against Jesper's, which I think we should definitely be doing
before merging this.
If the hint needs a lock to obtain, it is not cheap to begin with, regardless
of whether it is a kfunc or not. Also, there is bpf_xdp_metadata_rx_timestamp_supported()
to check before doing the bpf_xdp_metadata_rx_timestamp().
This set gives the xdp prog a flexible way to avoid getting all hints (some of
them require a lock) if all the xdp prog needs is a csum. imo, giving this
flexibility to the xdp prog is the important thing for this set in terms of
performance.
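From the xdp prog side, the intended usage pattern is roughly the following
(sketch only; the kfunc signatures are assumed here for illustration):

	#include <stdbool.h>
	#include <linux/bpf.h>
	#include <bpf/bpf_helpers.h>

	extern bool bpf_xdp_metadata_rx_timestamp_supported(const struct xdp_md *ctx) __ksym;
	extern __u64 bpf_xdp_metadata_rx_timestamp(const struct xdp_md *ctx) __ksym;

	SEC("xdp")
	int rx_prog(struct xdp_md *ctx)
	{
		__u64 ts = 0;

		/* Only pay for the hint this prog actually cares about. */
		if (bpf_xdp_metadata_rx_timestamp_supported(ctx))
			ts = bpf_xdp_metadata_rx_timestamp(ctx);

		/* ... use ts, skip csum/vlan/etc entirely ... */
		return XDP_PASS;
	}

	char _license[] SEC("license") = "GPL";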
Regardless, BPF in-lining is a well-solved problem and is already used in many
bpf helpers, so there are plenty of examples in the kernel. I don't think it is
necessary to block this series just because some helper wrappers for inlining
are missing. The driver can always start with the simpler kfunc call first and
optimize later if some hints from the driver allow it.
Well, "solved" in the sense of "there are a few handfuls of core BPF
people who know how to do it". My concern is that we'll end up with
either the BPF devs having to maintain all these bits of BPF byte code
in all the drivers, or drivers just punting to regular function calls
because the inlining is too complicated, with sub-par performance as per
the above. I don't think we should just hand-wave this away as "solved",
but rather treat this as an integral part of the initial series.
In terms of complexity/maintainability, I don't think a driver needs to inline
hundreds of bpf insns for a hint; needing that many is already a good signal that
it should just call the kfunc. For a simple hint, xdp_metadata_export_to_skb() is
a good example to start with, and I am not sure how much a helper wrapper can do
to simplify things further since each driver's inline code is going to be
different. The community's review will definitely be useful for the first few
drivers when the driver changes are posted to the list, and later drivers will
have an easier time following, much like how xdp was initially introduced to only
a few drivers first. When there are things to refactor after enough drivers
support it, that will be a better time to revisit.