RE: [PATCH v2 net-next 6/9] bpf: helpers: add bpf_xdp_adjust_mb_header helper

John Fastabend <john.fastabend@xxxxxxxxx> · Thu, 03 Sep 2020 23:47:14 -0700

Lorenzo Bianconi wrote:
> Introduce bpf_xdp_adjust_mb_header helper in order to adjust frame
> headers moving *offset* bytes from/to the second buffer to/from the
> first one.
> This helper can be used to move headers when the hw DMA SG is not able
> to copy all the headers in the first fragment and split header and data
> pages.
> 
> Signed-off-by: Lorenzo Bianconi <lorenzo@xxxxxxxxxx>
> ---
>  include/uapi/linux/bpf.h       | 25 ++++++++++++----
>  net/core/filter.c              | 54 ++++++++++++++++++++++++++++++++++
>  tools/include/uapi/linux/bpf.h | 26 ++++++++++++----
>  3 files changed, 95 insertions(+), 10 deletions(-)
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 8dda13880957..c4a6d245619c 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -3571,11 +3571,25 @@ union bpf_attr {
>   *		value.
>   *
>   * long bpf_copy_from_user(void *dst, u32 size, const void *user_ptr)
> - * 	Description
> - * 		Read *size* bytes from user space address *user_ptr* and store
> - * 		the data in *dst*. This is a wrapper of copy_from_user().
> - * 	Return
> - * 		0 on success, or a negative error in case of failure.
> + *	Description
> + *		Read *size* bytes from user space address *user_ptr* and store
> + *		the data in *dst*. This is a wrapper of copy_from_user().
> + *
> + * long bpf_xdp_adjust_mb_header(struct xdp_buff *xdp_md, int offset)
> + *	Description
> + *		Adjust frame headers moving *offset* bytes from/to the second
> + *		buffer to/from the first one. This helper can be used to move
> + *		headers when the hw DMA SG does not copy all the headers in
> + *		the first fragment.

This is confusing to read. Does this mean I can "move bytes to the second
buffer from the first one" or "move bytes from the second buffer to the first
one" And what are frame headers? I'm sure I can read below code and work
it out, but users reading the header file should be able to parse this.

Also we want to be able to read all data not just headers. Reading the
payload of a TCP packet is equally important for many l7 load balancers.

> + *
> + *		A call to this helper is susceptible to change the underlying
> + *		packet buffer. Therefore, at load time, all checks on pointers
> + *		previously done by the verifier are invalidated and must be
> + *		performed again, if the helper is used in combination with
> + *		direct packet access.
> + *
> + *	Return
> + *		0 on success, or a negative error in case of failure.
>   */
>  #define __BPF_FUNC_MAPPER(FN)		\
>  	FN(unspec),			\
> @@ -3727,6 +3741,7 @@ union bpf_attr {
>  	FN(inode_storage_delete),	\
>  	FN(d_path),			\
>  	FN(copy_from_user),		\
> +	FN(xdp_adjust_mb_header),	\
>  	/* */
>  
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 47eef9a0be6a..ae6b10cf062d 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -3475,6 +3475,57 @@ static const struct bpf_func_proto bpf_xdp_adjust_head_proto = {
>  	.arg2_type	= ARG_ANYTHING,
>  };
>  
> +BPF_CALL_2(bpf_xdp_adjust_mb_header, struct  xdp_buff *, xdp,
> +	   int, offset)
> +{
> +	void *data_hard_end, *data_end;
> +	struct skb_shared_info *sinfo;
> +	int frag_offset, frag_len;
> +	u8 *addr;
> +
> +	if (!xdp->mb)
> +		return -EOPNOTSUPP;
> +
> +	sinfo = xdp_get_shared_info_from_buff(xdp);
> +
> +	frag_len = skb_frag_size(&sinfo->frags[0]);
> +	if (offset > frag_len)
> +		return -EINVAL;

What if we want data in frags[1] and so on.

> +
> +	frag_offset = skb_frag_off(&sinfo->frags[0]);
> +	data_end = xdp->data_end + offset;
> +
> +	if (offset < 0 && (-offset > frag_offset ||
> +			   data_end < xdp->data + ETH_HLEN))
> +		return -EINVAL;
> +
> +	data_hard_end = xdp_data_hard_end(xdp); /* use xdp->frame_sz */
> +	if (data_end > data_hard_end)
> +		return -EINVAL;
> +
> +	addr = page_address(skb_frag_page(&sinfo->frags[0])) + frag_offset;
> +	if (offset > 0) {
> +		memcpy(xdp->data_end, addr, offset);

But this could get expensive for large amount of data? And also
limiting because we require the room to do the copy. Presumably
the reason we have fargs[1] is because the normal page or half
page is in use?

> +	} else {
> +		memcpy(addr + offset, xdp->data_end + offset, -offset);
> +		memset(xdp->data_end + offset, 0, -offset);
> +	}
> +
> +	skb_frag_size_sub(&sinfo->frags[0], offset);
> +	skb_frag_off_add(&sinfo->frags[0], offset);
> +	xdp->data_end = data_end;
> +
> +	return 0;
> +}

So overall I don't see the point of copying bytes from one frag to
another. Create an API that adjusts the data pointers and then
copies are avoided and manipulating frags is not needed.

Also and even more concerning I think this API requires the
driver to populate shinfo. If we use TX_REDIRECT a lot or TX_XMIT
this means we need to populate shinfo when its probably not ever
used. If our driver is smart L2/L3 headers are in the readable
data and prefetched. Writing frags into the end of a page is likely
not free.

Did you benchmark this?

In general users of this API should know the bytes they want
to fetch. Use an API like this,

  bpf_xdp_adjust_bytes(xdp, start, end)

Where xdp is the frame, start is the first byte the user wants
and end is the last byte. Then usually you can skip the entire
copy part and just move the xdp pointesr around. The ugly case
is if the user puts start/end across a frag boundary and a copy
is needed. In that case maybe we use end as a hint and not a
hard requirement.

The use case I see is I read L2/L3/L4 headers and I need the
first N bytes of the payload. I know where the payload starts
and I know how many bytes I need so,

  bpf_xdp_adjust_bytes(xdp, payload_offset, bytes_needed);

Then hopefully that is all in one frag. If its not I'll need
to do a second helper call. Still nicer than forcing drivers
to populate this shinfo I think. If you think I'm wrong try
a benchmark. Benchmarks aside I get stuck when data_end and
data_hard_end are too close together.

Thanks,
John