Re: [PATCH bpf-next v1 1/3] bpf: Add skb dynptrs

Joanne Koong <joannelkoong@xxxxxxxxx> · Wed, 3 Aug 2022 18:05:44 -0700

On Wed, Aug 3, 2022 at 4:25 PM Jakub Kicinski <kuba@xxxxxxxxxx> wrote:
>
> On Wed, 3 Aug 2022 13:29:37 -0700 Joanne Koong wrote:
> > Thinking about this some more, I think BPF_FUNC_dynptr_from_skb needs
> > to be patched regardless in order to set the rd-only flag in the
> > metadata for the dynptr. There will be other helper functions that
> > write into dynptrs (eg memcpy with dynptrs, strncpy with dynptrs,
> > probe read user with dynptrs, ...) so I think it's more scalable if we
> > reject these writes at runtime through the rd-only flag in the
> > metadata, than for the verifier to custom-case that any helper funcs
> > that write into dynptrs will need to get dynptr type + do
> > may_access_direct_pkt_data() if it's type skb or xdp. The
> > inconsistency between not rd-only in metadata vs. rd-only in verifier
> > might be a little confusing as well.
> >
> > For these reasons, I'm leaning more towards having bpf_dynptr_write()
> > and other dynptr write helper funcs be rejected at runtime instead of
> > prog load time, but I'm eager to hear what you prefer.
> >
> > What are your thoughts?
>
> Oh. I thought dynptrs are an extension of the discussion we had about
> creating a skb_header_pointer()-like abstraction but it sounds like
> we veered quite far off that track at some point :(

I think the problem is that the skb may be cloned, so a write into any
portion of the paged data requires a pull. If it weren't for this,
then we could do the write and checksumming without pulling (eg kmap
the page, get the csum_partial of the bytes you'll write over, do the
write, get the csum_partial of the bytes you just wrote, then unkmap,
then update skb->csum to be skb->csum - csum of the bytes you wrote
over + csum of the bytes you wrote). I think we would even be able to
provide a direct data slice to non-contiguous pages without needing
the additional copy to a stack buffer (eg kmap the non-contiguous
pages to a contiguous virtual address that we pass back to the bpf
program, and then when the bpf program is finished do the cleanup for
the mappings).
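
To make the checksum arithmetic above concrete, here is a rough userspace
sketch (not kernel code). csum_partial_sketch(), csum_add32(), fold16() and
write_with_csum_update() are hypothetical stand-ins I made up for
illustration; they only model the one's-complement math, not kmap or any
skb handling. It also assumes the write starts at an even offset from the
start of the checksummed area, since an odd offset would need a byte swap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One's-complement add of two 32-bit partial sums (end-around carry). */
static uint32_t csum_add32(uint32_t a, uint32_t b)
{
	uint64_t s = (uint64_t)a + b;

	return (uint32_t)((s & 0xffffffffu) + (s >> 32));
}

/* Stand-in for csum_partial(): sums big-endian 16-bit words into 'sum'. */
static uint32_t csum_partial_sketch(const uint8_t *buf, size_t len, uint32_t sum)
{
	uint64_t acc = sum;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		acc += ((uint32_t)buf[i] << 8) | buf[i + 1];
	if (i < len)			/* odd trailing byte, zero padded */
		acc += (uint32_t)buf[i] << 8;
	while (acc >> 32)		/* fold carries back into 32 bits */
		acc = (acc & 0xffffffffu) + (acc >> 32);
	return (uint32_t)acc;
}

/* Fold a 32-bit partial sum to 16 bits (no final complement). */
static uint16_t fold16(uint32_t sum)
{
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/*
 * Overwrite pkt[off, off + len) with src and return the updated checksum:
 * csum - csum(bytes written over) + csum(bytes written).
 * Subtraction in one's complement is addition of the bitwise complement.
 */
static uint32_t write_with_csum_update(uint8_t *pkt, size_t off,
					const uint8_t *src, size_t len,
					uint32_t csum)
{
	uint32_t old_sum, new_sum;

	assert(off % 2 == 0);	/* assumption: 16-bit aligned write offset */

	old_sum = csum_partial_sketch(pkt + off, len, 0);
	memcpy(pkt + off, src, len);
	new_sum = csum_partial_sketch(pkt + off, len, 0);

	return csum_add32(csum_add32(csum, ~old_sum), new_sum);
}
```

The point is just that the update only touches the bytes being written, so
nothing else in the packet needs to be linear for the checksum to stay
correct.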

Three ideas I'm thinking through as a possible solution:
1) Enforce that the skb is always uncloned for skb-type bpf progs (we
currently do this just for the skb head, see bpf_unclone_prologue()),
but I'm not sure if the trade-off (pulling all the packet data, even
if it won't be used) is acceptable.

2) Don't support cloned skbs for bpf_dynptr_write/data and don't do
any pulling. If the prog wants to use bpf_dynptr_write/data, then they
have to pull it first.

3) (uglier than #1 and #2) For bpf_dynptr_write()s, pull if the write
is to a paged area and the skb is cloned, otherwise write to the paged
area without pulling; if we do this, then we always have to invalidate
all data slices associated with the skb (even for writes to the head)
since at prog load time, the verifier doesn't know if the pull happens
or not. For bpf_dynptr_data()s, follow the same policy.

I'm leaning towards 2. What are your thoughts?

>
> The point of skb_header_pointer() is to expose the chunk of the packet
> pointed to by [skb, offset, len] as a linear buffer. Potentially copying
> it out to a stack buffer *IFF* the header is not contiguous inside the
> skb head, which should very rarely happen.
>
> Here it seems we return an error so that user must pull if the data is
> not linear, which is defeating the purpose. The user of
> skb_header_pointer() wants to avoid the copy while _reliably_ getting
> a contiguous pointer. Plus pulling in the header may be far more
> expensive than a small copy to the stack.
>
> The pointer returned by skb_header_pointer is writable, but it's not
> guaranteed that the writes go to the packet, they may go to the
> on-stack buffer, so the caller must do some sort of:
>
>         if (data_ptr == stack_buf)
>                 skb_store_bits(...);
>
> Which we were thinking of wrapping in some sort of flush operation.
>
> If I'm reading this right, dynptrs as implemented here do not provide
> such semantics. Am I confused in thinking that this is a continuation
> of the XDP multi-buff discussion? Is it a completely separate thing
> and we'll still need a header_pointer like helper?
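
To check that I understand the semantics described above, here is a small
userspace model of the pattern (not kernel code): return a pointer into the
head when the requested range is contiguous there, otherwise copy out to a
caller-provided buffer, and have the caller flush writes back if it got the
buffer. struct pkt_model and all of the pkt_*() functions are hypothetical
names invented for this sketch; the real counterparts would be
skb_header_pointer(), skb_copy_bits() and skb_store_bits(), and a real skb
can have many frags rather than the single one modeled here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy packet: a linear head followed by one non-contiguous fragment. */
struct pkt_model {
	uint8_t *head;
	size_t head_len;
	uint8_t *frag;
	size_t frag_len;
};

/* Byte at logical position 'pos', wherever it physically lives. */
static uint8_t *pkt_byte(const struct pkt_model *p, size_t pos)
{
	return pos < p->head_len ? &p->head[pos] : &p->frag[pos - p->head_len];
}

static int pkt_range_ok(const struct pkt_model *p, size_t off, size_t len)
{
	size_t total = p->head_len + p->frag_len;

	return off <= total && len <= total - off;
}

/* Models skb_copy_bits(): packet -> buffer. */
static int pkt_copy_bits(const struct pkt_model *p, size_t off, void *to, size_t len)
{
	uint8_t *dst = to;

	if (!pkt_range_ok(p, off, len))
		return -1;
	for (size_t i = 0; i < len; i++)
		dst[i] = *pkt_byte(p, off + i);
	return 0;
}

/* Models skb_store_bits(): buffer -> packet. */
static int pkt_store_bits(struct pkt_model *p, size_t off, const void *from, size_t len)
{
	const uint8_t *src = from;

	if (!pkt_range_ok(p, off, len))
		return -1;
	for (size_t i = 0; i < len; i++)
		*pkt_byte(p, off + i) = src[i];
	return 0;
}

/* Direct pointer if the range is linear in the head, else a copy in stack_buf. */
static void *pkt_header_pointer(struct pkt_model *p, size_t off, size_t len,
				void *stack_buf)
{
	if (off <= p->head_len && len <= p->head_len - off)
		return p->head + off;
	if (pkt_copy_bits(p, off, stack_buf, len) < 0)
		return NULL;
	return stack_buf;
}

/* The "flush" step: only needed when the caller was handed the copy. */
static int pkt_header_flush(struct pkt_model *p, size_t off, size_t len,
			    void *ptr, void *stack_buf)
{
	if (ptr == stack_buf)
		return pkt_store_bits(p, off, stack_buf, len);
	return 0;
}
```

In this model the caller always gets a contiguous pointer and never has to
pull; the cost in the non-linear case is one small copy out and, if it
wrote anything, one copy back at flush time.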