Re: [RFC PATCH v2 1/5] bpf: add PHYS_DEV prog type for early driver filter

Jesper Dangaard Brouer <brouer@xxxxxxxxxx> · Fri, 8 Apr 2016 14:33:40 +0200

On Fri, 8 Apr 2016 12:36:14 +0200 Jesper Dangaard Brouer <brouer@xxxxxxxxxx> wrote:

> > +/* user return codes for PHYS_DEV prog type */
> > +enum bpf_phys_dev_action {
> > +	BPF_PHYS_DEV_DROP,
> > +	BPF_PHYS_DEV_OK,
> > +};  
> 
> I can imagine these extra return codes:
> 
>  BPF_PHYS_DEV_MODIFIED,   /* Packet page/payload modified */
>  BPF_PHYS_DEV_STOLEN,     /* E.g. forward use-case */
>  BPF_PHYS_DEV_SHARED,     /* Queue for async processing, e.g. tcpdump use-case */
> 
> The "STOLEN" and "SHARED" use-cases require some refcnt manipulations,
> which we can look at when we get that far...

I want to point out something that is quite FUNDAMENTAL for
understanding these return codes (and the network stack).

At driver RX time, the network stack basically has two ways of
building an SKB, which is then sent up the stack.

Option-A (fastest): The packet page is writable. The SKB can be
allocated and skb->data/head can point directly to the page.  And
we place/write skb_shared_info in the end/tail-room. (This is done by
calling build_skb()).
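
To make option-A concrete, here is a minimal userspace sketch (not
kernel code).  The struct and function names prefixed with mock_, the
buffer size and the headroom are assumptions made for illustration; only
the idea of pointing head/data into the RX buffer and writing the shared
info into its tail-room comes from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RX_BUF_SIZE 2048u /* assumed RX buffer size (half a 4K page) */
#define RX_HEADROOM   64u /* assumed headroom in front of the packet */

/* Hypothetical stand-in for the kernel's struct skb_shared_info. */
struct mock_shared_info {
	uint32_t nr_frags;
	uint32_t dataref;
};

/* Hypothetical stand-in for the buffer pointers of struct sk_buff. */
struct mock_skb {
	uint8_t *head; /* start of the buffer */
	uint8_t *data; /* start of the packet data */
	uint8_t *tail; /* end of the packet data */
	uint8_t *end;  /* where the shared info lives */
};

/* Shared info size rounded up to an assumed 64-byte cache line. */
static size_t mock_shinfo_size(void)
{
	return (sizeof(struct mock_shared_info) + 63u) & ~(size_t)63u;
}

/*
 * Option-A: wrap the SKB around the RX buffer itself.  No packet bytes
 * are copied, but the shared info is WRITTEN into the buffer tail-room,
 * which is why the buffer must be writable.
 * Returns 0 on success, -1 if the packet does not fit.
 */
static int mock_build_skb(struct mock_skb *skb, uint8_t *buf, size_t pkt_len)
{
	const struct mock_shared_info init = { .nr_frags = 0, .dataref = 1 };
	size_t usable = RX_BUF_SIZE - mock_shinfo_size();

	assert(skb && buf);
	if (pkt_len > usable - RX_HEADROOM)
		return -1;

	skb->head = buf;
	skb->data = buf + RX_HEADROOM;
	skb->tail = skb->data + pkt_len;
	skb->end  = buf + usable;

	/* This store into the RX buffer is what a read-only page forbids. */
	memcpy(skb->end, &init, sizeof(init));
	return 0;
}
```

The only per-packet work is pointer setup plus one small write, which is
where the speed advantage of this option comes from.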

Option-B (slower): The packet page is read-only.  The SKB cannot point
skb->data/head directly to the page, because skb_shared_info needs to be
written at skb->end (slightly hidden via the skb_shinfo() cast).  To
get around this, a separate piece of memory is allocated (sped up by
__alloc_page_frag) for skb->data/head to point to, so skb_shared_info
can be written. (This is done when calling netdev/napi_alloc_skb()).
  Drivers then need to copy over the packet headers, and assign + adjust
the skb_shinfo(skb)->frags[0] offset to skip the copied headers.
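
The option-B steps can be sketched in userspace as below.  This is an
illustration under assumptions, not driver code: the mock_ types, the
128-byte header copy length and the embedded head array are made up.  In
the kernel the head is a separate allocation that also holds
skb_shared_info; the sketch only shows the header copy and the frag
offset adjustment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HDR_COPY_LEN 128u /* assumed number of header bytes to copy */

/* Hypothetical stand-in for one skb_shinfo(skb)->frags[] entry. */
struct mock_frag {
	const uint8_t *page; /* read-only RX page, never written here */
	size_t offset;       /* where the remaining payload starts */
	size_t len;          /* remaining payload length */
};

/* Hypothetical stand-in for an SKB with its own writable head. */
struct mock_skb_b {
	uint8_t head[HDR_COPY_LEN]; /* separate memory for the headers */
	size_t head_len;
	struct mock_frag frag0;
};

/*
 * Option-B: copy the headers into the separately allocated head, then
 * reference the rest of the packet in place via frag0, with the offset
 * advanced past the bytes that were copied.
 */
static void mock_rx_option_b(struct mock_skb_b *skb, const uint8_t *page,
			     size_t pkt_off, size_t pkt_len)
{
	size_t hlen = pkt_len < HDR_COPY_LEN ? pkt_len : HDR_COPY_LEN;

	assert(skb && page);
	memcpy(skb->head, page + pkt_off, hlen);
	skb->head_len = hlen;

	skb->frag0.page = page;
	skb->frag0.offset = pkt_off + hlen; /* skip the copied headers */
	skb->frag0.len = pkt_len - hlen;
}
```

Compared with option-A there is an extra allocation and a memcpy per
packet, but the RX page itself is only ever read.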

Unfortunately most drivers use option-B, due to the cost of calling the
page allocator.  It is only slightly more expensive to get a larger
compound page from the page allocator, which can then be partitioned
into page-fragments, thus amortizing the page alloc cost.  Unfortunately
the cost is added back later, when constructing the SKB.
 Another reason for option-B is that archs with expensive IOMMU
requirements (like PowerPC) don't need to dma_unmap on every packet,
but only at the compound page level.
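
The amortization argument can be illustrated with a small userspace
sketch.  Everything here is assumed for illustration: the mock_ names,
the 32K compound page, the 2K fragment size, and the static array that
stands in for the page allocator.  The sketch also pretends that all
earlier fragments have been released at refill time, which the kernel
tracks with the page refcount.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define COMPOUND_SIZE 32768u /* assumed: order-3 compound page (8 x 4K) */
#define FRAG_SIZE      2048u /* assumed RX fragment size */

/* Stand-in for the memory handed out by the page allocator. */
static uint8_t mock_compound_page[COMPOUND_SIZE];

struct mock_frag_cache {
	uint8_t *base;                 /* current compound page, or NULL */
	size_t offset;                 /* next free byte within it */
	unsigned long allocator_calls; /* how often we had to refill */
};

/* Hand out the next fragment, refilling only when the page is used up. */
static uint8_t *mock_frag_alloc(struct mock_frag_cache *c)
{
	uint8_t *frag;

	assert(c);
	if (!c->base || c->offset + FRAG_SIZE > COMPOUND_SIZE) {
		/* The expensive step: one "page allocator" call. */
		c->base = mock_compound_page;
		c->offset = 0;
		c->allocator_calls++;
	}
	frag = c->base + c->offset;
	c->offset += FRAG_SIZE;
	return frag;
}
```

With these assumed sizes, 16 RX buffers share one allocator call, so the
per-packet allocator cost drops to 1/16 of a call.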

Side-note: Most drivers have a "copy-break" optimization, especially
with option-B, where header data is copied anyhow.  For small packets,
one might as well copy the whole packet and free (or recycle) the RX
page, if it fits into the newly allocated memory (the one that holds
skb_shared_info).
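
A copy-break decision could look roughly like the following userspace
sketch.  The 256-byte threshold, the enum and the function name are
assumptions for illustration; real drivers choose their own threshold
and recycling scheme.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define COPYBREAK 256u /* assumed threshold in bytes */

/* Hypothetical outcome of the RX path for one packet. */
enum mock_rx_path {
	RX_COPIED_PAGE_RECYCLED, /* whole packet copied, RX page reusable */
	RX_PAGE_ATTACHED,        /* packet too large, page stays attached */
};

/*
 * Copy-break: if the packet is small and fits into the already
 * allocated head memory, copy all of it so the RX page can be freed or
 * recycled right away.
 */
static enum mock_rx_path mock_copybreak(uint8_t *head, size_t head_room,
					const uint8_t *page, size_t pkt_len)
{
	assert(head && page);
	if (pkt_len <= COPYBREAK && pkt_len <= head_room) {
		memcpy(head, page, pkt_len);
		return RX_COPIED_PAGE_RECYCLED;
	}
	return RX_PAGE_ATTACHED;
}
```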

For the early filter drop (DDoS use-case), it does not matter that the
packet-page is read-only.

BUT for the future XDP (eXpress Data Path) use-case it does matter.  If
we ever want to see speeds comparable to DPDK, then drivers need to
implement option-A, as this allows forwarding at the packet-page
level.

I hope my future page-pool facility can remove/hide the cost of calling
the page allocator.

Back to the return codes, thus:
-------------------------------
BPF_PHYS_DEV_SHARED requires the driver to use option-B when
constructing the SKB, and to treat the packet data as read-only.

BPF_PHYS_DEV_MODIFIED requires the driver to provide a writable
packet-page.
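
As a sketch of how a driver might reason about this, the mapping can be
written as a small lookup.  Note the assumptions: MODIFIED, STOLEN and
SHARED are only the proposed codes from the quoted text, their numeric
order is made up, and enum mock_page_req and the function are
hypothetical.  Requirements not spelled out in this mail are left
unspecified on purpose.

```c
#include <assert.h>

/* DROP and OK are from the patch; the other three are proposals only. */
enum bpf_phys_dev_action {
	BPF_PHYS_DEV_DROP,
	BPF_PHYS_DEV_OK,
	BPF_PHYS_DEV_MODIFIED, /* proposed: packet page/payload modified */
	BPF_PHYS_DEV_STOLEN,   /* proposed: e.g. forward use-case */
	BPF_PHYS_DEV_SHARED,   /* proposed: queued for async processing */
};

/* Hypothetical summary of what a return code demands from the driver. */
enum mock_page_req {
	PAGE_ANY,                /* read-only or writable, does not matter */
	PAGE_WRITABLE,           /* driver must provide a writable page */
	PAGE_READ_ONLY_OPTION_B, /* build the SKB via option-B, data read-only */
	PAGE_UNSPECIFIED,        /* not stated in this mail */
};

static enum mock_page_req mock_page_requirement(enum bpf_phys_dev_action act)
{
	/* Reject values outside the (assumed) set of return codes. */
	assert((unsigned int)act <= BPF_PHYS_DEV_SHARED);

	switch (act) {
	case BPF_PHYS_DEV_DROP:
		return PAGE_ANY;
	case BPF_PHYS_DEV_MODIFIED:
		return PAGE_WRITABLE;
	case BPF_PHYS_DEV_SHARED:
		return PAGE_READ_ONLY_OPTION_B;
	case BPF_PHYS_DEV_OK:
	case BPF_PHYS_DEV_STOLEN:
	default:
		return PAGE_UNSPECIFIED;
	}
}
```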

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
