Re: [PATCH v2 bpf-next 12/12] selftests/bpf: add kfree_skb raw_tp test

Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> · Thu, 10 Oct 2019 12:07:20 -0700

On Thu, Oct 10, 2019 at 02:07:29PM +0300, Ido Schimmel wrote:
> On Wed, Oct 09, 2019 at 09:15:03PM -0700, Alexei Starovoitov wrote:
> > +SEC("raw_tracepoint/kfree_skb")
> > +int trace_kfree_skb(struct trace_kfree_skb *ctx)
> > +{
> > +	struct sk_buff *skb = ctx->skb;
> > +	struct net_device *dev;
> > +	int ifindex;
> > +	struct callback_head *ptr;
> > +	void *func;
> > +
> > +	__builtin_preserve_access_index(({
> > +		dev = skb->dev;
> > +		ifindex = dev->ifindex;
> 
> Hi Alexei,
> 
> The patchset looks very useful. One question: Is it always safe to
> access 'skb->dev->ifindex' here? I'm asking because 'dev' is inside a
> union with 'dev_scratch' which is 'unsigned long' and therefore might
> not always be a valid memory address. Consider for example the following
> code path:
> 
> ...
> __udp_queue_rcv_skb(sk, skb)
> 	__udp_enqueue_schedule_skb(sk, skb)
> 		udp_set_dev_scratch(skb)
> 		// returns error
> 	...
> 	kfree_skb(skb) // ebpf program is invoked
> 
> How is this handled by eBPF?

Excellent question. There are cases like this where the verifier cannot possibly
track semantics of the kernel code and union of pointer with scratch area
like this. That's why every access through btf pointer is a hidden probe_read.
Comparing to old school tracing all memory accesses were probe_read
and bpf prog was free to read anything and page fault everywhere.
Now bpf prog will almost always access correct data. Yet corner cases like
this are inevitable. I'm working on few ideas how to improve it further
with btf-tagged slab allocations and kasan-like memory shadowing.

Your question made me thinking whether we have a long standing issue
with dev_scratch, since even classic bpf has SKF_AD_IFINDEX hack
which is implemented as:
    *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct sk_buff, dev),
                          BPF_REG_TMP, BPF_REG_CTX,
                          offsetof(struct sk_buff, dev));
    /* if (tmp != 0) goto pc + 1 */
    *insn++ = BPF_JMP_IMM(BPF_JNE, BPF_REG_TMP, 0, 1);
    *insn++ = BPF_EXIT_INSN();
    if (fp->k == SKF_AD_OFF + SKF_AD_IFINDEX)
            *insn = BPF_LDX_MEM(BPF_W, BPF_REG_A, BPF_REG_TMP,
                                offsetof(struct net_device, ifindex));

That means for long time [c|e]BPF code was checking skb->dev for NULL only.
I've analyzed the code where socket filters can be called and I think it's good
there. dev_scratch is used after sk_filter has run.
But there are other hooks: lwt, various cgroups.
I've checked lwt and cgroup inet/egress. I think dev_scratch should not be
used in these paths. So should be good there as well.
But I think the whole idea of aliasing scratch into 'dev' pointer is dangerous.
There are plenty of tracepoints that do skb->dev->foo. Hard to track where
everything is called.
I think udp code need to move this dev_scratch into some other place in skb.