Hi Martin,

Thanks a lot for taking a look.

> The batch should have a snapshot of the bucket. Practically, it should not have
> the "repeating" or "missing" a sk issue as long as seq->stop() is not called in
> the middle of the iteration of that batch.
>
> I think this guarantee is enough for the bpf_sock_destroy() and the
> bpf_setsockopt() use case if the bpf prog ends up not seq_write()-ing anything.

Yeah, I realized shortly after sending this out that if you're purely
using the iterator to call bpf_sock_destroy() or bpf_setsockopt()
without any seq_write()s, you would likely never have to process a
bucket in multiple "chunks". Possibly a poor example on my part :).
Although, I have a possibly dumb question even in this case. Focusing
in on just bpf_iter_udp_batch() for a second:

> 	if (!resized && !bpf_iter_udp_realloc_batch(iter, batch_sks * 3 / 2)) {
> 		resized = true;
> 		/* After allocating a larger batch, retry one more time to grab
> 		 * the whole bucket.
> 		 */
> 		goto again;
> 	}

Barring the possibility that bpf_iter_udp_realloc_batch() fails to grab
more memory (should this basically never happen?), this should ensure
that we always grab the full contents of the bucket on the second go
around. However, the spin lock on hslot2->lock is released before doing
this. Would it not be more accurate to hold onto the lock until after
the second attempt, so we know the bucket size isn't changing between
the time when we release the lock and the time when we reacquire it
post-batch-resize? The bucket would have to grow by more than 1.5x for
the new size to be insufficient, so I may be splitting hairs here, but
it's just something I noticed.

But yeah, iterators that also print/dump are the main concern.

> One thought is to avoid seq->stop() which will require to do batching again next
> time, in particular, when the user has provided large buf to read() to ensure it
> is large enough for one bucket. May be we can return whatever seq->buf has to
> the userspace whenever a bucket/batch is done. This will have perf tradeoff
> though and not sure how the userspace can optin.

Hmmm, I'm not sure I understand here. As you say below, don't we have
to use stop to release the sk references between reads?

> Another random thought is in seq->stop (bpf_iter_{tcp,udp}_seq_stop). It has to
> release the sk refcnt because we don't know when will the userspace come back to
> read(). Will it be useful if it stores the cookies of the sk(s) that have not
> yet seq->show?

Keeping track of all the cookies we've seen (or haven't seen) in the
current bucket would indeed let us avoid skipping sockets we haven't
visited yet, or repeating ones we have, on subsequent reads. I had
considered something like this, but initially avoided it, since I
didn't want to dynamically allocate (and reallocate) additional memory
to keep track of cookies. I also wasn't sure whether that would be
acceptable performance-wise, and of course the allocation can fail, in
which case you're back to square one. Although, I may be imagining a
different implementation than you are. In fact, this line of thinking
led me to the approach proposed in the RFC, which tries to accurately
and efficiently track what has already been seen without remembering
all the cookies or allocating any additional buffers. Still, this might
make a good alternative if the concerns I listed aren't a big deal.

Thanks!
-Jordan
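
P.S. For what it's worth, here is a rough userspace sketch of the
"large buf to read()" idea as I understand it: attach a UDP socket
iterator and drain it with a single large buffer, so the kernel is less
likely to hit seq->stop() partway through a bucket. The skeleton and
program names (iter_udp, dump_udp) are made up for illustration; this
is not code from the series.

/* Hypothetical skeleton generated from a bpf iterator prog ("dump_udp")
 * that seq_write()s one line per UDP socket.
 */
#include <stdio.h>
#include <unistd.h>
#include <bpf/bpf.h>
#include <bpf/libbpf.h>
#include "iter_udp.skel.h"

int main(void)
{
	static char buf[1 << 20]; /* large read() buffer */
	struct iter_udp *skel;
	struct bpf_link *link;
	int iter_fd, err = 1;
	ssize_t n;

	skel = iter_udp__open_and_load();
	if (!skel)
		return 1;

	link = bpf_program__attach_iter(skel->progs.dump_udp, NULL);
	if (!link)
		goto out;

	iter_fd = bpf_iter_create(bpf_link__fd(link));
	if (iter_fd < 0)
		goto out_link;

	/* Each read() drains whatever the iterator produced. The bigger
	 * the buffer, the more sockets get shown per read() and the less
	 * likely the iterator stops in the middle of a bucket and has to
	 * re-batch it on the next read().
	 */
	while ((n = read(iter_fd, buf, sizeof(buf))) > 0)
		fwrite(buf, 1, n, stdout);
	err = n < 0;

	close(iter_fd);
out_link:
	bpf_link__destroy(link);
out:
	iter_udp__destroy(skel);
	return err;
}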