On 3/17/25 6:45 PM, Jordan Rife wrote:
Hi Martin,
Thanks a lot for taking a look.
The batch should have a snapshot of the bucket. Practically, it should not have
the "repeating" or "missing" sk issue as long as seq->stop() is not called in
the middle of iterating that batch.
I think this guarantee is enough for the bpf_sock_destroy() and the
bpf_setsockopt() use case if the bpf prog ends up not seq_write()-ing anything.
Yeah, I realized shortly after sending this out that in the case that
you're purely using the iterator to call bpf_sock_destroy() or
bpf_setsockopt() without any seq_write()s, you would likely never have
to process a bucket in multiple "chunks". Possibly a poor example on
my part :). Although, I have a possibly dumb question even in this
case. Focusing on just bpf_iter_udp_batch for a second:
if (!resized && !bpf_iter_udp_realloc_batch(iter, batch_sks * 3 / 2)) {
	resized = true;
	/* After allocating a larger batch, retry one more time to grab
	 * the whole bucket.
	 */
	goto again;
}
Barring the possibility that bpf_iter_udp_realloc_batch() fails to
grab more memory (should this basically never happen?), this should
ensure that we always grab the full contents of the bucket on the
second go around. However, the spin lock on hslot2->lock is released
before doing this. Would it not be more accurate to hold onto the lock
until after the second attempt, so we know the size isn't changing
between the time we release the lock and the time we reacquire it
post-batch-resize? The bucket size would have to grow by
more than 1.5x for the new size to be insufficient, so I may be
splitting hairs here, but just something I noticed.
It is a very corner case.
I guess it can fail with GFP_ATOMIC. I just didn't think holding the lock across
the realloc was needed considering the key of the hash is addresses+ports. If we
have many socks colliding on the same addresses+ports bucket, that would be a
hashtable problem better solved first.
The default batch size is 16 now. On the listening hashtable + SO_REUSEPORT,
userspace may have one listen sk bound to the same address+port for each
thread. It is not uncommon to have hundreds of CPU cores now, so it may actually
need to hit the realloc_batch() path once and then will likely stay at that size
for the whole hashtable iteration.
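For illustration, the per-thread setup that produces such a bucket is roughly
(a sketch; error handling is minimal and the port is arbitrary), with each
thread calling something like:

/* Each worker thread creates its own listener on the same address+port
 * via SO_REUSEPORT, so all of them land in one (address, port) bucket.
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <sys/socket.h>
#include <unistd.h>

int make_reuseport_listener(uint16_t port)
{
	struct sockaddr_in addr = {
		.sin_family = AF_INET,
		.sin_port = htons(port),
		.sin_addr.s_addr = htonl(INADDR_ANY),
	};
	int one = 1;
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;
	if (setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one)) ||
	    bind(fd, (struct sockaddr *)&addr, sizeof(addr)) ||
	    listen(fd, 128)) {
		close(fd);
		return -1;
	}
	return fd;
}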
But yeah, iterators that also print/dump are the main concern.
One thought is to avoid seq->stop(), which would require doing the batching again
next time, in particular when the user has provided a buf to read() that is large
enough for one bucket. Maybe we can return whatever seq->buf has to userspace
whenever a bucket/batch is done. This has a perf tradeoff though, and I am not
sure how userspace would opt in.
Hmmm, not sure if I understand here. As you say below, don't we have
to use stop to deref the sk between reads?
I was thinking of the case where, e.g., the userspace buf may be large enough for
>1 bucket but <2 buckets. Then the 2nd bucket is only half done and requires a
stop().
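For reference, the userspace side I have in mind is just the usual iterator
read loop with a deliberately large buf (a libbpf-based sketch; the object and
prog names "iter_udp.bpf.o"/"dump_udp" are made up):

/* Open an iter/udp prog, create the iterator fd, and read() it with a
 * buf that is much larger than one bucket's worth of output.
 */
#include <stdio.h>
#include <unistd.h>
#include <bpf/bpf.h>
#include <bpf/libbpf.h>

int main(void)
{
	static char buf[1 << 20];
	struct bpf_object *obj;
	struct bpf_program *prog;
	struct bpf_link *link;
	int iter_fd;
	ssize_t n;

	obj = bpf_object__open_file("iter_udp.bpf.o", NULL);
	if (!obj || bpf_object__load(obj))
		return 1;
	prog = bpf_object__find_program_by_name(obj, "dump_udp");
	if (!prog)
		return 1;
	link = bpf_program__attach_iter(prog, NULL);
	if (!link)
		return 1;
	iter_fd = bpf_iter_create(bpf_link__fd(link));
	if (iter_fd < 0)
		return 1;

	while ((n = read(iter_fd, buf, sizeof(buf))) > 0)
		fwrite(buf, 1, n, stdout);

	close(iter_fd);
	bpf_link__destroy(link);
	bpf_object__close(obj);
	return 0;
}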
Another random thought is in seq->stop() (bpf_iter_{tcp,udp}_seq_stop). It has to
release the sk refcnt because we don't know when userspace will come back to
read(). Would it be useful if it stored the cookies of the sk(s) that have not
yet been seq->show()'ed?
Keeping track of all the cookies we've seen or haven't seen in the
current bucket would indeed allow us to avoid skipping any we haven't
seen or repeating those we have on subsequent reads. I had considered
I don't think we need to worry about sks newly added after the iteration
started. This chase on newly added sks may never end.
We only need to avoid iterating the same sk twice.
something like this, but had initially avoided it, since I didn't want
to dynamically allocate (and reallocate) additional memory to keep
track of cookies. I also wasn't sure if this would be acceptable
performance-wise, and of course the allocation can fail in which case
you're back to square one. Although, I may be imagining a different
implementation than you. In fact, this line of thinking led me to the
approach proposed in the RFC which basically tries to
accurately/efficiently track everything that's already seen without
remembering all the cookies or allocating any additional buffers. This
might make a good alternative if the concerns I listed aren't a big
deal.
I don't think memory will be a concern if we are only talking about another array
(for one bucket) for storing cookies. This may actually be the right
tradeoff instead of modifying the common networking code path or adding fields
to struct sock.
We can also consider whether the same sk batch array can be reused to store cookies
during stop(). If the array can be reused, that would be nice, but it is not a blocker imo.
In terms of slowness, the worst case would be that none of the previously stored
cookies can be found in the updated bucket? Not sure how often a socket goes away
and how often there is a very large bucket (as mentioned earlier regarding the
hashtable's key), so it should not be an issue for the iteration use case? I guess
it may need some rough PoC code to judge whether it is feasible.
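Something along these lines (plain C only to sketch the matching logic;
kernel-side it would run over the re-batched bucket in bpf_iter_udp_batch, and
the arrays here stand in for the saved cookies and the bucket's sks):

/* Rough PoC of resume-by-cookie: "saved" holds the cookies of the sks
 * that had not yet been show()'ed at the previous stop().  On resume,
 * re-batch the bucket and keep only the sks whose cookie is still in
 * "saved"; sks that went away simply never match, and newly added sks
 * are skipped (we don't chase them).
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

bool cookie_pending(const uint64_t *saved, size_t n_saved, uint64_t cookie)
{
	size_t i;

	for (i = 0; i < n_saved; i++)
		if (saved[i] == cookie)
			return true;
	return false;
}

/* Filter the freshly re-batched bucket down to the sks that were still
 * pending at the previous stop().  Worst case is O(n_bucket * n_saved)
 * with no matches at all, i.e. every previously pending sk is gone.
 */
size_t filter_batch(const uint64_t *bucket, size_t n_bucket,
		    const uint64_t *saved, size_t n_saved,
		    uint64_t *out)
{
	size_t i, n_out = 0;

	for (i = 0; i < n_bucket; i++)
		if (cookie_pending(saved, n_saved, bucket[i]))
			out[n_out++] = bucket[i];
	return n_out;
}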