Hi Martin,

Thanks a lot for taking a look.

> The batch should have a snapshot of the bucket. Practically, it should not have
> the "repeating" or "missing" a sk issue as long as seq->stop() is not called in
> the middle of the iteration of that batch.
>
> I think this guarantee is enough for the bpf_sock_destroy() and the
> bpf_setsockopt() use case if the bpf prog ends up not seq_write()-ing anything.

Yeah, I realized shortly after sending this out that if you're purely
using the iterator to call bpf_sock_destroy() or bpf_setsockopt()
without any seq_write()s, you would likely never have to process a
bucket in multiple "chunks". Possibly a poor example on my part :).
Although, I have a possibly dumb question even in this case. Focusing
in on just bpf_iter_udp_batch() for a second:

> 	if (!resized && !bpf_iter_udp_realloc_batch(iter, batch_sks * 3 / 2)) {
> 		resized = true;
> 		/* After allocating a larger batch, retry one more time to grab
> 		 * the whole bucket.
> 		 */
> 		goto again;
> 	}

Barring the possibility that bpf_iter_udp_realloc_batch() fails to grab
more memory (should this basically never happen?), this should ensure
that we always grab the full contents of the bucket on the second go
around. However, the spin lock on hslot2->lock is released before doing
this. Would it not be more accurate to hold onto the lock until after
the second attempt, so we know the bucket size isn't changing between
the time when we release the lock and the time when we reacquire it
post-batch-resize? The bucket would have to grow by more than 1.5x for
the new size to be insufficient, so I may be splitting hairs here, but
it's just something I noticed.

But yeah, iterators that also print/dump are the main concern.

> One thought is to avoid seq->stop() which will require to do batching again next
> time, in particular, when the user has provided large buf to read() to ensure it
> is large enough for one bucket. May be we can return whatever seq->buf has to
> the userspace whenever a bucket/batch is done. This will have perf tradeoff
> though and not sure how the userspace can optin.

Hmmm, I'm not sure I understand here. As you say below, don't we have
to use stop to release the sk references between reads?

> Another random thought is in seq->stop (bpf_iter_{tcp,udp}_seq_stop). It has to
> release the sk refcnt because we don't know when will the userspace come back to
> read(). Will it be useful if it stores the cookies of the sk(s) that have not
> yet seq->show?

Keeping track of all the cookies we've seen (or haven't seen) in the
current bucket would indeed let us avoid skipping sockets we haven't
visited yet, or repeating ones we have, on subsequent reads. I had
considered something like this, but initially avoided it, since I
didn't want to dynamically allocate (and reallocate) additional memory
to keep track of cookies. I also wasn't sure whether that would be
acceptable performance-wise, and of course the allocation can fail, in
which case you're back to square one. Although, I may be imagining a
different implementation than you are. In fact, this line of thinking
led me to the approach proposed in the RFC, which tries to accurately
and efficiently track what has already been seen without remembering
all the cookies or allocating any additional buffers. Still, this might
make a good alternative if the concerns I listed aren't a big deal.

Thanks!
-Jordan
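
P.S. For what it's worth, here is a rough userspace sketch of the
"large buf to read()" idea as I understand it: attach a UDP socket
iterator and drain it with a single large buffer, so the kernel is less
likely to hit seq->stop() partway through a bucket. The skeleton and
program names (iter_udp, dump_udp) are made up for illustration; this
is not code from the series.

/* Hypothetical skeleton generated from a bpf iterator prog ("dump_udp")
 * that seq_write()s one line per UDP socket.
 */
#include <stdio.h>
#include <unistd.h>
#include <bpf/bpf.h>
#include <bpf/libbpf.h>
#include "iter_udp.skel.h"

int main(void)
{
	static char buf[1 << 20]; /* large read() buffer */
	struct iter_udp *skel;
	struct bpf_link *link;
	int iter_fd, err = 1;
	ssize_t n;

	skel = iter_udp__open_and_load();
	if (!skel)
		return 1;

	link = bpf_program__attach_iter(skel->progs.dump_udp, NULL);
	if (!link)
		goto out;

	iter_fd = bpf_iter_create(bpf_link__fd(link));
	if (iter_fd < 0)
		goto out_link;

	/* Each read() drains whatever the iterator produced. The bigger
	 * the buffer, the more sockets get shown per read() and the less
	 * likely the iterator stops in the middle of a bucket and has to
	 * re-batch it on the next read().
	 */
	while ((n = read(iter_fd, buf, sizeof(buf))) > 0)
		fwrite(buf, 1, n, stdout);
	err = n < 0;

	close(iter_fd);
out_link:
	bpf_link__destroy(link);
out:
	iter_udp__destroy(skel);
	return err;
}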