> On Jan 4, 2024, at 4:33 PM, Martin KaFai Lau <martin.lau@xxxxxxxxx> wrote:
>
> On 1/4/24 3:38 PM, Aditi Ghag wrote:
>>>> I'm not sure about semantics of the resume operation for certain corner cases like these:
>>>> - The BPF UDP sockets iterator was stopped while iterating bucket #X, and the offset was set to 2. bpf_iter_udp_seq_stop then released references to the batched sockets, and marked the bucket X iterator state (aka iter->st_bucket_done) as false.
>>>> - Before the iterator is "resumed", the bucket #X was mutated such that the previously iterated sockets were removed, and new sockets were added. With the current logic, the iterator will skip the first two sockets in the bucket, which isn't right. This is slightly different from the case where sockets were updated in the X-1 bucket *after* it was fully iterated. Since the bucket and sock locks are released, we don't have any guarantees that the underlying sockets table isn't mutated while the userspace has a valid iterator.
>>>> What do you think about such cases?
>>> I believe it is something orthogonal to the bug fix here but we could use this thread to discuss.
>> Yes, indeed! But I piggy-backed on the same thread, as one potential option could be to always start iterating from the beginning of a bucket. (More details below.)
>>>
>>> This is not something specific to the bpf tcp/udp iter which uses the offset as a best effort to resume (e.g. the inet_diag and the /proc/net/{tcp[6],udp} are using a similar strategy to resume). To improve it, it will need to introduce some synchronization with the (potentially fast path) writer side (e.g. bind, after 3WHS...etc). Not convinced it is worth it to catch these cases.
>> Right, synchronizing fast paths with the iterator logic seems like overkill.
>> If we change the resume semantics, and make the iterator always start from the beginning of a bucket, it could solve some of these corner cases (and simplify the batching logic).
>> The last I checked, the TCP (BPF) iterator logic was tightly coupled with the file based iterator (/proc/net/{tcp,udp}), so I'm not sure if it's an easy change if we were to change the resume semantics for both TCP and UDP BPF iterators?
>
> Always resume from the beginning of the bucket? hmm... then it is making backward progress and will hit the same bug again. Or did I misunderstand your proposal?

I presumed that the user would be required to pass a bigger buffer when seq_printf fails to capture the socket data being iterated (this was before I was aware of the logic that decides when to stop the sockets iterator). Thanks for the code pointer in your last message, so I'll expand on the proposal below.

Also, we could continue to discuss if there are better ways to handle the cases where an iterator is stopped, but I would expect that we still need to fix the broken case in the current code, and get it backported. So I'll keep an eye out for your v2 patch.

>
>> Note that, this behavior would be similar to the lseek operation with seq_file [1]. Here is a snippet -
>
> bpf_iter does not support lseek.
>
>> The stop() function closes a session; its job, of course, is to clean up. If dynamic memory is allocated for the iterator, stop() is the place to free it; if a lock was taken by start(), stop() must release that lock. The value that *pos was set to by the last next() call before stop() is remembered, and used for the first start() call of the next session unless lseek() has been called on the file; in that case next start() will be asked to start at position zero
>> [1] https://docs.kernel.org/filesystems/seq_file.html
>>>
>>> For the cases described above, skipping the newer sockets is arguably ok. These two new sockets will not be captured anyway even if the batch was not stop()-ed in the middle.
>>> I also don't see how it is different semantically if the two new sockets are added to the X-1 bucket: the sockets are added after the bpf-iter scanned it regardless of whether they are added to an earlier bucket or to an earlier location of the same bucket.
>>>
>>> That said, the bpf_iter_udp_seq_stop() should only happen if the bpf_prog bpf_seq_printf() something AND hit the seq->buf (PAGE_SIZE << 3) limit or the count in "read(iter_fd, buf, count)" limit.
>> Thanks for sharing the additional context. Would you have a link for this set of conditions where an iterator can be stopped? It'll be good to document the API semantics so that users are aware of the implications of setting the read size parameter too low.
>
> Most of the conditions should be in bpf_seq_read() in bpf_iter.c.

Ah! This is helpful.

>
> Again, this resume on offset behavior is not something specific to bpf-{tcp,udp}-iter.
>
>>> For this case, bpf_iter.c may be improved to capture the whole batch's seq_printf() to seq->buf first even if the userspace's buf is full. It would be a separate effort if it is indeed needed.
>> Interesting proposal... Another option could be to invalidate the userspace iterator if an entire bucket batch is not being captured, so that userspace can retry with a bigger buffer.
>
> Not sure how to invalidate the user buffer without breaking the existing userspace app though.

By "invalidate the user buffer", I meant passing an error code to the userspace app, so that the userspace can allocate a bigger buffer accordingly. (I wasn't aware of whether or how this was being done behind the scenes in bpf_seq_read, so thanks for the pointer.)
Based on my reading of the code, bpf_seq_read does seem to pass an error code when the passed buffer size isn't enough. When that happens, I would've expected the userspace iterator to be invalidated rather than resumed, and the BPF iterator program to be rerun with a larger buffer.

>
> The earlier idea on seq->buf was a quick thought.
> I suspect there are still things that need to be thought through if we did want to explore how to provide a better guarantee to allow seq_printf() for one whole batch. I still feel it is overkill.

I'm still trying to fully grasp the logic in bpf_seq_read, but it seems like it's a generic function for all BPF iterators (and not just the BPF TCP/UDP iterators). *sigh*
So if we wanted to simplify the resume case such that we didn't have to keep track of offsets within a batch, we would have to tailor bpf_seq_read specifically for the batching logic in the BPF TCP/UDP iterators (being able to fully capture the batched sockets' printf output). That would indeed be a separate effort, and would need more discussion. One possible solution could be to handle the "resume" operation in seq->buf without involving the BPF TCP/UDP handlers, but I haven't fully thought this proposal through.

/cc Daniel
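
For reference, here is a toy user-space model of the offset-based resume discussed in this thread. It is only a sketch in Python, not kernel code: the bucket is modeled as a plain list, and the helper name resume_by_offset and the socket labels are made up for illustration. It assumes the resumed iterator simply re-reads the bucket and skips the first `offset` entries, which is the best-effort behavior described above.

```python
def resume_by_offset(bucket, offset):
    """Hypothetical helper: re-read the bucket and skip the first
    `offset` entries, as a best-effort resume would."""
    return list(bucket[offset:])


# Bucket X holds four sockets; the iterator is stopped after showing two.
bucket_before = ["sk_a", "sk_b", "sk_c", "sk_d"]
offset = 2
shown = bucket_before[:offset]

# Scenario 1: while stopped, the two shown sockets are removed and two
# new ones are added at the head (head insertion is an assumption here).
bucket_s1 = ["sk_e", "sk_f", "sk_c", "sk_d"]
resumed_s1 = resume_by_offset(bucket_s1, offset)
missed_s1 = [sk for sk in bucket_s1 if sk not in shown and sk not in resumed_s1]

# Scenario 2: while stopped, the two shown sockets are removed and nothing
# is added. The sockets that were never shown now sit below the offset.
bucket_s2 = ["sk_c", "sk_d"]
resumed_s2 = resume_by_offset(bucket_s2, offset)
missed_s2 = [sk for sk in bucket_s2 if sk not in shown and sk not in resumed_s2]

print("scenario 1 resumed:", resumed_s1, "missed:", missed_s1)
print("scenario 2 resumed:", resumed_s2, "missed:", missed_s2)
```

In scenario 1 only sockets added after the stop are missed, which matches the "arguably ok" case. In scenario 2 the missed sockets were in the bucket the whole time, which is the kind of case where a plain offset is only a best effort.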