Re: [PATCH bpf 1/2] bpf: Avoid iter->offset making backward progress in bpf_iter_udp

Martin KaFai Lau <martin.lau@xxxxxxxxx> · Thu, 4 Jan 2024 16:33:04 -0800

On 1/4/24 3:38 PM, Aditi Ghag wrote:
I'm not sure about semantics of the resume operation for certain corner cases like these:
- The BPF UDP sockets iterator was stopped while iterating bucker #X, and the offset was set to 2. bpf_iter_udp_seq_stop then released references to the batched sockets, and marks the bucket X iterator state (aka iter->st_bucket_done) as false.
- Before the iterator is "resumed", the bucket #X was mutated such that the previously iterated sockets were removed, and new sockets were added.  With the current logic, the iterator will skip the first two sockets in the bucket, which isn't right. This is slightly different from the case where sockets were updated in the X -1 bucket *after* it was fully iterated. Since the bucket and sock locks are released, we don't have any guarantees that the underlying sockets table isn't mutated while the userspace has a valid iterator.
What do you think about such cases?
I believe it is something orthogonal to the bug fix here but we could use this thread to discuss.

Yes, indeed! But I piggy-backed on the same thread, as one potential option could be to always start iterating from the beginning of a bucket. (More details below.)

This is not something specific to the bpf tcp/udp iter which uses the offset as a best effort to resume (e.g. the inet_diag and the /proc/net/{tcp[6],udp} are using similar strategy to resume). To improve it, it will need to introduce some synchronization with the (potentially fast path) writer side (e.g. bind, after 3WHS...etc). Not convinced it is worth it to catch these cases.

Right, synchronizing fast paths with the iterator logic seems like an overkill.

If we change the resume semantics, and make the iterator always start from the beginning of a bucket, it could solve some of these corner cases (and simplify the batching logic). The last I checked, the TCP (BPF) iterator logic was tightly coupled with the 

Always resume from the beginning of the bucket? hmm... then it is making 
backward progress and will hit the same bug again. or I miss-understood your 
proposal?

file based iterator (/proc/net/{tcp,udp}), so I'm not sure if it's an easy change if we were to change the resume semantics for both TCP and UDP BFP iterators?
Note that, this behavior would be similar to the lseek operation with seq_file [1]. Here is a snippet -

bpf_iter does not support lseek.

The stop() function closes a session; its job, of course, is to clean up. If dynamic memory is allocated for the iterator, stop() is the place to free it; if a lock was taken by start(), stop() must release that lock. The value that *pos was set to by the last next() call before stop() is remembered, and used for the first start() call of the next session unless lseek() has been called on the file; in that case next start() will be asked to start at position zero

[1] https://docs.kernel.org/filesystems/seq_file.html

For the cases described above, skipped the newer sockets is arguably ok. These two new sockets will not be captured anyway even the batch was not stop()-ed in the middle. I also don't see how it is different semantically if the two new sockets are added to the X-1 bucket: the sockets are added after the bpf-iter scanned it regardless they are added to an earlier bucket or to an earlier location of the same bucket.

That said, the bpf_iter_udp_seq_stop() should only happen if the bpf_prog bpf_seq_printf() something AND hit the seq->buf (PAGE_SIZE) << 3) limit or the count in "read(iter_fd, buf, count)" limit.

Thanks for sharing the additional context. Would you have a link for these set of conditions where an iterator can be stopped? It'll be good to document the API semantics so that users are aware of the implications of setting the read size parameter too low.

Most of the conditions should be in bpf_seq_read() in bpf_iter.c.

Again, this resume on offset behavior is not something specific to 
bpf-{tcp,udp}-iter.

For this case, bpf_iter.c may be improved to capture the whole batch's seq_printf() to seq->buf first even the userspace's buf is full. It would be a separate effort if it is indeed needed.

Interesting proposal... Other option could be to invalidate the userspace iterator if an entire bucket batch is not being captured, so that userspace can retry with a bigger buffer.

Not sure how to invalidate the user buffer without breaking the existing 
userspace app though.

The earlier idea on seq->buf was a quick thought. I suspect there is still 
things that need to think through if we did want to explore how to provide 
better guarantee to allow seq_printf() for one whole batch. I still feel it is 
overkill.