On Mon, Jun 03, 2024 at 04:46:31PM +0200, Benjamin Tissoires wrote: > > Kernel-side descriptors should be used only for marshalling - they can > > be passed by userland (and should be resolved to struct file *, with > > no expectations that repeated call of fget() would yield the same > > pointer) and they can be returned _to_ userland - after you've allocated > > them and associated them with struct file *. > > I think the situation is not as bad as you think it is. > First these fds are not actually struct file * but bpf objects. So the > users of them should be rather small. Threading in this case is probably > fine because this only happens under a mutex lock. First of all, _any_ of those is actually a struct file * - it's just that initially you have a reference to opened bpffs file there. As soon as it's in the table, it can be replaced with _any_ reference to opened file. And no, your mutex doesn't matter here - dup2() neither knows nor cares about it. And that's all you need to replace a descriptor table entry - dup2() from another thread that happens to share descriptor table with the one you are currently running in. > And last, and I think that's the biggest point here, we are not really > "in the kernel". The code is executed in the kernel, but it's coming > from a syscall bpf program, and as such we are in a blurry state between > userspace and kernel space. In context of which thread is it executed? IOW, what does current point to when your code runs? > Then, the bpf API uses fds because we are > effectively going through the userspace API from within the kernel to > populate the maps. At any point here we don't assume the fd is safe. > Whenever the call to skel_map_update_elem() is done, the kernel emits > an equivalent of a syscall to this kernel function, and as such the > actual kernel code behind it treats everything like a kernel code should. It's not about the isolated accesses of descriptor table being unsafe; it's about the assumption that references you've put into descriptor table will not be replaced by another thread, both sides using perfectly sound operations for each access. Again, descriptor table should be treated as a shared data structure. There are some exceptions (e.g. in execve() past the point of no return, after it has explicitly unshared the descriptor table), but none of those applies in your case. There's no such thing as "lock descriptor table against modifications" available outside of fs/file.c; moreover, the thing used inside fs/file.c to protect individual table modifications is a spinlock and it really can't be held over blocking operations. So the descriptor table itself will be fine, but the references you store in there might be replaced at any point. > Efectively, this use of fds is to pass information from/to userland :) "Here's a new descriptor, feel free to do IO on it" is fine - if another thread starts playing silly buggers with descriptor table (e.g. closes that descriptor behind the first thread's back), it's a userland bug and the kernel is fine. Two threads sharing descriptor table are likely to share memory as well, so they can step on each others' toes in a lot of other ways. Your situation is different - you have kernel-side code putting stuff into descriptor table and expecting the same thing to be found at the same place later on. _That_ is a trouble waiting to happen. > But I agree, the whole logic of this file is weird, and not clean. > > Luckily, with the help of the bpf folks, I came to a better solution, > with bpf_struct_ops[0]. Hadn't looked at it yet...