On Thu, 28 Jul 2022 at 19:08, Yonghong Song <yhs@xxxxxx> wrote: > > [...] > > There is one problem here. The current pidfd_open syscall > only supports thread-group leader, i.e., main thread, i.e., > it won't support any non-main-thread tid's. Yes, thread-group > leader and other threads should share the same vma and files > in most of times, but it still possible different threads > in the same process may have different files which is why > in current task_iter.c we have: > *tid = pid_nr_ns(pid, ns); > task = get_pid_task(pid, PIDTYPE_PID); > if (!task) { > ++*tid; > goto retry; > } else if (skip_if_dup_files && > !thread_group_leader(task) && > task->files == task->group_leader->files) { > put_task_struct(task); > task = NULL; > ++*tid; > goto retry; > } > > > Each thread (tid) will have some fields different from > thread-group leader (tgid), e.g., comm and most (if not all) > scheduling related fields. > > So it would be good to support for each tid from the start > as it is not clear when pidfd_open will support non > thread-group leader. I think this is just a missing feature, not a design limitation. If we just disable thread pifdfd from waitid and pidfd_send_signal, I think it is ok to use it elsewhere. > > If it worries wrap around, a reference to the task > can be held when tid passed to the kernel at link > create time. This is similar to pid is passed to > the kernel at pidfd_open syscall. But in practice, > I think taking the reference during read() should > also fine. The race always exist anyway. > > Kumar, could you give more details about security > concerns? I am not sure about the tight relationship > between bpf_iter and security. bpf_iter just for > iterating kernel data structures. > There is no security 'concern' per se. But having an fd which is used to set up the iterator just gives a chance to a BPF LSM to easily isolate permissions to attach the iterator to a task represented by that fd. i.e. the fd is an object to which this permission can be bound, the fd can be passed around (to share the same permission with others but it is limited to the task corresponding to it), etc. The permission management is even easier and faster once you have a file local storage map (which I plan to send soon). So you could have two pidfds, one which allows the process to attach the iter to it, and one which doesn't, without relying on the task's capabilities, the checks can become more fine grained, and the original task can even drop its capabilities (because they are bound to the fd instead), which allows privilege separation. It becomes a bit difficult when kernel APIs take IDs because then you don't have any subject (other than the task) to draw the permissions from. But anyway, all of this was just a suggestion (which is why I solicited opinions in the original reply). Feel free to drop/ignore if it is too much of a hassle to support (i.e. it is understandable you wouldn't want to spend time extending pidfd_open for this).