Hi Kumar, On Thu, Jul 28, 2022 at 10:53 AM Kumar Kartikeya Dwivedi <memxor@xxxxxxxxx> wrote: > > On Thu, 28 Jul 2022 at 19:08, Yonghong Song <yhs@xxxxxx> wrote: > > > [...] > > > > There is one problem here. The current pidfd_open syscall > > only supports thread-group leader, i.e., main thread, i.e., > > it won't support any non-main-thread tid's. Yes, thread-group > > leader and other threads should share the same vma and files > > in most of times, but it still possible different threads > > in the same process may have different files which is why > > in current task_iter.c we have: > > *tid = pid_nr_ns(pid, ns); > > task = get_pid_task(pid, PIDTYPE_PID); > > if (!task) { > > ++*tid; > > goto retry; > > } else if (skip_if_dup_files && > > !thread_group_leader(task) && > > task->files == task->group_leader->files) { > > put_task_struct(task); > > task = NULL; > > ++*tid; > > goto retry; > > } > > > > > > Each thread (tid) will have some fields different from > > thread-group leader (tgid), e.g., comm and most (if not all) > > scheduling related fields. > > > > So it would be good to support for each tid from the start > > as it is not clear when pidfd_open will support non > > thread-group leader. > > I think this is just a missing feature, not a design limitation. If we > just disable thread pifdfd from waitid and pidfd_send_signal, I think > it is ok to use it elsewhere. > > > > > If it worries wrap around, a reference to the task > > can be held when tid passed to the kernel at link > > create time. This is similar to pid is passed to > > the kernel at pidfd_open syscall. But in practice, > > I think taking the reference during read() should > > also fine. The race always exist anyway. > > > > Kumar, could you give more details about security > > concerns? I am not sure about the tight relationship > > between bpf_iter and security. bpf_iter just for > > iterating kernel data structures. > > > > There is no security 'concern' per se. But having an fd which is used > to set up the iterator just gives a chance to a BPF LSM to easily > isolate permissions to attach the iterator to a task represented by > that fd. i.e. the fd is an object to which this permission can be > bound, the fd can be passed around (to share the same permission with > others but it is limited to the task corresponding to it), etc. The > permission management is even easier and faster once you have a file > local storage map (which I plan to send soon). > I like the idea of a file local storage map. I have several questions in mind, but don't want to digress from the discussion under Kui-Feng's patchset. It probably will be clear when seeing your change. Please cc me when you send it out. thanks. > So you could have two pidfds, one which allows the process to attach > the iter to it, and one which doesn't, without relying on the task's > capabilities, the checks can become more fine grained, and the > original task can even drop its capabilities (because they are bound > to the fd instead), which allows privilege separation. > > It becomes a bit difficult when kernel APIs take IDs because then you > don't have any subject (other than the task) to draw the permissions > from. > > But anyway, all of this was just a suggestion (which is why I > solicited opinions in the original reply). Feel free to drop/ignore if > it is too much of a hassle to support (i.e. it is understandable you > wouldn't want to spend time extending pidfd_open for this). On another thread, I was having a discussion with Tejun on FD vs ID for cgroup_iter. I am in favor of ID in general, because it's stateless and matches the info reported by bpf_link_info. This is nice from the userspace's perspective. Hao