Re: [PATCH bpf-next 1/3] bpf: Parameterize task iterators.

Yonghong Song <yhs@xxxxxx> · Thu, 28 Jul 2022 15:54:01 -0700

On 7/28/22 10:52 AM, Kumar Kartikeya Dwivedi wrote:
On Thu, 28 Jul 2022 at 19:08, Yonghong Song <yhs@xxxxxx> wrote:
[...]

There is one problem here. The current pidfd_open syscall
only supports thread-group leader, i.e., main thread, i.e.,
it won't support any non-main-thread tid's. Yes, thread-group
leader and other threads should share the same vma and files
in most of times, but it still possible different threads
in the same process may have different files which is why
in current task_iter.c we have:
                  *tid = pid_nr_ns(pid, ns);
                  task = get_pid_task(pid, PIDTYPE_PID);
                  if (!task) {
                          ++*tid;
                          goto retry;
                  } else if (skip_if_dup_files &&
!thread_group_leader(task) &&
                             task->files == task->group_leader->files) {
                          put_task_struct(task);
                          task = NULL;
                          ++*tid;
                          goto retry;
                  }

Each thread (tid) will have some fields different from
thread-group leader (tgid), e.g., comm and most (if not all)
scheduling related fields.

So it would be good to support for each tid from the start
as it is not clear when pidfd_open will support non
thread-group leader.

I think this is just a missing feature, not a design limitation. If we
just disable thread pifdfd from waitid and pidfd_send_signal, I think
it is ok to use it elsewhere.

Yes, I agree this is a missing feature. It is desirable
that the missing feature gets implemented in kernel or
at least promising work is recognized before we use pidfd
for this task structure.

If it worries wrap around, a reference to the task
can be held when tid passed to the kernel at link
create time. This is similar to pid is passed to
the kernel at pidfd_open syscall. But in practice,
I think taking the reference during read() should
also fine. The race always exist anyway.

Kumar, could you give more details about security
concerns? I am not sure about the tight relationship
between bpf_iter and security. bpf_iter just for
iterating kernel data structures.

There is no security 'concern' per se. But having an fd which is used
to set up the iterator just gives a chance to a BPF LSM to easily
isolate permissions to attach the iterator to a task represented by
that fd. i.e. the fd is an object to which this permission can be
bound, the fd can be passed around (to share the same permission with
others but it is limited to the task corresponding to it), etc. The
permission management is even easier and faster once you have a file
local storage map (which I plan to send soon).

So you mean with fd, bpf_lsm could add a hook in bpf iterator
to enforce policies about whether a particular task can be visited
or not, right? In such cases, I think the policy should be
against a task pointer and a bpf pointer, which have enough information
for the policy to work.

In typical use cases, user space gets a pid (the process own pid or
another process id). Yes, we could create a pidfd with pidfd_open().
And this is pidfd can be manipulated with permissions, and passed
to the bpf iter. The current pidfd creation looks like below:

int pidfd_create(struct pid *pid, unsigned int flags)
{
        int fd;

        if (!pid || !pid_has_task(pid, PIDTYPE_TGID))
                return -EINVAL;

        if (flags & ~(O_NONBLOCK | O_RDWR | O_CLOEXEC))
                return -EINVAL;

        fd = anon_inode_getfd("[pidfd]", &pidfd_fops, get_pid(pid),
                              flags | O_RDWR | O_CLOEXEC);
        if (fd < 0)
                put_pid(pid);

        return fd;
}

I am not sure how security policy can be easily applied to such a
fd. Probably user need to further manipulate fd with fcntl() and I think
most users probably won't do that.

The following are some bpf program lsm hooks:

#ifdef CONFIG_BPF_SYSCALL
LSM_HOOK(int, 0, bpf, int cmd, union bpf_attr *attr, unsigned int size)
LSM_HOOK(int, 0, bpf_map, struct bpf_map *map, fmode_t fmode)
LSM_HOOK(int, 0, bpf_prog, struct bpf_prog *prog)
LSM_HOOK(int, 0, bpf_map_alloc_security, struct bpf_map *map)
LSM_HOOK(void, LSM_RET_VOID, bpf_map_free_security, struct bpf_map *map)
LSM_HOOK(int, 0, bpf_prog_alloc_security, struct bpf_prog_aux *aux)
LSM_HOOK(void, LSM_RET_VOID, bpf_prog_free_security, struct bpf_prog_aux 
*aux)
#endif /* CONFIG_BPF_SYSCALL */

I think task_struct ptr and prog ptr should be enough for the
potential LSM hook.

So you could have two pidfds, one which allows the process to attach
the iter to it, and one which doesn't, without relying on the task's
capabilities, the checks can become more fine grained, and the
original task can even drop its capabilities (because they are bound
to the fd instead), which allows privilege separation.

It becomes a bit difficult when kernel APIs take IDs because then you
don't have any subject (other than the task) to draw the permissions
from.

Maybe I missed something from security perspective. But from design
perspective, I am okay with pidfd since it pushes the responsibility
of pid -> task_struct conversion to user space. Although unlikely,
it still has chances that inside the kernel tid->task_struct may
actually use wrapped around tid...

But since pidfd_open misses tid support, I hesitate to use pidfd
for now.

Maybe Daniel or Alexei can comment as well?

But anyway, all of this was just a suggestion (which is why I
solicited opinions in the original reply). Feel free to drop/ignore if
it is too much of a hassle to support (i.e. it is understandable you
wouldn't want to spend time extending pidfd_open for this).