On Wed, Feb 05, 2025 at 12:49:30PM +0100, Christian Brauner wrote:
> On Tue, Feb 04, 2025 at 05:05:06PM +0100, Paolo Bonzini wrote:
> > On Tue, Feb 4, 2025 at 3:19 PM Christian Brauner <brauner@xxxxxxxxxx> wrote:
> > >
> > > On Mon, Jan 27, 2025 at 04:15:01PM +0100, Paolo Bonzini wrote:
> > > > On Mon, Jan 27, 2025 at 3:10 PM Oleg Nesterov <oleg@xxxxxxxxxx> wrote:
> > > > > On 01/26, Linus Torvalds wrote:
> > > > > > On Sun, 26 Jan 2025 at 10:54, Oleg Nesterov <oleg@xxxxxxxxxx> wrote:
> > > > > > >
> > > > > > > I don't think we even need to detect the /proc/self/ or /proc/thread-self/
> > > > > > > case, next_tid() can just check same_thread_group,
> > > > > >
> > > > > > That was my thinking yes.
> > > > > >
> > > > > > If we exclude them from /proc/*/task entirely, I'd worry that it would
> > > > > > hide it from some management tool and be used for nefarious purposes
> > > > >
> > > > > Agreed,
> > > > >
> > > > > > (even if they then show up elsewhere that the tool wouldn't look at).
> > > > >
> > > > > Even if we move them from /proc/*/task to /proc ?
> > > >
> > > > Indeed---as long as they show up somewhere, it's not worse than it
> > > > used to be. The reason why I'd prefer them to stay in /proc/*/task is
> > > > that moving them away at least partly negates the benefits of the
> > > > "workers are children of the starter" model. For example it
> > > > complicates measuring their cost within the process that runs the VM.
> > > > Maybe it's more of a romantic thing than a real practical issue,
> > > > because in the real world resource accounting for VMs is done via
> > > > cgroups. But unlike the lazy creation in KVM, which is overall pretty
> > > > self-contained, I am afraid the ugliness in procfs would be much worse
> > > > compared to the benefit, if there's a benefit at all.
> > > >
> > > > > Perhaps, I honestly do not know what will/can confuse userspace more.
> > > >
> > > > At the very least, marking workers as "Kthread: 1" makes sense and
>
> You mean in /proc/<pid>/status? Yeah, we can do that. This expands the
> definition of Kthread a bit. It would now mean anything that the kernel
> spawned for userspace. But that is probably fine.
>
> But it won't help with the problem of just checking /proc/<pid>/task/ to
> figure out whether the caller is single-threaded or not. If the caller
> has more than 1 entry in there they need to walk through all of them and
> parse through /proc/<pid>/status to discount them if they're kernel
> threads.
>
> > > > should not cause too much confusion. I wouldn't go beyond that unless
> > > > we get more reports of similar issues, and I'm not even sure how
> > > > common it is for userspace libraries to check for single-threadedness.
> > >
> > > Sorry, just saw this thread now.
> > >
> > > What if we did what Linus suggests and hide (odd) user workers from
> > > /proc/<pid>/task/* but also added /proc/<pid>/workers/*. The latter
> > > would only list the workers that got spawned by the kernel for that
> > > particular task? This would acknowledge their somewhat special status
> > > and allow userspace to still detect them as "Hey, I didn't actually
> > > spawn those, they got shoved into my workload by the kernel for me.".
> >
> > Wouldn't the workers then disappear completely from ps, top or other
> > tools that look at /proc/$PID/task? That seems a bit too underhanded
> > towards userspace...
>
> So maybe, but then there's also the possibility to do:
>
> - Have /proc/<pid>/status list all tasks.
> - Have /proc/<pid>/worker only list user workers spawned by the kernel
>   for userspace.
>
> count(/proc/<pid>/status) - count(/proc/<pid>/workers) == 1 => (userspace) single threaded
>
> My wider point is that I would prefer we add something that is
> consistent and doesn't have to give the caller a different view than a
> foreign task. I think that will just create confusion in the long run.
So what I had in mind was (quickly sketched) the rough draft below. This
will unconditionally skip PF_USER_WORKER tasks in /proc/<pid>/task and
will only list them in /proc/<pid>/worker.

 fs/proc/base.c | 131 ++++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 126 insertions(+), 5 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index cd89e956c322..60e6b2cea259 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3315,10 +3315,13 @@ static int proc_stack_depth(struct seq_file *m, struct pid_namespace *ns,
  * Thread groups
  */
 static const struct file_operations proc_task_operations;
+static const struct file_operations proc_worker_operations;
 static const struct inode_operations proc_task_inode_operations;
+static const struct inode_operations proc_worker_inode_operations;
 
 static const struct pid_entry tgid_base_stuff[] = {
 	DIR("task",       S_IRUGO|S_IXUGO, proc_task_inode_operations, proc_task_operations),
+	DIR("worker",     S_IRUGO|S_IXUGO, proc_worker_inode_operations, proc_worker_operations),
 	DIR("fd",         S_IRUSR|S_IXUSR, proc_fd_inode_operations, proc_fd_operations),
 	DIR("map_files",  S_IRUSR|S_IXUSR, proc_map_files_inode_operations, proc_map_files_operations),
 	DIR("fdinfo",     S_IRUGO|S_IXUGO, proc_fdinfo_inode_operations, proc_fdinfo_operations),
@@ -3835,11 +3838,14 @@ static struct dentry *proc_task_lookup(struct inode *dir, struct dentry * dentry
 
 	fs_info = proc_sb_info(dentry->d_sb);
 	ns = fs_info->pid_ns;
-	rcu_read_lock();
-	task = find_task_by_pid_ns(tid, ns);
-	if (task)
-		get_task_struct(task);
-	rcu_read_unlock();
+	scoped_guard(rcu) {
+		task = find_task_by_pid_ns(tid, ns);
+		if (task) {
+			if (task->flags & PF_USER_WORKER)
+				goto out;
+			get_task_struct(task);
+		}
+	}
 	if (!task)
 		goto out;
 	if (!same_thread_group(leader, task))
@@ -3949,10 +3955,14 @@ static int proc_task_readdir(struct file *file, struct dir_context *ctx)
 	tid = (int)(intptr_t)file->private_data;
 	file->private_data = NULL;
 	for (task = first_tid(proc_pid(inode), tid, ctx->pos - 2,
			      ns);
 	     task;
 	     task = next_tid(task), ctx->pos++) {
 		char name[10 + 1];
 		unsigned int len;
 
+		/* Skip user workers; ctx->pos still advances for them. */
+		if (task->flags & PF_USER_WORKER)
+			continue;
+
 		tid = task_pid_nr_ns(task, ns);
@@ -3987,6 +3997,104 @@ static int proc_task_getattr(struct mnt_idmap *idmap,
 	return 0;
 }
 
+static struct dentry *
+proc_worker_lookup(struct inode *dir, struct dentry *dentry, unsigned int flags)
+{
+	struct task_struct *task;
+	struct task_struct *leader = get_proc_task(dir);
+	unsigned tid;
+	struct proc_fs_info *fs_info;
+	struct pid_namespace *ns;
+	struct dentry *result = ERR_PTR(-ENOENT);
+
+	if (!leader)
+		goto out_no_task;
+
+	tid = name_to_int(&dentry->d_name);
+	if (tid == ~0U)
+		goto out;
+
+	fs_info = proc_sb_info(dentry->d_sb);
+	ns = fs_info->pid_ns;
+	scoped_guard(rcu) {
+		task = find_task_by_pid_ns(tid, ns);
+		if (task) {
+			if (!(task->flags & PF_USER_WORKER))
+				goto out;
+			get_task_struct(task);
+		}
+	}
+	if (!task)
+		goto out;
+	if (!same_thread_group(leader, task))
+		goto out_drop_task;
+
+	result = proc_task_instantiate(dentry, task, NULL);
+out_drop_task:
+	put_task_struct(task);
+out:
+	put_task_struct(leader);
+out_no_task:
+	return result;
+}
+
+static int proc_worker_getattr(struct mnt_idmap *idmap, const struct path *path,
+			       struct kstat *stat, u32 request_mask,
+			       unsigned int query_flags)
+{
+	generic_fillattr(&nop_mnt_idmap, request_mask, d_inode(path->dentry), stat);
+	return 0;
+}
+
+static int proc_worker_readdir(struct file *file, struct dir_context *ctx)
+{
+	struct inode *inode = file_inode(file);
+	struct task_struct *task;
+	struct pid_namespace *ns;
+	int tid;
+
+	if (proc_inode_is_dead(inode))
+		return -ENOENT;
+
+	if (!dir_emit_dots(file, ctx))
+		return 0;
+
+	/*
+	 * We cache the tid value that the last readdir call couldn't
+	 * return and lseek resets it to 0.
+	 */
+	ns = proc_pid_ns(inode->i_sb);
+	tid = (int)(intptr_t)file->private_data;
+	file->private_data = NULL;
+	for (task = first_tid(proc_pid(inode), tid, ctx->pos - 2, ns);
+	     task;
+	     task = next_tid(task), ctx->pos++) {
+		char name[10 + 1];
+		unsigned int len;
+
+		/* Only kernel-spawned user workers are listed here. */
+		if (!(task->flags & PF_USER_WORKER))
+			continue;
+
+		tid = task_pid_nr_ns(task, ns);
+		if (!tid)
+			continue; /* The task has just exited. */
+		len = snprintf(name, sizeof(name), "%u", tid);
+		if (!proc_fill_cache(file, ctx, name, len,
+				     proc_task_instantiate, task, NULL)) {
+			/*
+			 * Returning this tid failed, save it as the first
+			 * pid for the next readdir call.
+			 */
+			file->private_data = (void *)(intptr_t)tid;
+			put_task_struct(task);
+			break;
+		}
+	}
+
+	return 0;
+}
+
 /*
  * proc_task_readdir() set @file->private_data to a positive integer
  * value, so casting that to u64 is safe. generic_llseek_cookie() will
@@ -4005,6 +4113,19 @@ static loff_t proc_dir_llseek(struct file *file, loff_t offset, int whence)
 	return off;
 }
 
+static const struct inode_operations proc_worker_inode_operations = {
+	.lookup		= proc_worker_lookup,
+	.getattr	= proc_worker_getattr,
+	.setattr	= proc_setattr,
+	.permission	= proc_pid_permission,
+};
+
+static const struct file_operations proc_worker_operations = {
+	.read		= generic_read_dir,
+	.iterate_shared	= proc_worker_readdir,
+	.llseek		= proc_dir_llseek,
+};
+
 static const struct inode_operations proc_task_inode_operations = {
 	.lookup		= proc_task_lookup,
 	.getattr	= proc_task_getattr,
-- 
2.47.2