On Wed, Feb 05, 2025 at 12:49:30PM +0100, Christian Brauner wrote:
> On Tue, Feb 04, 2025 at 05:05:06PM +0100, Paolo Bonzini wrote:
> > On Tue, Feb 4, 2025 at 3:19 PM Christian Brauner <brauner@xxxxxxxxxx> wrote:
> > >
> > > On Mon, Jan 27, 2025 at 04:15:01PM +0100, Paolo Bonzini wrote:
> > > > On Mon, Jan 27, 2025 at 3:10 PM Oleg Nesterov <oleg@xxxxxxxxxx> wrote:
> > > > > On 01/26, Linus Torvalds wrote:
> > > > > > On Sun, 26 Jan 2025 at 10:54, Oleg Nesterov <oleg@xxxxxxxxxx> wrote:
> > > > > > >
> > > > > > > I don't think we even need to detect the /proc/self/ or /proc/thread-self/
> > > > > > > case, next_tid() can just check same_thread_group,
> > > > > >
> > > > > > That was my thinking yes.
> > > > > >
> > > > > > If we exclude them from /proc/*/task entirely, I'd worry that it would
> > > > > > hide it from some management tool and be used for nefarious purposes
> > > > >
> > > > > Agreed,
> > > > >
> > > > > > (even if they then show up elsewhere that the tool wouldn't look at).
> > > > >
> > > > > Even if we move them from /proc/*/task to /proc ?
> > > >
> > > > Indeed---as long as they show up somewhere, it's not worse than it
> > > > used to be. The reason why I'd prefer them to stay in /proc/*/task is
> > > > that moving them away at least partly negates the benefits of the
> > > > "workers are children of the starter" model. For example it
> > > > complicates measuring their cost within the process that runs the VM.
> > > > Maybe it's more of a romantic thing than a real practical issue,
> > > > because in the real world resource accounting for VMs is done via
> > > > cgroups. But unlike the lazy creation in KVM, which is overall pretty
> > > > self-contained, I am afraid the ugliness in procfs would be much worse
> > > > compared to the benefit, if there's a benefit at all.
> > > >
> > > > > Perhaps, I honestly do not know what will/can confuse userspace more.
> > > >
> > > > At the very least, marking workers as "Kthread: 1" makes sense and
>
> You mean in /proc/<pid>/status? Yeah, we can do that. This expands the
> definition of Kthread a bit. It would now mean anything that the kernel
> spawned for userspace. But that is probably fine.
>
> But it won't help with the problem of just checking /proc/<pid>/task/ to
> figure out whether the caller is single-threaded or not. If the caller
> has more than 1 entry in there they need to walk through all of them and
> parse through /proc/<pid>/status to discount them if they're kernel
> threads.
>
> > > > should not cause too much confusion. I wouldn't go beyond that unless
> > > > we get more reports of similar issues, and I'm not even sure how
> > > > common it is for userspace libraries to check for single-threadedness.
> > >
> > > Sorry, just saw this thread now.
> > >
> > > What if we did what Linus suggests and hide (odd) user workers from
> > > /proc/<pid>/task/* but also added /proc/<pid>/workers/*. The latter
> > > would only list the workers that got spawned by the kernel for that
> > > particular task? This would acknowledge their somewhat special status
> > > and allow userspace to still detect them as "Hey, I didn't actually
> > > spawn those, they got shoved into my workload by the kernel for me.".
> >
> > Wouldn't the workers then disappear completely from ps, top or other
> > tools that look at /proc/$PID/task? That seems a bit too underhanded
> > towards userspace...
>
> So maybe, but then there's also the possibility to do:
>
> - Have /proc/<pid>/status list all tasks.
> - Have /proc/<pid>/worker only list user workers spawned by the kernel
>   for userspace.
>
> count(/proc/<pid>/status) - count(/proc/<pid>/workers) == 1 => (userspace) single threaded
>
> My wider point is that I would prefer we add something that is
> consistent and doesn't have to give the caller a different view than a
> foreign task. I think that will just create confusion in the long run.
So what I had in mind was (quickly sketched) the rough draft below. This
will unconditionally skip PF_USER_WORKER tasks in /proc/<pid>/task and
will only list them in /proc/<pid>/worker.

 fs/proc/base.c | 131 ++++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 126 insertions(+), 5 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index cd89e956c322..60e6b2cea259 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3315,10 +3315,13 @@ static int proc_stack_depth(struct seq_file *m, struct pid_namespace *ns,
  * Thread groups
  */
 static const struct file_operations proc_task_operations;
+static const struct file_operations proc_worker_operations;
 static const struct inode_operations proc_task_inode_operations;
+static const struct inode_operations proc_worker_inode_operations;
 
 static const struct pid_entry tgid_base_stuff[] = {
 	DIR("task",       S_IRUGO|S_IXUGO, proc_task_inode_operations, proc_task_operations),
+	DIR("worker",     S_IRUGO|S_IXUGO, proc_worker_inode_operations, proc_worker_operations),
 	DIR("fd",         S_IRUSR|S_IXUSR, proc_fd_inode_operations, proc_fd_operations),
 	DIR("map_files",  S_IRUSR|S_IXUSR, proc_map_files_inode_operations, proc_map_files_operations),
 	DIR("fdinfo",     S_IRUGO|S_IXUGO, proc_fdinfo_inode_operations, proc_fdinfo_operations),
@@ -3835,11 +3838,14 @@ static struct dentry *proc_task_lookup(struct inode *dir, struct dentry * dentry
 
 	fs_info = proc_sb_info(dentry->d_sb);
 	ns = fs_info->pid_ns;
-	rcu_read_lock();
-	task = find_task_by_pid_ns(tid, ns);
-	if (task)
-		get_task_struct(task);
-	rcu_read_unlock();
+	scoped_guard(rcu) {
+		task = find_task_by_pid_ns(tid, ns);
+		if (task) {
+			if (task->flags & PF_USER_WORKER)
+				goto out;
+			get_task_struct(task);
+		}
+	}
 	if (!task)
 		goto out;
 	if (!same_thread_group(leader, task))
@@ -3949,10 +3955,14 @@ static int proc_task_readdir(struct file *file, struct dir_context *ctx)
 	tid = (int)(intptr_t)file->private_data;
 	file->private_data = NULL;
 	for (task = first_tid(proc_pid(inode), tid, ctx->pos - 2,
			      ns);
 	     task;
 	     task = next_tid(task), ctx->pos++) {
 		char name[10 + 1];
 		unsigned int len;
 
+		/* Skip user workers; ctx->pos still advances for them. */
+		if (task->flags & PF_USER_WORKER)
+			continue;
+
 		tid = task_pid_nr_ns(task, ns);
@@ -3987,6 +3997,104 @@ static int proc_task_getattr(struct mnt_idmap *idmap,
 	return 0;
 }
 
+static struct dentry *
+proc_worker_lookup(struct inode *dir, struct dentry *dentry, unsigned int flags)
+{
+	struct task_struct *task;
+	struct task_struct *leader = get_proc_task(dir);
+	unsigned tid;
+	struct proc_fs_info *fs_info;
+	struct pid_namespace *ns;
+	struct dentry *result = ERR_PTR(-ENOENT);
+
+	if (!leader)
+		goto out_no_task;
+
+	tid = name_to_int(&dentry->d_name);
+	if (tid == ~0U)
+		goto out;
+
+	fs_info = proc_sb_info(dentry->d_sb);
+	ns = fs_info->pid_ns;
+	scoped_guard(rcu) {
+		task = find_task_by_pid_ns(tid, ns);
+		if (task) {
+			if (!(task->flags & PF_USER_WORKER))
+				goto out;
+			get_task_struct(task);
+		}
+	}
+	if (!task)
+		goto out;
+	if (!same_thread_group(leader, task))
+		goto out_drop_task;
+
+	result = proc_task_instantiate(dentry, task, NULL);
+out_drop_task:
+	put_task_struct(task);
+out:
+	put_task_struct(leader);
+out_no_task:
+	return result;
+}
+
+static int proc_worker_getattr(struct mnt_idmap *idmap, const struct path *path,
+			       struct kstat *stat, u32 request_mask,
+			       unsigned int query_flags)
+{
+	generic_fillattr(&nop_mnt_idmap, request_mask, d_inode(path->dentry), stat);
+	return 0;
+}
+
+static int proc_worker_readdir(struct file *file, struct dir_context *ctx)
+{
+	struct inode *inode = file_inode(file);
+	struct task_struct *task;
+	struct pid_namespace *ns;
+	int tid;
+
+	if (proc_inode_is_dead(inode))
+		return -ENOENT;
+
+	if (!dir_emit_dots(file, ctx))
+		return 0;
+
+	/*
+	 * We cache the tid value that the last readdir call couldn't
+	 * return and lseek resets it to 0.
+	 */
+	ns = proc_pid_ns(inode->i_sb);
+	tid = (int)(intptr_t)file->private_data;
+	file->private_data = NULL;
+	for (task = first_tid(proc_pid(inode), tid, ctx->pos - 2, ns);
+	     task;
+	     task = next_tid(task), ctx->pos++) {
+		char name[10 + 1];
+		unsigned int len;
+
+		/* Only kernel-spawned user workers are listed here. */
+		if (!(task->flags & PF_USER_WORKER))
+			continue;
+
+		tid = task_pid_nr_ns(task, ns);
+		if (!tid)
+			continue; /* The task has just exited. */
+		len = snprintf(name, sizeof(name), "%u", tid);
+		if (!proc_fill_cache(file, ctx, name, len,
+				     proc_task_instantiate, task, NULL)) {
+			/*
+			 * Returning this tid failed, save it as the first
+			 * pid for the next readdir call.
+			 */
+			file->private_data = (void *)(intptr_t)tid;
+			put_task_struct(task);
+			break;
+		}
+	}
+
+	return 0;
+}
+
 /*
  * proc_task_readdir() set @file->private_data to a positive integer
  * value, so casting that to u64 is safe. generic_llseek_cookie() will
@@ -4005,6 +4113,19 @@ static loff_t proc_dir_llseek(struct file *file, loff_t offset, int whence)
 	return off;
 }
 
+static const struct inode_operations proc_worker_inode_operations = {
+	.lookup		= proc_worker_lookup,
+	.getattr	= proc_worker_getattr,
+	.setattr	= proc_setattr,
+	.permission	= proc_pid_permission,
+};
+
+static const struct file_operations proc_worker_operations = {
+	.read		= generic_read_dir,
+	.iterate_shared	= proc_worker_readdir,
+	.llseek		= proc_dir_llseek,
+};
+
 static const struct inode_operations proc_task_inode_operations = {
 	.lookup		= proc_task_lookup,
 	.getattr	= proc_task_getattr,
-- 
2.47.2