On Wed, Jun 01, 2022 at 09:45:15AM -0700, Linus Torvalds wrote: > On Wed, Jun 1, 2022 at 6:20 AM Alexey Gladkov <legion@xxxxxxxxxx> wrote: > > > > Dynamic memory allocation is needed to modify .data and specify the per > > namespace parameter. The new sysctl API is allowed to get rid of the > > need for such modification. > > Ok, this is looking better. That said, a few comments: > > > > > diff --git a/ipc/ipc_sysctl.c b/ipc/ipc_sysctl.c > > index ef313ecfb53a..833b670c38f3 100644 > > --- a/ipc/ipc_sysctl.c > > +++ b/ipc/ipc_sysctl.c > > @@ -68,26 +68,94 @@ static int proc_ipc_sem_dointvec(struct ctl_table *table, int write, > > return ret; > > } > > > > +static inline void *data_from_ns(struct ctl_context *ctx, struct ctl_table *table); > > + > > +static int ipc_sys_open(struct ctl_context *ctx, struct inode *inode, struct file *file) > > +{ > > + struct ipc_namespace *ns = current->nsproxy->ipc_ns; > > + > > + // For now, we only allow changes in init_user_ns. > > + if (ns->user_ns != &init_user_ns) > > + return -EPERM; > > + > > +#ifdef CONFIG_CHECKPOINT_RESTORE > > + int index = (ctx->table - ipc_sysctls); > > + > > + switch (index) { > > + case IPC_SYSCTL_SEM_NEXT_ID: > > + case IPC_SYSCTL_MSG_NEXT_ID https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/bpf/cgroup.c#n1392: > [...] > > I don't think you actually even compile-tested this, because you're > using these IPC_SYSCTL_SEM_NEXT_ID etc enums before you even declared > them later in the same file. I did it but without CONFIG_CHECKPOINT_RESTORE. This is where I'm not sure who can write to ipc sysctls inside ipc_namespace. > > +static ssize_t ipc_sys_read(struct ctl_context *ctx, struct file *file, > > + char *buffer, size_t *lenp, loff_t *ppos) > > +{ > > + struct ctl_table table = *ctx->table; > > + table.data = data_from_ns(ctx, ctx->table); > > + return table.proc_handler(&table, 0, buffer, lenp, ppos); > > +} > > Can we please fix the names and the types of this new 'ctx' structure? > > Yes, yes, I know the old legacy "sysctl table" is horribly named, and > uses "ctl_table". Sure. > But let's just write it out. It's not some random control table for > anything. It's a very odd and specific thing: "sysctl". Let's use the > full name. > > Also, Please just make that "ctl_data" member in that "ctl_context" > struct not just have a real name, but a real type. Make it literally > be > > struct ipc_namespace *ipc_ns; > > and if we end up with other things wanting other pointers, just add a > new one (or make a union if we care about the size of that allocation, > which I don't see any reason we'd do when it's literally just like a > couple of pointers in size). > > There is no reason to have some pseudo-generic "void *ctl_data" that > makes it ambiguous and allows for type confusion and isn't > self-documenting. I'd rather have a properly typed pointer that is > just initialized to NULL and is not always used or needed, but always > has a clear case for *what* it would be used for. > > Yes, yes, we have f_private etc for things that are really very very > generic and have arbitrary users. But 'sysctl' is not that kind of > truly generic use. Yep. I made ctl_data in the same way as f_private. My idea is that if someone needs to store more than one pointer, they can put a struct there. But it turned out that at least now, apart from ipc_namespace, nothing is needed. > I wish we didn't have that silly "create a temporary ctl_table entry" > either, and I wish it was properly named. But it's not worth the > pointless churn to fix old bad interfaces. But the new ones should > have better names, and try to avoid those bad old decisions. Currently temporary ctl_table is the main strategy for handling sysctl entries. Perhaps it will be possible to get rid of this if we add another get_data() that would return what is currently placed in .data in ctl_table. I mean make getting .data dynamic. > But yeah, I think this all is a step in the right direction. And maybe > some of those cases and old 'ctl_table' things can be migrated to just > using individual read() functions entirely. The whole 'ctl_table' > model was broken, and came from the bad old days with an actual > 'sysctl()' system call. I'm not sure how to get rid of ctl_table since net sysctls are heavily dependent on it. I was wondering if it's possible to get rid of ctl_table but if it's not possible to rewrite everything to some kind of new magic API, then keeping two of them would be a nightmare. Another problem is that ctl_table is being used by __cgroup_bpf_run_filter_sysctl. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/bpf/cgroup.c#n1392 > Because I think it would be lovely if people would move away from the > 'sysctl table' approach entirely for cases where that makes sense, and > these guys that already need special handling are very much in that > situation. Since you think that these patches are a step in the right direction, then I will prepare the first version with your comments in mind. -- Rgrds, legion