On 3/25/20 11:20 PM, Eric W. Biederman wrote: > Vlastimil Babka <vbabka@xxxxxxx> writes: >> --- a/kernel/sysctl.c >> +++ b/kernel/sysctl.c >> @@ -1980,6 +1980,68 @@ int __init sysctl_init(void) >> return 0; >> } >> >> +/* Set sysctl value passed on kernel command line. */ >> +int process_sysctl_arg(char *param, char *val, >> + const char *unused, void *arg) >> +{ >> + size_t count; >> + char *remaining; >> + int err; >> + loff_t ppos = 0; >> + struct ctl_table *ctl, *found = NULL; >> + >> + if (strncmp(param, "sysctl.", sizeof("sysctl.") - 1)) >> + return 0; > > Is there any way we can use a slash separated path. I know We could, but as others explained, people and tools are used to the dot separation, so I think the only sensible options are supporting only dot, or both dot and slash. > in practice there are not any sysctl names that don't have > a '.' in them but why should we artifically limit ourselves? Existing tools would probably break (or perhaps sysctl(8) is smarter than I think, dunno). > I guess as long as we don't mind not being able to set sysctls > that have a '.' in them it doesn't matter. Right. >> + >> + param += sizeof("sysctl.") - 1; >> + >> + remaining = param; >> + ctl = &sysctl_base_table[0]; >> + >> + while(ctl->procname != 0) { > ^^^^^^^^^^^^^^^^^^ > > Please either test "while(ctl->procname)" or > "while(ctl->procname != NULL)" testing against 0 makes it look like > procname is an integer. The style in the kernel is to test against > NULL, to make it clear when something is a pointer. OK >> + int len = strlen(ctl->procname); > > You should have done "strchr(remaining)" and figured out if there is > another '.' and only compared up to that dot. Probably skipping this > entry entirely if the two lengths don't match. That's also possible, but AFAICS my code works as intended, as I explained in a reply to Kees, and also below: >> + if (strncmp(remaining, ctl->procname, len)) { >> + ctl++; >> + continue; >> + } >> + if (ctl->child) { >> + if (remaining[len] == '.') { >> + remaining += len + 1; >> + ctl = ctl->child; >> + continue; >> + } >> + } else { >> + if (remaining[len] == '\0') { >> + found = ctl; >> + break; >> + } >> + } >> + ctl++; > > There should be exactly one match for a name a table. > If you get here the code should break, not continue on. If there existed e.g. both "vm.swap" and "vm.swappiness" options and user passed "vm.swappiness=10", but the "swap" ctl entry was encountered first, it will succeed the strncmp(), but then realize "swap" was just a prefix of what user specified (remaining[len] is not '\0') and hence continue serching for other matches. >> + } >> + >> + if (!found) { >> + pr_warn("Unknown sysctl param '%s' on command line", param); >> + return 0; >> + } >> + >> + if (!(found->mode & 0200)) { >> + pr_warn("Cannot set sysctl '%s=%s' from command line - not writable", >> + param, val); >> + return 0; >> + } >> + >> + count = strlen(val); >> + err = found->proc_handler(found, 1, val, &count, &ppos); >> + >> + if (err) >> + pr_warn("Error %d setting sysctl '%s=%s' from command line", >> + err, param, val); >> + >> + pr_debug("Set sysctl '%s=%s' from command line", param, val); >> + >> + return 0; >> +} > > You really should be able to have this code live in > fs/proc/proc_sysctl.c and utilize lookup_entry. > > That should give you the ability to lookup any sysctl. If > kernel/sysctl.c is compiled into the kernel proc_sysctl.c is compiled > into the kernel. Systems that don't select CONFIG_PROC_SYSCTL won't > have any sysctl tables installed at all so they do not make sense to > consider or design for. I see. In fact one reason why I tried to avoid the proc stuff was your commit 61a47c1ad3a4 ("sysctl: Remove the sysctl system call") and this part: > As this removes one of the few uses of the internal kernel mount > of proc I hope this allows for even more simplifications of the > proc filesystem. But if you now suggest using the kernel mount then sure, it I don't object make the code simpler and handle all sysctls. > Further it will be faster to lookup the sysctls using the code from > proc_sysctl.c as it constructs an rbtree of all of the entries in > a directory. The code might as well take advantage of that for large > directories. > > Arguably the main sysctl tables in kernel/sysctl.c should be split up so > that things are more localized and there is less global state exported > throughout the kernel. I certainly don't want to discourage anyone from > doing that just so their sysctl can be used on the command line. Fair point. > Hmm. There is a big gotcha in here and I think it should be mentioned. > This code only works because no one has done set_fs(KERNEL_DS). Which > means this only works with strings that are kernel addresses essentially > by mistake. A big fat comment documenting why it is safe to pass in > kernel addresses to a function that takes a "char __user*" pointer > would be very good. Thanks, didn't realize that. > Eric >