On Saturday, March 02, 2024 01:55 EET, Kees Cook <keescook@xxxxxxxxxxxx> wrote: > On Fri, Mar 01, 2024 at 11:34:42PM +0200, Adrian Ratiu wrote: > > Prior to v2.6.39 write access to /proc/<pid>/mem was restricted, > > after which it got allowed in commit 198214a7ee50 ("proc: enable > > writing to /proc/pid/mem"). Famous last words from that patch: > > "no longer a security hazard". :) > > > > Afterwards exploits appeared started causing drama like [1]. The > > nit: I think "appeared" can be dropped here. > > > /proc/*/mem exploits can be rather sophisticated like [2] which > > installed an arbitrary payload from noexec storage into a running > > process then exec'd it, which itself could include an ELF loader > > to run arbitrary code off noexec storage. > > > > As part of hardening against these types of attacks, distrbutions > > can restrict /proc/*/mem to only allow writes when they makes sense, > > like in case of debuggers which have ptrace permissions, as they > > are able to access memory anyway via PTRACE_POKEDATA and friends. > > > > Dropping the mode bits disables write access for non-root users. > > Trying to `chmod` the paths back fails as the kernel rejects it. > > > > For users with CAP_DAC_OVERRIDE (usually just root) we have to > > disable the mem_write callback to avoid bypassing the mode bits. > > > > Writes can be used to bypass permissions on memory maps, even if a > > memory region is mapped r-x (as is a program's executable pages), > > the process can open its own /proc/self/mem file and write to the > > pages directly. > > > > Even if seccomp filters block mmap/mprotect calls with W|X perms, > > they often cannot block open calls as daemons want to read/write > > their own runtime state and seccomp filters cannot check file paths. > > Write calls also can't be blocked in general via seccomp. > > > > Since the mem file is part of the dynamic /proc/<pid>/ space, we > > can't run chmod once at boot to restrict it (and trying to react > > to every process and run chmod doesn't scale, and the kernel no > > longer allows chmod on any of these paths). > > > > SELinux could be used with a rule to cover all /proc/*/mem files, > > but even then having multiple ways to deny an attack is useful in > > case on layer fails. > > Everything above here is good to keep in the commit log, but it's all > the "background". Please also write here what has been done to address > the background above it. e.g.: > > "Introduce a CONFIG and a __ro_after_init runtime toggle to make > it so only processes that are already tracing the task to write to > /proc/<pid>/mem." etc > > > > > [1] https://lwn.net/Articles/476947/ > > [2] https://issues.chromium.org/issues/40089045 > > These can be: > > Link: https://lwn.net/Articles/476947/ [1] > Link: https://issues.chromium.org/issues/40089045 [2] > > > Based on an initial patch by Mike Frysinger <vapier@xxxxxxxxxxxx>. > > > > Cc: Guenter Roeck <groeck@xxxxxxxxxxxx> > > Cc: Doug Anderson <dianders@xxxxxxxxxxxx> > > Cc: Kees Cook <keescook@xxxxxxxxxxxx> > > Cc: Jann Horn <jannh@xxxxxxxxxx> > > Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> > > Cc: Randy Dunlap <rdunlap@xxxxxxxxxxxxx> > > Cc: Christian Brauner <brauner@xxxxxxxxxx> > > Co-developed-by: Mike Frysinger <vapier@xxxxxxxxxxxx> > > Signed-off-by: Mike Frysinger <vapier@xxxxxxxxxxxx> > > Signed-off-by: Adrian Ratiu <adrian.ratiu@xxxxxxxxxxxxx> > > --- > > Changes in v2: > > * Added boot time parameter with default kconfig option > > * Moved check earlier in mem_open() instead of mem_write() > > * Simplified implementation branching > > * Removed dependency on CONFIG_MEMCG > > Can you mention in the commit log what behaviors have been tested with > this patch? For example, I assume gdb still works with > restrict_proc_mem_write=y ? > Thanks, I will address all the above commit message feedback in v3. Yes, gdb and gdbserver work with restrict_proc_mem_write=y. My testing is focused on the correct functioning of GDB and gdbserver (lldb/server use ptrace POKEDATA so they work regardless of restrict_proc_mem_write). This all started from my attempt to fix gdbserver by adding a ptrace fallback in case /proc/pid/mem writes are blocked without any exception, because that breaks basic functionality like setting breakpoints. GDB upstream NAK'ed my ptrace fallback approach because it's doesn't work well with their /proc/pid/mem focused design required for non-stop mode (the default all-stop mode is emulated on top of non-stop), as well as ptrace peek/poke requiring a live task which can cause memory access problems if the ptraced task dies. Other solutions were considered by GDB upstream, including using the process_vm_writev & co, but they respect page permissions and GDB has to write RO pages to set breakpoints. In the end GDB maintainers directed me to do a proper kernel fix with an exception for tasks already ptracing others, because from a security perspective, they can already access tracee memory regardless of /proc/pid/mem restrictions, so here we are. :) > When this is enabled, what _does_ break that people might expect to > work? With the current iteration, all things I tested work as expected. It is rather hard to come up with things that break with restrict_proc_mem_write=y, because other than debuggers and exploits I don't have other use-cases. Obvious things like "echo >/proc/self/mem" get permission denied, but that is expected with restrict_proc_mem_write=y, so I wouldn't classify it as breakage. In theory there might be some weird/legacy apps which might break, so that is why I suggest we land the mechanism as default off, and later, after it gets tested in various distributions, pull the trigger to make it default on. > > > --- > > .../admin-guide/kernel-parameters.txt | 4 ++ > > fs/proc/base.c | 47 ++++++++++++++++++- > > security/Kconfig | 22 +++++++++ > > 3 files changed, 71 insertions(+), 2 deletions(-) > > > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt > > index 460b97a1d0da..0647e2f54248 100644 > > --- a/Documentation/admin-guide/kernel-parameters.txt > > +++ b/Documentation/admin-guide/kernel-parameters.txt > > @@ -5618,6 +5618,10 @@ > > reset_devices [KNL] Force drivers to reset the underlying device > > during initialization. > > > > + restrict_proc_mem_write= [KNL] > > Please add here: > > Format: <bool> > Ack, will do in v3. > > + Enable or disable write access to /proc/*/mem files. > > + Default is SECURITY_PROC_MEM_RESTRICT_WRITE_DEFAULT_ON. > > + > > resume= [SWSUSP] > > Specify the partition device for software suspend > > Format: > > diff --git a/fs/proc/base.c b/fs/proc/base.c > > index 98a031ac2648..92f668191312 100644 > > --- a/fs/proc/base.c > > +++ b/fs/proc/base.c > > @@ -152,6 +152,30 @@ struct pid_entry { > > NULL, &proc_pid_attr_operations, \ > > { .lsmid = LSMID }) > > > > +#ifdef CONFIG_SECURITY_PROC_MEM_RESTRICT_WRITE > > Please drop this CONFIG entirely -- it should be always available for > all builds of the kernel. Only CONFIG_SECURITY_PROC_MEM_RESTRICT_WRITE_DEFAULT_ON > needs to remain. > Ack, will do in v3. > > +DEFINE_STATIC_KEY_MAYBE_RO(CONFIG_SECURITY_PROC_MEM_RESTRICT_WRITE_DEFAULT_ON, > > + restrict_proc_mem_write); > > +static int __init early_restrict_proc_mem_write(char *buf) > > +{ > > + int ret; > > + bool bool_result; > > + > > + ret = kstrtobool(buf, &bool_result); > > + if (ret) > > + return ret; > > + > > + if (bool_result) > > + static_branch_enable(&restrict_proc_mem_write); > > + else > > + static_branch_disable(&restrict_proc_mem_write); > > + return 0; > > +} > > +early_param("restrict_proc_mem_write", early_restrict_proc_mem_write); > > +# define PROC_PID_MEM_MODE S_IRUSR > > +#else > > +# define PROC_PID_MEM_MODE (S_IRUSR|S_IWUSR) > > +#endif > > PROC_PID_MEM_MODE will need to be a __ro_after_init variable, set by > early_restrict_proc_mem_write, otherwise the mode won't change based on > the runtime setting. e.g.: > > #ifdef CONFIG_SECURITY_PROC_MEM_RESTRICT_WRITE_DEFAULT_ON > mode_t proc_pid_mem_mode __ro_after_init = S_IRUSR; > #else > mode_t proc_pid_mem_mode __ro_after_init = (S_IRUSR|S_IWUSR); > #endif > > DEFINE_STATIC_KEY_MAYBE_RO(CONFIG_SECURITY_PROC_MEM_RESTRICT_WRITE_DEFAULT_ON, > restrict_proc_mem_write); > ... > if (bool_result) { > static_branch_enable(&restrict_proc_mem_write); > proc_pid_mem_mode = S_IRUSR; > } else { > static_branch_disable(&restrict_proc_mem_write); > proc_pid_mem_mode = (S_IRUSR|S_IWUSR); > } > ... > REG("mem", proc_pid_mem_mode, proc_mem_operations), > > Ack, will do in v3. > > + > > /* > > * Count the number of hardlinks for the pid_entry table, excluding the . > > * and .. links. > > @@ -829,6 +853,25 @@ static int mem_open(struct inode *inode, struct file *file) > > { > > int ret = __mem_open(inode, file, PTRACE_MODE_ATTACH); > > > > +#ifdef CONFIG_SECURITY_PROC_MEM_RESTRICT_WRITE > > Drop this ifdef (as mentioned above). > > > + struct mm_struct *mm = file->private_data; > > + struct task_struct *task = get_proc_task(inode); > > + > > + if (mm && task) { > > + /* Only allow writes by processes already ptracing the target task */ > > + if (file->f_mode & FMODE_WRITE && > > + static_branch_maybe(CONFIG_SECURITY_PROC_MEM_RESTRICT_WRITE_DEFAULT_ON, > > + &restrict_proc_mem_write)) { > > Do we need to also do an mm_access() on the task to verify that the task > we're about to check has its mm still matching file->private_data? The > PID can change out from under us (but the mm cannot). > Likely yes, will look into this for v3. > > + rcu_read_lock(); > > + if (!ptracer_capable(current, mm->user_ns) || > > + current != ptrace_parent(task)) > > If you're just allowing "already ptracing", why include the > ptracer_capable() check? > It is a very good observation that the check is redundant. :) It is a remnant from a previous iteration of this patch, from when I was proposing solutions to GDB upstream. I left it there because it doesn't do much harm to verify capability as well, more of a precaution / test invariant than anything else. I'll remove it in v3 since it might cause confusion. > > + ret = -EACCES; > > + rcu_read_unlock(); > > + } > > + put_task_struct(task); > > + } > > +#endif > > + > > /* OK to pass negative loff_t, we can catch out-of-range */ > > file->f_mode |= FMODE_UNSIGNED_OFFSET; > > > > @@ -3281,7 +3324,7 @@ static const struct pid_entry tgid_base_stuff[] = { > > #ifdef CONFIG_NUMA > > REG("numa_maps", S_IRUGO, proc_pid_numa_maps_operations), > > #endif > > - REG("mem", S_IRUSR|S_IWUSR, proc_mem_operations), > > + REG("mem", PROC_PID_MEM_MODE, proc_mem_operations), > > LNK("cwd", proc_cwd_link), > > LNK("root", proc_root_link), > > LNK("exe", proc_exe_link), > > @@ -3631,7 +3674,7 @@ static const struct pid_entry tid_base_stuff[] = { > > #ifdef CONFIG_NUMA > > REG("numa_maps", S_IRUGO, proc_pid_numa_maps_operations), > > #endif > > - REG("mem", S_IRUSR|S_IWUSR, proc_mem_operations), > > + REG("mem", PROC_PID_MEM_MODE, proc_mem_operations), > > LNK("cwd", proc_cwd_link), > > LNK("root", proc_root_link), > > LNK("exe", proc_exe_link), > > diff --git a/security/Kconfig b/security/Kconfig > > index 412e76f1575d..ffee9e847ed9 100644 > > --- a/security/Kconfig > > +++ b/security/Kconfig > > @@ -19,6 +19,28 @@ config SECURITY_DMESG_RESTRICT > > > > If you are unsure how to answer this question, answer N. > > > > +config SECURITY_PROC_MEM_RESTRICT_WRITE > > + bool "Restrict /proc/*/mem write access" > > + default n > > + help > > + This restricts writes to /proc/<pid>/mem, except when the current > > + process ptraces the /proc/<pid>/mem task, because a ptracer already > > + has write access to the tracee memory. > > + > > + Write access to this file allows bypassing memory map permissions, > > + such as modifying read-only code. > > + > > + If you are unsure how to answer this question, answer N. > > + > > +config SECURITY_PROC_MEM_RESTRICT_WRITE_DEFAULT_ON > > + bool "Default state of /proc/*/mem write restriction" > > + depends on SECURITY_PROC_MEM_RESTRICT_WRITE > > + default y > > + help > > + /proc/*/mem write access is controlled by kernel boot param > > + "restrict_proc_mem_write" and this config chooses the default > > + boot state. > > As mentioned, I'd say merge the help texts here, but drop > SECURITY_PROC_MEM_RESTRICT_WRITE. > Ack, will do in v3. > > + > > config SECURITY > > bool "Enable different security models" > > depends on SYSFS > > -- > > 2.30.2 > > > > Thanks for this! I look forward to turning it on. :) > Thank you very much for all your feedback! It is much appreciated. I'll wait a few more days before sending v3 to let others comment, then address everything. > -Kees > > -- > Kees Cook