On Mon, Mar 04, 2024 at 02:20:22PM +0100, Christian Brauner wrote: > On Fri, Mar 01, 2024 at 11:34:42PM +0200, Adrian Ratiu wrote: > > Prior to v2.6.39 write access to /proc/<pid>/mem was restricted, > > after which it got allowed in commit 198214a7ee50 ("proc: enable > > writing to /proc/pid/mem"). Famous last words from that patch: > > "no longer a security hazard". :) > > > > Afterwards exploits appeared started causing drama like [1]. The > > /proc/*/mem exploits can be rather sophisticated like [2] which > > installed an arbitrary payload from noexec storage into a running > > process then exec'd it, which itself could include an ELF loader > > to run arbitrary code off noexec storage. > > > > As part of hardening against these types of attacks, distrbutions > > can restrict /proc/*/mem to only allow writes when they makes sense, > > like in case of debuggers which have ptrace permissions, as they > > are able to access memory anyway via PTRACE_POKEDATA and friends. > > > > Dropping the mode bits disables write access for non-root users. > > Trying to `chmod` the paths back fails as the kernel rejects it. > > > > For users with CAP_DAC_OVERRIDE (usually just root) we have to > > disable the mem_write callback to avoid bypassing the mode bits. > > > > Writes can be used to bypass permissions on memory maps, even if a > > memory region is mapped r-x (as is a program's executable pages), > > the process can open its own /proc/self/mem file and write to the > > pages directly. > > > > Even if seccomp filters block mmap/mprotect calls with W|X perms, > > they often cannot block open calls as daemons want to read/write > > their own runtime state and seccomp filters cannot check file paths. > > Write calls also can't be blocked in general via seccomp. > > > > Since the mem file is part of the dynamic /proc/<pid>/ space, we > > can't run chmod once at boot to restrict it (and trying to react > > to every process and run chmod doesn't scale, and the kernel no > > longer allows chmod on any of these paths). > > > > SELinux could be used with a rule to cover all /proc/*/mem files, > > but even then having multiple ways to deny an attack is useful in > > case on layer fails. > > > > [1] https://lwn.net/Articles/476947/ > > [2] https://issues.chromium.org/issues/40089045 > > > > Based on an initial patch by Mike Frysinger <vapier@xxxxxxxxxxxx>. > > > > Cc: Guenter Roeck <groeck@xxxxxxxxxxxx> > > Cc: Doug Anderson <dianders@xxxxxxxxxxxx> > > Cc: Kees Cook <keescook@xxxxxxxxxxxx> > > Cc: Jann Horn <jannh@xxxxxxxxxx> > > Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> > > Cc: Randy Dunlap <rdunlap@xxxxxxxxxxxxx> > > Cc: Christian Brauner <brauner@xxxxxxxxxx> > > Co-developed-by: Mike Frysinger <vapier@xxxxxxxxxxxx> > > Signed-off-by: Mike Frysinger <vapier@xxxxxxxxxxxx> > > Signed-off-by: Adrian Ratiu <adrian.ratiu@xxxxxxxxxxxxx> > > --- > > Changes in v2: > > * Added boot time parameter with default kconfig option > > * Moved check earlier in mem_open() instead of mem_write() > > * Simplified implementation branching > > * Removed dependency on CONFIG_MEMCG > > --- > > .../admin-guide/kernel-parameters.txt | 4 ++ > > fs/proc/base.c | 47 ++++++++++++++++++- > > security/Kconfig | 22 +++++++++ > > 3 files changed, 71 insertions(+), 2 deletions(-) > > > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt > > index 460b97a1d0da..0647e2f54248 100644 > > --- a/Documentation/admin-guide/kernel-parameters.txt > > +++ b/Documentation/admin-guide/kernel-parameters.txt > > @@ -5618,6 +5618,10 @@ > > reset_devices [KNL] Force drivers to reset the underlying device > > during initialization. > > > > + restrict_proc_mem_write= [KNL] > > + Enable or disable write access to /proc/*/mem files. > > + Default is SECURITY_PROC_MEM_RESTRICT_WRITE_DEFAULT_ON. > > + > > resume= [SWSUSP] > > Specify the partition device for software suspend > > Format: > > diff --git a/fs/proc/base.c b/fs/proc/base.c > > index 98a031ac2648..92f668191312 100644 > > --- a/fs/proc/base.c > > +++ b/fs/proc/base.c > > @@ -152,6 +152,30 @@ struct pid_entry { > > NULL, &proc_pid_attr_operations, \ > > { .lsmid = LSMID }) > > > > +#ifdef CONFIG_SECURITY_PROC_MEM_RESTRICT_WRITE > > +DEFINE_STATIC_KEY_MAYBE_RO(CONFIG_SECURITY_PROC_MEM_RESTRICT_WRITE_DEFAULT_ON, > > + restrict_proc_mem_write); > > +static int __init early_restrict_proc_mem_write(char *buf) > > +{ > > + int ret; > > + bool bool_result; > > + > > + ret = kstrtobool(buf, &bool_result); > > + if (ret) > > + return ret; > > + > > + if (bool_result) > > + static_branch_enable(&restrict_proc_mem_write); > > + else > > + static_branch_disable(&restrict_proc_mem_write); > > + return 0; > > +} > > +early_param("restrict_proc_mem_write", early_restrict_proc_mem_write); > > +# define PROC_PID_MEM_MODE S_IRUSR > > +#else > > +# define PROC_PID_MEM_MODE (S_IRUSR|S_IWUSR) > > +#endif > > + > > /* > > * Count the number of hardlinks for the pid_entry table, excluding the . > > * and .. links. > > @@ -829,6 +853,25 @@ static int mem_open(struct inode *inode, struct file *file) > > { > > int ret = __mem_open(inode, file, PTRACE_MODE_ATTACH); > > > > +#ifdef CONFIG_SECURITY_PROC_MEM_RESTRICT_WRITE > > + struct mm_struct *mm = file->private_data; > > + struct task_struct *task = get_proc_task(inode); > > + > > + if (mm && task) { > > + /* Only allow writes by processes already ptracing the target task */ > > + if (file->f_mode & FMODE_WRITE && > > + static_branch_maybe(CONFIG_SECURITY_PROC_MEM_RESTRICT_WRITE_DEFAULT_ON, > > + &restrict_proc_mem_write)) { > > + rcu_read_lock(); > > + if (!ptracer_capable(current, mm->user_ns) || > > + current != ptrace_parent(task)) > > + ret = -EACCES; > > Uhm, this will break the seccomp notifier, no? So you can't turn on > SECURITY_PROC_MEM_RESTRICT_WRITE when you want to use the seccomp > notifier to do system call interception and rewrite memory locations of > the calling task, no? Which is very much relied upon in various > container managers and possibly other security tools. > > Which means that you can't turn this on in any of the regular distros. FWIW, it's a run-time toggle, but yes, let's make sure this works correctly. > So you need to either account for the calling task being a seccomp > supervisor for the task whose memory it is trying to access or you need > to provide a migration path by adding an api that let's caller's perform > these writes through the seccomp notifier. How do seccomp supervisors that use USER_NOTIF do those kinds of memory writes currently? I thought they were actually using ptrace? Everything I'm familiar with is just using SECCOMP_IOCTL_NOTIF_ADDFD, and not doing fancy memory pokes. -Kees -- Kees Cook