On Monday, March 04, 2024 16:05 EET, Christian Brauner <brauner@xxxxxxxxxx> wrote: > On Mon, Mar 04, 2024 at 01:48:19PM +0000, Adrian Ratiu wrote: > > On Monday, March 04, 2024 15:20 EET, Christian Brauner <brauner@xxxxxxxxxx> wrote: > > > > > On Fri, Mar 01, 2024 at 11:34:42PM +0200, Adrian Ratiu wrote: > > > > Prior to v2.6.39 write access to /proc/<pid>/mem was restricted, > > > > after which it got allowed in commit 198214a7ee50 ("proc: enable > > > > writing to /proc/pid/mem"). Famous last words from that patch: > > > > "no longer a security hazard". :) > > > > > > > > Afterwards exploits appeared started causing drama like [1]. The > > > > /proc/*/mem exploits can be rather sophisticated like [2] which > > > > installed an arbitrary payload from noexec storage into a running > > > > process then exec'd it, which itself could include an ELF loader > > > > to run arbitrary code off noexec storage. > > > > > > > > As part of hardening against these types of attacks, distrbutions > > > > can restrict /proc/*/mem to only allow writes when they makes sense, > > > > like in case of debuggers which have ptrace permissions, as they > > > > are able to access memory anyway via PTRACE_POKEDATA and friends. > > > > > > > > Dropping the mode bits disables write access for non-root users. > > > > Trying to `chmod` the paths back fails as the kernel rejects it. > > > > > > > > For users with CAP_DAC_OVERRIDE (usually just root) we have to > > > > disable the mem_write callback to avoid bypassing the mode bits. > > > > > > > > Writes can be used to bypass permissions on memory maps, even if a > > > > memory region is mapped r-x (as is a program's executable pages), > > > > the process can open its own /proc/self/mem file and write to the > > > > pages directly. > > > > > > > > Even if seccomp filters block mmap/mprotect calls with W|X perms, > > > > they often cannot block open calls as daemons want to read/write > > > > their own runtime state and seccomp filters cannot check file paths. > > > > Write calls also can't be blocked in general via seccomp. > > > > > > > > Since the mem file is part of the dynamic /proc/<pid>/ space, we > > > > can't run chmod once at boot to restrict it (and trying to react > > > > to every process and run chmod doesn't scale, and the kernel no > > > > longer allows chmod on any of these paths). > > > > > > > > SELinux could be used with a rule to cover all /proc/*/mem files, > > > > but even then having multiple ways to deny an attack is useful in > > > > case on layer fails. > > > > > > > > [1] https://lwn.net/Articles/476947/ > > > > [2] https://issues.chromium.org/issues/40089045 > > > > > > > > Based on an initial patch by Mike Frysinger <vapier@xxxxxxxxxxxx>. > > > > > > > > Cc: Guenter Roeck <groeck@xxxxxxxxxxxx> > > > > Cc: Doug Anderson <dianders@xxxxxxxxxxxx> > > > > Cc: Kees Cook <keescook@xxxxxxxxxxxx> > > > > Cc: Jann Horn <jannh@xxxxxxxxxx> > > > > Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> > > > > Cc: Randy Dunlap <rdunlap@xxxxxxxxxxxxx> > > > > Cc: Christian Brauner <brauner@xxxxxxxxxx> > > > > Co-developed-by: Mike Frysinger <vapier@xxxxxxxxxxxx> > > > > Signed-off-by: Mike Frysinger <vapier@xxxxxxxxxxxx> > > > > Signed-off-by: Adrian Ratiu <adrian.ratiu@xxxxxxxxxxxxx> > > > > --- > > > > Changes in v2: > > > > * Added boot time parameter with default kconfig option > > > > * Moved check earlier in mem_open() instead of mem_write() > > > > * Simplified implementation branching > > > > * Removed dependency on CONFIG_MEMCG > > > > --- > > > > .../admin-guide/kernel-parameters.txt | 4 ++ > > > > fs/proc/base.c | 47 ++++++++++++++++++- > > > > security/Kconfig | 22 +++++++++ > > > > 3 files changed, 71 insertions(+), 2 deletions(-) > > > > > > > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt > > > > index 460b97a1d0da..0647e2f54248 100644 > > > > --- a/Documentation/admin-guide/kernel-parameters.txt > > > > +++ b/Documentation/admin-guide/kernel-parameters.txt > > > > @@ -5618,6 +5618,10 @@ > > > > reset_devices [KNL] Force drivers to reset the underlying device > > > > during initialization. > > > > > > > > + restrict_proc_mem_write= [KNL] > > > > + Enable or disable write access to /proc/*/mem files. > > > > + Default is SECURITY_PROC_MEM_RESTRICT_WRITE_DEFAULT_ON. > > > > + > > > > resume= [SWSUSP] > > > > Specify the partition device for software suspend > > > > Format: > > > > diff --git a/fs/proc/base.c b/fs/proc/base.c > > > > index 98a031ac2648..92f668191312 100644 > > > > --- a/fs/proc/base.c > > > > +++ b/fs/proc/base.c > > > > @@ -152,6 +152,30 @@ struct pid_entry { > > > > NULL, &proc_pid_attr_operations, \ > > > > { .lsmid = LSMID }) > > > > > > > > +#ifdef CONFIG_SECURITY_PROC_MEM_RESTRICT_WRITE > > > > +DEFINE_STATIC_KEY_MAYBE_RO(CONFIG_SECURITY_PROC_MEM_RESTRICT_WRITE_DEFAULT_ON, > > > > + restrict_proc_mem_write); > > > > +static int __init early_restrict_proc_mem_write(char *buf) > > > > +{ > > > > + int ret; > > > > + bool bool_result; > > > > + > > > > + ret = kstrtobool(buf, &bool_result); > > > > + if (ret) > > > > + return ret; > > > > + > > > > + if (bool_result) > > > > + static_branch_enable(&restrict_proc_mem_write); > > > > + else > > > > + static_branch_disable(&restrict_proc_mem_write); > > > > + return 0; > > > > +} > > > > +early_param("restrict_proc_mem_write", early_restrict_proc_mem_write); > > > > +# define PROC_PID_MEM_MODE S_IRUSR > > > > +#else > > > > +# define PROC_PID_MEM_MODE (S_IRUSR|S_IWUSR) > > > > +#endif > > > > + > > > > /* > > > > * Count the number of hardlinks for the pid_entry table, excluding the . > > > > * and .. links. > > > > @@ -829,6 +853,25 @@ static int mem_open(struct inode *inode, struct file *file) > > > > { > > > > int ret = __mem_open(inode, file, PTRACE_MODE_ATTACH); > > > > > > > > +#ifdef CONFIG_SECURITY_PROC_MEM_RESTRICT_WRITE > > > > + struct mm_struct *mm = file->private_data; > > > > + struct task_struct *task = get_proc_task(inode); > > > > + > > > > + if (mm && task) { > > > > + /* Only allow writes by processes already ptracing the target task */ > > > > + if (file->f_mode & FMODE_WRITE && > > > > + static_branch_maybe(CONFIG_SECURITY_PROC_MEM_RESTRICT_WRITE_DEFAULT_ON, > > > > + &restrict_proc_mem_write)) { > > > > + rcu_read_lock(); > > > > + if (!ptracer_capable(current, mm->user_ns) || > > > > + current != ptrace_parent(task)) > > > > + ret = -EACCES; > > > > > > Uhm, this will break the seccomp notifier, no? So you can't turn on > > > SECURITY_PROC_MEM_RESTRICT_WRITE when you want to use the seccomp > > > notifier to do system call interception and rewrite memory locations of > > > the calling task, no? Which is very much relied upon in various > > > container managers and possibly other security tools. > > > > > > Which means that you can't turn this on in any of the regular distros. > > > > > > So you need to either account for the calling task being a seccomp > > > supervisor for the task whose memory it is trying to access or you need > > > to provide a migration path by adding an api that let's caller's perform > > > these writes through the seccomp notifier. > > > > Thanks for raising this! > > > > I did test seccomp filtering/blocking functionality which seemed to > > work but I'll make sure to also test syscall interception before > > sending v3, to confirm whether it breaks. > > > > The simplest solution is to add an exception for seccomp supervisors > > just like we did for tracers, yes, so I'm inclined to go with that if > > needed. :) > > Ok. Note that your patch also doesn't cover process_vm_writev() which > means that you can still use that as an alternative to write to memory - > albeit with a lote more raciness. IOW, a seccomp notifier can do the > dance of: > > pidfd = clone3(CLONE_PIDFD) > // handle error > int fd_mem = open("/proc/$pid/mem", O_RDWR);: > // handle error > if (pidfd_send_signal(pidfd, 0, ...) == 0) > write(fd_mem, ...); > > which lets it avoid the raciness. That's not possible with > process_vm_writev() especially if it's received via AF_UNIX sockets > which happens if the seccomp notifier lives in a proxy process. I know > that happens in the wild. > > So overall, it seems a bit odd to me because why block /proc/<pid>/mem > specifically and not also cover process_vm_writev()? Because that's easy > to block via regular seccomp system call filtering? Yes, easy to block and also respect page permissions (can't write read-only memory) as well as require ptrace access anyway by checking PTRACE_MODE_ATTACH_REALCREDS.