Re: [RFC PATCH] rlimit: Account nproc per-usernamespace/per-user

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Quoting Nikolay Borisov (kernel@xxxxxxxx):
> 
> 
> On 10/26/2016 08:25 PM, Serge E. Hallyn wrote:
> > On Wed, Oct 26, 2016 at 03:40:27PM +0300, Nikolay Borisov wrote:
> >> There are container setups which map the same kuids to
> >> different containers. In such situation what will happen is
> >> that same uid's in different containers will map to the same
> >> underlying user on the matchine (e.g. same struct user). One
> >> implication of this is that the number of processes for that
> >> particular user are going to be shared among all the same uids
> >> in the container. This is problematic, as it means a user in
> >> containerA can potentially exhaust the process limit such that
> >> a user in containerB cannot spawn any processes.
> > 
> > Hi - thanks for the description.  Based on that, though, I worry
> > that it is a feature we do not want.  Nothing explicitly prohibits
> > sharing kuids in different containers, but it is is sharing.  If
> > you want greater isolation between two containers, you must not share
> > any kuids.
> > 
> > I'm not saying nack, but i am saying it seems a misguided feature
> > which could lead people to think sharing uids is safer than it is.
> 
> I agree that in order for this to be considered "secure" it relies on
> the assumption that there is no leakage between containers. However,
> there are currently setups which rely on this behavior for whatever
> (mis)guided reasons. Furthermore the current design of namespaces
> doesn't do anything to prevent such uses. Given this I don't think it be
> fair to completely disregard them, hence the patch.

I somehow had missed the fact that (if I read below correctly) you
are actually solving the problem for RLIMIT_NPROC?  That's worthwhile
then.  I thought the ucounts checks were independent and RLIMIT_NPROC
failures were still going to mysteriously plague sibling containers.

I do still worry about the performance impact of adding the get_ucounts()
in those hot paths below.  Have you done any perf measurements?

> >> Fix this by utilising the newly introduced UCOUNT infrastructure,
> >> so that process counts are now being accounted per-userns/per-user.
> >> In case when a machine is running in non-container setup then
> >> the accounting semantics is virtually the same, given that there is
> >> going to be one ucount-per-user struct.
> >>
> >> Signed-off-by: Nikolay Borisov <kernel@xxxxxxxx>
> >> ---
> >> Hello Eric, 
> >>
> >> Now that ucount has been merged I took the liberty of 
> >> experimenting what the nproc-per-userns approach would 
> >> look like and here is the result. In my previous [0] you 
> >> expressed concerns regarding not having hierarchical limits, 
> >> essentially allowing users to circumvent their limits. So
> >> this version doesn't do anything to rectify this issue. 
> >> However, I do think the idea of the rlimit is not to 
> >> forbid the user of spawning more processes, since currently
> >> the limit is being set on a per-process basis, so a malicious 
> >> process can in fact increase that limits and spawn a large number 
> >> of processes. Given this I believe the current patch doesn't make
> >> the situation any worse in that regard. 
> >>
> >> [0] https://lists.linux-foundation.org/pipermail/containers/2015-September/036229.html
> >>
> >>  fs/exec.c                      |  2 +-
> >>  include/linux/cred.h           |  1 +
> >>  include/linux/sched.h          |  1 -
> >>  include/linux/user_namespace.h |  8 ++++++++
> >>  kernel/cred.c                  | 18 +++++++++++-------
> >>  kernel/exit.c                  |  2 +-
> >>  kernel/fork.c                  |  7 +++++--
> >>  kernel/sys.c                   |  2 +-
> >>  kernel/ucount.c                | 16 ++++++++++++++++
> >>  kernel/user.c                  |  1 -
> >>  10 files changed, 44 insertions(+), 14 deletions(-)
> >>
> >> diff --git a/fs/exec.c b/fs/exec.c
> >> index 6fcfb3f7b137..8126e00a8d3e 100644
> >> --- a/fs/exec.c
> >> +++ b/fs/exec.c
> >> @@ -1651,7 +1651,7 @@ static int do_execveat_common(int fd, struct filename *filename,
> >>  	 * whether NPROC limit is still exceeded.
> >>  	 */
> >>  	if ((current->flags & PF_NPROC_EXCEEDED) &&
> >> -	    atomic_read(&current_user()->processes) > rlimit(RLIMIT_NPROC)) {
> >> +	    atomic_read(&current_cred()->ucounts->ulimit[ULIMIT_NPROC]) > rlimit(RLIMIT_NPROC)) {
> >>  		retval = -EAGAIN;
> >>  		goto out_ret;
> >>  	}
> >> diff --git a/include/linux/cred.h b/include/linux/cred.h
> >> index f0e70a1bb3ac..c3bb4d20ff16 100644
> >> --- a/include/linux/cred.h
> >> +++ b/include/linux/cred.h
> >> @@ -141,6 +141,7 @@ struct cred {
> >>  	void		*security;	/* subjective LSM security */
> >>  #endif
> >>  	struct user_struct *user;	/* real user ID subscription */
> >> +	struct ucounts *ucounts;		/* container for per-userns/per-counts */
> >>  	struct user_namespace *user_ns; /* user_ns the caps and keyrings are relative to. */
> >>  	struct group_info *group_info;	/* supplementary groups for euid/fsgid */
> >>  	struct rcu_head	rcu;		/* RCU deletion hook */
> >> diff --git a/include/linux/sched.h b/include/linux/sched.h
> >> index 348f51b0ec92..3709d70fe2de 100644
> >> --- a/include/linux/sched.h
> >> +++ b/include/linux/sched.h
> >> @@ -842,7 +842,6 @@ static inline int signal_group_exit(const struct signal_struct *sig)
> >>   */
> >>  struct user_struct {
> >>  	atomic_t __count;	/* reference count */
> >> -	atomic_t processes;	/* How many processes does this user have? */
> >>  	atomic_t sigpending;	/* How many pending signals does this user have? */
> >>  #ifdef CONFIG_INOTIFY_USER
> >>  	atomic_t inotify_watches; /* How many inotify watches does this user have? */
> >> diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
> >> index eb209d4523f5..d4fc31ae9d85 100644
> >> --- a/include/linux/user_namespace.h
> >> +++ b/include/linux/user_namespace.h
> >> @@ -35,6 +35,11 @@ enum ucount_type {
> >>  	UCOUNT_COUNTS,
> >>  };
> >>  
> >> +enum ulimit_type {
> >> +	ULIMIT_NPROC,
> >> +	ULIMIT_COUNTS,
> >> +};
> >> +
> >>  struct user_namespace {
> >>  	struct uid_gid_map	uid_map;
> >>  	struct uid_gid_map	gid_map;
> >> @@ -67,6 +72,7 @@ struct ucounts {
> >>  	kuid_t uid;
> >>  	atomic_t count;
> >>  	atomic_t ucount[UCOUNT_COUNTS];
> >> +	atomic_t ulimit[ULIMIT_COUNTS];
> >>  };
> >>  
> >>  extern struct user_namespace init_user_ns;
> >> @@ -75,6 +81,8 @@ bool setup_userns_sysctls(struct user_namespace *ns);
> >>  void retire_userns_sysctls(struct user_namespace *ns);
> >>  struct ucounts *inc_ucount(struct user_namespace *ns, kuid_t uid, enum ucount_type type);
> >>  void dec_ucount(struct ucounts *ucounts, enum ucount_type type);
> >> +struct ucounts *inc_ulimit(struct user_namespace *ns, kuid_t uid, enum ulimit_type type);
> >> +void dec_ulimit(struct ucounts *ucounts, enum ulimit_type type);
> >>  
> >>  #ifdef CONFIG_USER_NS
> >>  
> >> diff --git a/kernel/cred.c b/kernel/cred.c
> >> index 5f264fb5737d..dd9ffc0396bb 100644
> >> --- a/kernel/cred.c
> >> +++ b/kernel/cred.c
> >> @@ -330,13 +330,16 @@ int copy_creds(struct task_struct *p, unsigned long clone_flags)
> >>  #endif
> >>  		clone_flags & CLONE_THREAD
> >>  	    ) {
> >> -		p->real_cred = get_cred(p->cred);
> >> +		struct cred *c;
> >> +		struct ucounts *uc = inc_ulimit(p->cred->user_ns, p->cred->uid,
> >> +						ULIMIT_NPROC);
> >> +		p->real_cred = c = get_cred(p->cred);
> >>  		get_cred(p->cred);
> >>  		alter_cred_subscribers(p->cred, 2);
> >>  		kdebug("share_creds(%p{%d,%d})",
> >>  		       p->cred, atomic_read(&p->cred->usage),
> >>  		       read_cred_subscribers(p->cred));
> >> -		atomic_inc(&p->cred->user->processes);
> >> +		c->ucounts = uc;
> >>  		return 0;
> >>  	}
> >>  
> >> @@ -369,7 +372,8 @@ int copy_creds(struct task_struct *p, unsigned long clone_flags)
> >>  	}
> >>  #endif
> >>  
> >> -	atomic_inc(&new->user->processes);
> >> +	new->ucounts = inc_ulimit(new->user_ns, new->uid,
> >> +				      ULIMIT_NPROC);
> >>  	p->cred = p->real_cred = get_cred(new);
> >>  	alter_cred_subscribers(new, 2);
> >>  	validate_creds(new);
> >> @@ -461,12 +465,12 @@ int commit_creds(struct cred *new)
> >>  	 * in set_user().
> >>  	 */
> >>  	alter_cred_subscribers(new, 2);
> >> -	if (new->user != old->user)
> >> -		atomic_inc(&new->user->processes);
> >> +	if (new->user != old->user || old->user_ns != new->user_ns)
> >> +		new->ucounts = inc_ulimit(new->user_ns, new->uid, ULIMIT_NPROC);
> >>  	rcu_assign_pointer(task->real_cred, new);
> >>  	rcu_assign_pointer(task->cred, new);
> >> -	if (new->user != old->user)
> >> -		atomic_dec(&old->user->processes);
> >> +	if (new->user != old->user || old->user_ns != new->user_ns)
> >> +		dec_ulimit(old->ucounts, ULIMIT_NPROC);
> >>  	alter_cred_subscribers(old, -2);
> >>  
> >>  	/* send notifications */
> >> diff --git a/kernel/exit.c b/kernel/exit.c
> >> index 9d68c45ebbe3..7ac8954230bd 100644
> >> --- a/kernel/exit.c
> >> +++ b/kernel/exit.c
> >> @@ -173,7 +173,7 @@ void release_task(struct task_struct *p)
> >>  	/* don't need to get the RCU readlock here - the process is dead and
> >>  	 * can't be modifying its own credentials. But shut RCU-lockdep up */
> >>  	rcu_read_lock();
> >> -	atomic_dec(&__task_cred(p)->user->processes);
> >> +	dec_ulimit(__task_cred(p)->ucounts, ULIMIT_NPROC);
> >>  	rcu_read_unlock();
> >>  
> >>  	proc_flush_task(p);
> >> diff --git a/kernel/fork.c b/kernel/fork.c
> >> index 623259fc794d..0b16c9d920b1 100644
> >> --- a/kernel/fork.c
> >> +++ b/kernel/fork.c
> >> @@ -425,6 +425,7 @@ int arch_task_struct_size __read_mostly;
> >>  void __init fork_init(void)
> >>  {
> >>  	int i;
> >> +	struct cred *c = init_task.cred;
> >>  #ifndef CONFIG_ARCH_TASK_STRUCT_ALLOCATOR
> >>  #ifndef ARCH_MIN_TASKALIGN
> >>  #define ARCH_MIN_TASKALIGN	L1_CACHE_BYTES
> >> @@ -448,6 +449,8 @@ void __init fork_init(void)
> >>  	for (i = 0; i < UCOUNT_COUNTS; i++) {
> >>  		init_user_ns.ucount_max[i] = max_threads/2;
> >>  	}
> >> +
> >> +	c->ucounts = inc_ulimit(&init_user_ns, make_kuid(&init_user_ns, 0), ULIMIT_NPROC);
> >>  }
> >>  
> >>  int __weak arch_dup_task_struct(struct task_struct *dst,
> >> @@ -1515,7 +1518,7 @@ static __latent_entropy struct task_struct *copy_process(
> >>  	DEBUG_LOCKS_WARN_ON(!p->softirqs_enabled);
> >>  #endif
> >>  	retval = -EAGAIN;
> >> -	if (atomic_read(&p->real_cred->user->processes) >=
> >> +	if (atomic_read(&p->real_cred->ucounts->ulimit[ULIMIT_NPROC]) >=
> >>  			task_rlimit(p, RLIMIT_NPROC)) {
> >>  		if (p->real_cred->user != INIT_USER &&
> >>  		    !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN))
> >> @@ -1859,7 +1862,7 @@ static __latent_entropy struct task_struct *copy_process(
> >>  #endif
> >>  	delayacct_tsk_free(p);
> >>  bad_fork_cleanup_count:
> >> -	atomic_dec(&p->cred->user->processes);
> >> +	dec_ulimit(p->cred->ucounts, ULIMIT_NPROC);
> >>  	exit_creds(p);
> >>  bad_fork_free:
> >>  	put_task_stack(p);
> >> diff --git a/kernel/sys.c b/kernel/sys.c
> >> index 89d5be418157..143d04ed9da5 100644
> >> --- a/kernel/sys.c
> >> +++ b/kernel/sys.c
> >> @@ -433,7 +433,7 @@ static int set_user(struct cred *new)
> >>  	 * for programs doing set*uid()+execve() by harmlessly deferring the
> >>  	 * failure to the execve() stage.
> >>  	 */
> >> -	if (atomic_read(&new_user->processes) >= rlimit(RLIMIT_NPROC) &&
> >> +	if (atomic_read(&new->ucounts->ulimit[ULIMIT_NPROC]) >= rlimit(RLIMIT_NPROC) &&
> >>  			new_user != INIT_USER)
> >>  		current->flags |= PF_NPROC_EXCEEDED;
> >>  	else
> >> diff --git a/kernel/ucount.c b/kernel/ucount.c
> >> index 9d20d5dd298a..d8f5362f6e17 100644
> >> --- a/kernel/ucount.c
> >> +++ b/kernel/ucount.c
> >> @@ -181,6 +181,22 @@ static inline bool atomic_inc_below(atomic_t *v, int u)
> >>  	}
> >>  }
> >>  
> >> +struct ucounts *inc_ulimit(struct user_namespace *ns, kuid_t uid,
> >> +			   enum ulimit_type type)
> >> +{
> >> +	struct ucounts *ucounts = get_ucounts(ns, uid);
> >> +	atomic_inc(&ucounts->ulimit[type]);
> >> +
> >> +	return ucounts;
> >> +}
> >> +
> >> +void dec_ulimit(struct ucounts *ucounts, enum ulimit_type type)
> >> +{
> >> +	atomic_dec(&ucounts->ulimit[type]);
> >> +	WARN_ON(atomic_read(&ucounts->ulimit[type]) < 0);
> >> +	put_ucounts(ucounts);
> >> +}
> >> +
> >>  struct ucounts *inc_ucount(struct user_namespace *ns, kuid_t uid,
> >>  			   enum ucount_type type)
> >>  {
> >> diff --git a/kernel/user.c b/kernel/user.c
> >> index b069ccbfb0b0..ce1ba1fb96c0 100644
> >> --- a/kernel/user.c
> >> +++ b/kernel/user.c
> >> @@ -90,7 +90,6 @@ static DEFINE_SPINLOCK(uidhash_lock);
> >>  /* root_user.__count is 1, for init task cred */
> >>  struct user_struct root_user = {
> >>  	.__count	= ATOMIC_INIT(1),
> >> -	.processes	= ATOMIC_INIT(1),
> >>  	.sigpending	= ATOMIC_INIT(0),
> >>  	.locked_shm     = 0,
> >>  	.uid		= GLOBAL_ROOT_UID,
> >> -- 
> >> 2.5.0
> >>
> >> _______________________________________________
> >> Containers mailing list
> >> Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx
> >> https://lists.linuxfoundation.org/mailman/listinfo/containers
_______________________________________________
Containers mailing list
Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx
https://lists.linuxfoundation.org/mailman/listinfo/containers



[Index of Archives]     [Cgroups]     [Netdev]     [Linux Wireless]     [Kernel Newbies]     [Security]     [Linux for Hams]     [Netfilter]     [Bugtraq]     [Yosemite Forum]     [MIPS Linux]     [ARM Linux]     [Linux RAID]     [Linux Admin]     [Samba]

  Powered by Linux