Re: [RFC][PATCH] cgroup_threadgroup_rwsem - affects scalability and OOM

Zefan Li <lizefan@xxxxxxxxxx> · Tue, 9 Aug 2016 17:00:59 +0800

> This almost stalls the system, this patch moves the threadgroup_change_begin
> from before cgroup_fork() to just before cgroup_canfork(). Ideally we shouldn't
> have to worry about threadgroup changes till the task is actually added to
> the threadgroup. This avoids having to call reclaim with cgroup_threadgroup_rwsem
> held.
> 
> There are other theoretical issues with this semaphore
> 
> systemd can do
> 
> 1. cgroup_mutex (cgroup_kn_lock_live)
> 2. cgroup_threadgroup_rwsem (W) (__cgroup_procs_write)
> 
> and other threads can go
> 
> 1. cgroup_threadgroup_rwsem (R) (copy_process)
> 2. mem_cgroup_iter (as a part of reclaim) (cgroup_mutex -- rcu lock or cgroup_mutex)
> 
> However, I've not examined them in too much detail or looked at lockdep
> wait chains for those paths.
> 
> I am sure there is a good reason for placing cgroup_threadgroup_rwsem
> where it is today and I might be missing something. I am also surprised
> no-one else has run into it so far.
> 
> Comments?
> 

We used to use cgroup_threadgroup_rwsem for syncronization between threads
in the same threadgroup, but now it has evolved to ensure atomic operations
across multi processes.

For example, I'm trying to fix a race. See https://lkml.org/lkml/2016/8/8/900

And the fix kind of relies on the fact that cgroup_post_fork() is placed
inside the read section of cgroup_threadgroup_rwsem, so that cpuset_fork()
won't race with cgroup migration.

> Cc: Oleg Nesterov <oleg@xxxxxxxxxx>
> Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> Cc: Tejun Heo <tj@xxxxxxxxxx>
> 
> Signed-off-by: Balbir Singh <bsingharora@xxxxxxxxx>
> ---
>  kernel/fork.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 5c2c355..0474fa8 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1406,7 +1406,6 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>  	p->real_start_time = ktime_get_boot_ns();
>  	p->io_context = NULL;
>  	p->audit_context = NULL;
> -	threadgroup_change_begin(current);
>  	cgroup_fork(p);
>  #ifdef CONFIG_NUMA
>  	p->mempolicy = mpol_dup(p->mempolicy);
> @@ -1558,6 +1557,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>  	INIT_LIST_HEAD(&p->thread_group);
>  	p->task_works = NULL;
>  
> +	threadgroup_change_begin(current);
>  	/*
>  	 * Ensure that the cgroup subsystem policies allow the new process to be
>  	 * forked. It should be noted the the new process's css_set can be changed
> @@ -1658,6 +1658,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>  bad_fork_cancel_cgroup:
>  	cgroup_cancel_fork(p);
>  bad_fork_free_pid:
> +	threadgroup_change_end(current);
>  	if (pid != &init_struct_pid)
>  		free_pid(pid);
>  bad_fork_cleanup_thread:
> @@ -1690,7 +1691,6 @@ bad_fork_cleanup_policy:
>  	mpol_put(p->mempolicy);
>  bad_fork_cleanup_threadgroup_lock:
>  #endif
> -	threadgroup_change_end(current);
>  	delayacct_tsk_free(p);
>  bad_fork_cleanup_count:
>  	atomic_dec(&p->cred->user->processes);
> 

--
To unsubscribe from this list: send the line "unsubscribe cgroups" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html