On 10/27/2016 05:37 PM, Serge E. Hallyn wrote:
> Quoting Nikolay Borisov (kernel@xxxxxxxx):
>>
>>
>> On 10/26/2016 08:25 PM, Serge E. Hallyn wrote:
>>> On Wed, Oct 26, 2016 at 03:40:27PM +0300, Nikolay Borisov wrote:
>>>> There are container setups which map the same kuids to
>>>> different containers. In such a situation, the same uids in
>>>> different containers will map to the same underlying user on
>>>> the machine (i.e. the same struct user). One implication of
>>>> this is that the process count for that particular user is
>>>> shared among all containers mapping that uid. This is
>>>> problematic, as it means a user in containerA can potentially
>>>> exhaust the process limit such that a user in containerB
>>>> cannot spawn any processes.
>>>
>>> Hi - thanks for the description. Based on that, though, I worry
>>> that it is a feature we do not want. Nothing explicitly prohibits
>>> sharing kuids in different containers, but it is still sharing. If
>>> you want greater isolation between two containers, you must not
>>> share any kuids.
>>>
>>> I'm not saying nack, but I am saying it seems a misguided feature
>>> which could lead people to think sharing uids is safer than it is.
>>
>> I agree that for this to be considered "secure" it relies on the
>> assumption that there is no leakage between containers. However,
>> there are currently setups which rely on this behavior for whatever
>> (mis)guided reasons. Furthermore, the current design of namespaces
>> doesn't do anything to prevent such uses. Given this, I don't think
>> it would be fair to completely disregard them, hence the patch.
>
> I somehow had missed the fact that (if I read below correctly) you
> are actually solving the problem for RLIMIT_NPROC? That's worthwhile
> then. I thought the ucounts checks were independent and RLIMIT_NPROC
> failures were still going to mysteriously plague sibling containers.
>
> I do still worry about the performance impact of adding the
> get_ucounts() in those hot paths below. Have you done any perf
> measurements?

I finally managed to get around to doing some benchmarking. I ran the
tests on 4.9-rc1 with and without my patch. On each kernel I performed
3 tests:

- 5 stress-ng instances, doing 250k forks in the init_user_ns
- 5 stress-ng instances, doing 250k forks in a child of init_user_ns
  (nesting depth of 1)
- 5 stress-ng instances, doing 250k forks in a child of init_user_ns
  (nesting depth of 5)

I ran every experiment 5 times and computed the average and standard
deviation. Here is how I invoked stress-ng:

for i in {1..5}; do time /home/projects/kernel-testing/stress-ng/stress-ng -f 5 --fork-ops 250000 ; done

Every unsharing of a namespace was performed by means of "unshare -r"
(a rough reproduction sketch for the nested runs is appended at the
end of this mail).

The results are as follows (all times in seconds):

                    Stock 4.9-rc1                    Patched 4.9-rc1-nproc
                Real     STDEV    Sys      STDEV     Real     STDEV    Sys      STDEV
init_user_ns    28.872   0.167    49.714   0.51      28.386   0.168    49.217   0.899
depth of 1      27.808   0.448    48.869   0.275     28.303   0.225    50.44    0.174
depth of 5      27.614   0.226    49.336   0.43      28.22    0.737    49.72    0.934

The results are almost identical, with only a 3% difference in sys time
at a nesting depth of 1 and about 1% at a depth of 5, and I think the
results are rather noisy anyway. While the tests were running I also
observed the percentage of time spent in (dec|inc)_ulimit, and it was
around 0.06% - negligible. So I think this should be fine on the
performance side of things.
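
For anyone who wants to reproduce or sanity-check the numbers, the runs
can be approximated along the lines below. The stress-ng path and fork
parameters are the ones from the invocation above; chaining "unshare -r"
is just one way to build up the nesting depth, so treat this as a sketch
rather than the exact scripts I used.

STRESS=/home/projects/kernel-testing/stress-ng/stress-ng

# baseline: forks happen in the init_user_ns
for i in {1..5}; do time $STRESS -f 5 --fork-ops 250000 ; done

# depth of 1: one child user namespace (current uid mapped to 0 by unshare -r)
for i in {1..5}; do time unshare -r $STRESS -f 5 --fork-ops 250000 ; done

# depth of 5: chain five unshare -r calls to nest the user namespaces
for i in {1..5}; do
    time unshare -r unshare -r unshare -r unshare -r unshare -r \
        $STRESS -f 5 --fork-ops 250000
done

Each loop gives the 5 repetitions that the averages and standard
deviations in the table were computed from, with the "Real" and "Sys"
columns taken from the output of time.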
[SNIP]