On Mon, Jun 30, 2014 at 08:54:34PM +0200, Hannes Frederic Sowa wrote:
> Hi,
>
> On Mon, Jun 30, 2014, at 20:15, Jesper Dangaard Brouer wrote:
> >
> > On Fri, 27 Jun 2014 22:12:52 -0700 ebiederm@xxxxxxxxxxxx (Eric W.
> > Biederman) wrote:
> > > Cong Wang <xiyou.wangcong@xxxxxxxxx> writes:
> > > > On Thu, Jun 26, 2014 at 3:44 PM, David Miller <davem@xxxxxxxxxxxxx> wrote:
> > > >>
> > [...]
> > > >
> > > > Hmm, I did overlook the potential DOS problem. But hold on, isn't
> > > > IP fragments have the same problem? The fragment queues are per
> > > > netns, and the thresh is per netns as well, we will eventually have
> > > > memory pressure as well.
> > >
> > > Interesting. It does look like ip fragments are susceptible that way.
> >
> > For IP fragments we have per netns mem-limit and LRU-list, but all
> > netns share the same hash table, which have its own DoS potential.
> >
> > And argh! - we have a hardcoded INETFRAGS_MAXDEPTH=128, which can be
> > used for (slow) DoS of IP frags if enough netns are created.
> >
> > https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/tree/net/ipv4/inet_fragment.c#n344
> >
> > Introduced by commit 5a3da1fe9 ("inet: limit length of fragment queue
> > hash table bucket lists").
>
> Sure, but we need that, otherwise even a single netns can get exploited
> up to a remotely triggered lockup of the box - e.g.
> https://gist.github.com/hannes/5116331 - on some smaller machines.
> INETFRAGS_MAXDEPTH is a property of the hashtable and walking a chain
> with more than 128 elements is just crazy.
>
> Also, for me making this user configurable doesn't seem to provide a
> benefit.
>
> Sure, it does introduce some kind of unfairness between the namespaces,
> but so does all code which overcommits shared resources.
>
> Bye,
> Hannes

Hello,

As a way to test this issue and show how easy it is to DoS a machine by
filling the IPv6 neighborhood table, I've written this small example:
https://dl.stgraber.org/ipv6-dos.c

This can be run as a nobody user on any kernel with user namespaces
enabled.

What it does is unshare a new user namespace and then a new network
namespace inside it. It then creates a veth pair, assigns 4000 IPv6
addresses to the first interface of the pair, then forks, unshares
another network namespace, moves the second interface of the pair in
there and assigns another 4000 IPv6 addresses to it.

At that point, you have two interfaces, one in the first network
namespace and the second in the other network namespace, each with
4000 IPv6 addresses.

The tool then starts a simple TCP server in one of the namespaces and,
in the other, opens 4000 connections, each using a different source and
destination address.

The result is 4000 open connections, in theory requiring 8000 IPv6
neighborhood table entries.

Once the tool is done attempting to open that many connections, any
attempt to connect to a host in a directly connected IPv6 subnet (so
requiring a new neighborhood table entry) will fail with EINVAL.

While the global limit can indeed be bumped, so can the number of
connections established by this tool. I don't believe a global limit
influenced by the number of namespaces would help here either, since
whatever the resulting global limit ends up being, the tool can be
changed to establish $global_limit+1 connections.

I'm mostly a userspace guy and don't really know the details of the
kernel implementation, but considering that creating devices and adding
addresses is now possible for any unprivileged user, having the limit
on neighborhood entries be per-interface rather than global would make
sense to me.

Hopefully this helped clarify the problem we've been seeing lately.

-- 
Stéphane Graber
Ubuntu developer
http://www.canonical.com
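
The entry count described above can be checked with a minimal sketch.
This is not the ipv6-dos.c tool and it touches no namespaces or
interfaces; it only models the addressing and the arithmetic. The
prefix, the host offsets and the function names are assumptions made
for illustration, as is the idea that every connection costs exactly
one neighborhood entry on each side.

```python
import ipaddress

# Assumed values: the message does not say which addresses the tool uses,
# so this prefix and the host numbering are purely illustrative.
PREFIX = ipaddress.IPv6Network("fd00:dead:beef::/64")
ADDRS_PER_SIDE = 4000  # from the message: 4000 addresses on each veth end


def address_plan(count=ADDRS_PER_SIDE, prefix=PREFIX):
    """Return (server_addrs, client_addrs): two disjoint lists of `count` addresses."""
    base = int(prefix.network_address)
    server = [ipaddress.IPv6Address(base + 1 + i) for i in range(count)]
    client = [ipaddress.IPv6Address(base + 1 + count + i) for i in range(count)]
    return server, client


def neighbour_entries_needed(server, client):
    """Count entries when connection i goes from client[i] to server[i].

    Assumption: each namespace needs one entry per distinct peer address
    it talks to, so every connection costs one entry on each side.
    """
    connections = list(zip(client, server))
    seen_by_client_ns = {dst for _, dst in connections}
    seen_by_server_ns = {src for src, _ in connections}
    return len(seen_by_client_ns) + len(seen_by_server_ns)


def connections_to_exceed(limit):
    """Smallest number of connections whose 2-per-connection cost exceeds `limit`."""
    return limit // 2 + 1


if __name__ == "__main__":
    server, client = address_plan()
    print(neighbour_entries_needed(server, client))  # prints 8000
```

Under these assumptions, connections_to_exceed() returns a usable
connection count for any limit passed to it, which is why raising the
global limit only moves the threshold.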
_______________________________________________
Containers mailing list
Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx
https://lists.linuxfoundation.org/mailman/listinfo/containers