Outside of that, it still raises the question of whether ucounts can or
should be used to cap something like root in a namespace creating tons
of veths; a sketch of what that could look like follows.
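
To make that concrete, here is a minimal sketch of what such a cap
could look like, assuming a hypothetical UCOUNT_VETH entry in enum
ucount_type and a hypothetical priv->ucounts field; none of this exists
today, it just mirrors the pattern inc_ucount()/dec_ucount() users like
UCOUNT_INOTIFY_INSTANCES already follow:

struct veth_sketch_priv {
	struct ucounts *ucounts;	/* hypothetical field */
};

/* Charge one veth against the netns owner's user namespace; fails once
 * the (hypothetical) UCOUNT_VETH limit is hit.
 */
static int veth_charge_ucount(struct net *net, struct veth_sketch_priv *priv)
{
	priv->ucounts = inc_ucount(net->user_ns, current_euid(), UCOUNT_VETH);
	if (!priv->ucounts)
		return -ENOSPC;
	return 0;
}

static void veth_uncharge_ucount(struct veth_sketch_priv *priv)
{
	if (priv->ucounts)
		dec_ucount(priv->ucounts, UCOUNT_VETH);
}

The charge would sit in the veth newlink path and the uncharge in its
destructor, with the limit exposed through the usual ucounts sysctls.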

> I am answering
> the question of how I understand limits to work.

It does!

> Luis does this explanation of how limits work help?

Yup thanks!

> >> From admin's perspective it is preferred to have minimal
> >> knobs to set and if these objects are charged to memcg then the memcg
> >> limits would limit them. There was a similar situation for inotify
> >> instances where fs sysctl inotify/max_user_instances already limits the
> >> inotify instances but we memcg charged them to not worry about setting
> >> such limits. See ac7b79fd190b ("inotify, memcg: account inotify
> >> instances to kmemcg").
> >
> > Yes but we want sensible defaults out of the box. What those should
> > be is, IMHO, work which still needs to be figured out.
> >
> > IMHO today's ulimits are a bit over the top. This is slightly off
> > topic, but for instance play with:
> >
> > git clone https://github.com/ColinIanKing/stress-ng
> > cd stress-ng
> > make -j 8
> > echo 0 > /proc/sys/vm/oom_dump_tasks
> > i=1; while true; do echo "RUNNING TEST $i"; ./stress-ng --unshare 8192 --unshare-ops 10000;  sleep 1; let i=$i+1; done
> >
> > If you see:
> >
> > [  217.798124] cgroup: fork rejected by pids controller in
> > /user.slice/user-1000.slice/session-1.scope
> >
> > Edit /usr/lib/systemd/system/user-.slice.d/10-defaults.conf to be:
> >
> > [Slice]
> > TasksMax=MAX_TASKS|infinity
> >
> > Even though we have max_threads set to 61343, ulimits have a
> > different limit set, and what this means is that the above can easily
> > end up creating over 1048576 (17 times max_threads) threads, all
> > eagerly doing nothing but exit, essentially allowing a sort of fork
> > bomb on exit. Your system may or may not fall to its knees.
> 
> What max_threads are you talking about here?

Sorry for not being clear: in the kernel this is exposed as max_threads,
which is initialized in kernel/fork.c:

root@linus-blktests-block ~ # cat /proc/sys/kernel/threads-max
62157
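
For reference, this is roughly how fork_init() computes it, paraphrased
from kernel/fork.c's set_max_threads() (check your tree for the exact
version); the idea is that thread structures may only consume about an
eighth of available memory:

static void set_max_threads(unsigned int max_threads_suggested)
{
	u64 threads;
	unsigned long nr_pages = totalram_pages();

	/* Guard against nr_pages * PAGE_SIZE overflowing 64 bits. */
	if (fls64(nr_pages) + fls64(PAGE_SIZE) > 64)
		threads = MAX_THREADS;
	else
		threads = div64_u64((u64) nr_pages * (u64) PAGE_SIZE,
				    (u64) THREAD_SIZE * 8UL);

	if (threads > max_threads_suggested)
		threads = max_threads_suggested;

	max_threads = clamp_t(u64, threads, MIN_THREADS, MAX_THREADS);
}

So the 62157 above is simply a function of how much RAM this box has.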

> The global max_threads
> exposed in /proc/sys/kernel/threads-max?  I don't see how you can get
> around that. 

Yeah I was perplexed and I don't think it's just me.

> Especially since the count is not decremented until the
> process is reaped.
>
> Or is this the pids controller having a low limit and
> /proc/sys/kernel/threads-max having a higher limit?

Not sure; I used default Debian testing with the above change to
/usr/lib/systemd/system/user-.slice.d/10-defaults.conf, setting
TasksMax=MAX_TASKS|infinity
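
If you want to see which limit is actually biting, the pids
controller's cap for the session can be read straight out of cgroupfs
(path inferred from the fork-rejected message above, assuming cgroup v2
and uid 1000):

cat /sys/fs/cgroup/user.slice/user-1000.slice/session-1.scope/pids.max

systemd's TasksMax= setting is what lands in that pids.max file, while
/proc/sys/kernel/threads-max remains a separate, global limit.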

> I really have not looked at this pids controller.
> 
> So I am not certain I understand your example here but I hope I have
> answered your question.

During experimentation with the above stress-ng test case, I saw tons
of threads just waiting to exit, so I tried throttling them with this
hack:

diff --git a/kernel/exit.c b/kernel/exit.c
index 80c4a67d2770..653ca7ebfb58 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -730,11 +730,24 @@ static void check_stack_usage(void)
 static inline void check_stack_usage(void) {}
 #endif
 
+/* Slightly more than twice max_threads */
+#define MAX_EXIT_CONCURRENT (1<<17)
+static atomic_t exit_concurrent_max = ATOMIC_INIT(MAX_EXIT_CONCURRENT);
+static DECLARE_WAIT_QUEUE_HEAD(exit_wq);
+
 void __noreturn do_exit(long code)
 {
 	struct task_struct *tsk = current;
 	int group_dead;
 
+	if (atomic_dec_if_positive(&exit_concurrent_max) < 0) {
+		pr_warn_ratelimited("exit: exit_concurrent_max (%u) close to 0 (max: %u), throttling...\n",
+				    atomic_read(&exit_concurrent_max),
+				    MAX_EXIT_CONCURRENT);
+		wait_event(exit_wq,
+			   atomic_dec_if_positive(&exit_concurrent_max) >= 0);
+	}
+
 	/*
 	 * We can get here from a kernel oops, sometimes with preemption off.
 	 * Start by checking for critical errors.
@@ -881,6 +894,10 @@ void __noreturn do_exit(long code)
 
 	lockdep_free_task(tsk);
+
+	/* do_task_dead() never returns, so release our slot before it */
+	atomic_inc(&exit_concurrent_max);
+	wake_up(&exit_wq);
 	do_task_dead();
 }
 EXPORT_SYMBOL_GPL(do_exit);
 
diff --git a/kernel/ucount.c b/kernel/ucount.c
index 4f5613dac227..980ffaba1ac5 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -238,6 +238,8 @@ struct ucounts *inc_ucount(struct user_namespace *ns, kuid_t uid,
 		long max;
 		tns = iter->ns;
 		max = READ_ONCE(tns->ucount_max[type]);
+		if (atomic_long_read(&iter->ucount[type]) > max/16)
+			cond_resched();
 		if (!atomic_long_inc_below(&iter->ucount[type], max))
 			goto fail;
 	}
-- 
2.33.0
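
For context on the ucount hunk above: inc_ucount() walks the namespace
hierarchy and bumps each level's counter via atomic_long_inc_below(),
which as I understand it (paraphrased from kernel/ucount.c, check your
tree for the authoritative version) is a small cmpxchg loop:

static inline bool atomic_long_inc_below(atomic_long_t *v, int u)
{
	long c, old;

	c = atomic_long_read(v);
	for (;;) {
		/* Refuse the increment once the limit is reached. */
		if (unlikely(c >= u))
			return false;
		old = atomic_long_cmpxchg(v, c, c + 1);
		if (likely(old == c))
			return true;
		c = old;
	}
}

Since that loop never blocks, the cond_resched() in the hack above is
the only point where a task hammering inc_ucount() yields the CPU.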

In my experimentation I saw that the above can easily trigger 131072
concurrent exits waiting (MAX_EXIT_CONCURRENT above, slightly more than
twice /proc/sys/kernel/threads-max), and at that point no OOM happens.
But I also saw us hit even 1048576; anything above 131072 starts causing
tons of issues (depending on the kernel) like OOMs, or it becomes hard
to bail out of the above shell loop.

  Luis



