The clone side contends with the exit side in a way which avoidably
exacerbates the problem: the exit side ends up waiting on locks held by
the clone side while itself holding tasklist_lock.

Whacking this for both add_device_randomness() and pid allocation gives
me a 15% speedup for thread creation/destruction in a 24-core VM. The
random patch alone is worth about 4%.

Nothing blew up with lockdep; only lightly tested so far.

Bench (plop into will-it-scale):

$ cat tests/threadspawn1.c
#include <pthread.h>
#include <assert.h>
#include <stddef.h>

char *testcase_description = "Thread creation and teardown";

static void *worker(void *arg)
{
        return (NULL);
}

void testcase(unsigned long long *iterations, unsigned long nr)
{
        pthread_t thread;
        int error;

        while (1) {
                error = pthread_create(&thread, NULL, worker, NULL);
                assert(error == 0);
                error = pthread_join(thread, NULL);
                assert(error == 0);
                (*iterations)++;
        }
}

v3:
- keep the procfs flush where it was; instead hoist get_pid() outside of
  the lock
- make detach_pid() et al accept an array argument of pids to populate
  (see the sketch at the end of this mail)
- sprinkle asserts
- drop irq trips around pidmap_lock
- move the tty unref outside of tasklist_lock

Mateusz Guzik (6):
  exit: perform add_device_randomness() without tasklist_lock
  exit: hoist get_pid() in release_task() outside of tasklist_lock
  exit: postpone tty_kref_put() until after tasklist_lock is dropped
  pid: sprinkle tasklist_lock asserts
  pid: perform free_pid() calls outside of tasklist_lock
  pid: drop irq disablement around pidmap_lock

 include/linux/pid.h |  7 ++--
 kernel/exit.c       | 45 +++++++++++++++----------
 kernel/pid.c        | 82 +++++++++++++++++++++++++--------------------
 kernel/sys.c        | 14 +++++---
 4 files changed, 86 insertions(+), 62 deletions(-)

-- 
2.43.0
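
To make the detach_pid()/free_pid() change referenced above more
concrete, here is a rough sketch of the intended pattern. The helper
name, the includes and the exact detach_pid() signature are guesses
based on the changelog bullet ("accept an array argument of pids to
populate"), not a copy of the diff: struct pid pointers are collected
into a caller-provided array while tasklist_lock is held and are only
freed once the lock is dropped, so the pid freeing work no longer nests
inside the tasklist_lock hold.

#include <linux/pid.h>
#include <linux/sched.h>
#include <linux/sched/signal.h>

/* Illustrative helper, not part of the series. */
static void example_unhash(struct task_struct *p)
{
        struct pid *pids[PIDTYPE_MAX] = { };
        enum pid_type type;

        write_lock_irq(&tasklist_lock);
        /*
         * detach_pid() stashes the detached struct pid in the
         * caller-provided array instead of freeing it under the lock.
         */
        detach_pid(pids, p, PIDTYPE_PID);
        detach_pid(pids, p, PIDTYPE_TGID);
        write_unlock_irq(&tasklist_lock);

        /* The actual frees happen outside of the contended lock. */
        for (type = PIDTYPE_PID; type < PIDTYPE_MAX; type++)
                if (pids[type])
                        free_pid(pids[type]);
}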