[LSF/MM/BPF TOPIC] kernel multithreading with padata

padata has been undergoing some surgery over the last year[0] and now seems
ready for another enhancement: splitting up and multithreading CPU-intensive
kernel work.

Quoting from an earlier series[1], the problem I'm trying to solve is

  A single CPU can spend an excessive amount of time in the kernel operating
  on large amounts of data.  Often these situations arise during initialization-
  and destruction-related tasks, where the data involved scales with system
  size.  These long-running jobs can slow startup and shutdown of applications
  and the system itself while extra CPUs sit idle.
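
The rough shape this takes is a caller-supplied function that handles one
chunk of a linear range, plus the range bounds and a thread cap that padata
uses to divide the work among helper threads.  The struct and function names
below are only an illustrative sketch for discussion, not necessarily what
the WIP branch (linked at the end) actually uses:

    /*
     * Illustrative sketch only -- these names and fields are assumptions
     * made for this writeup, not a definitive interface.
     */
    struct padata_mt_job {
            /* Operate on the units in [start, end); runs in each helper. */
            void (*thread_fn)(unsigned long start, unsigned long end,
                              void *arg);
            void *fn_arg;              /* caller context for thread_fn    */
            unsigned long start;       /* first unit of work (e.g. a pfn) */
            unsigned long size;        /* total units in the job          */
            unsigned long min_chunk;   /* smallest chunk worth a thread   */
            int max_threads;           /* cap on helpers for this job     */
    };

    /* Split the job into chunks, run them on extra CPUs, and wait. */
    void padata_do_multithreaded(struct padata_mt_job *job);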

Here are the current consumers:

 - struct page init (boot, hotplug, pmem)
 - VFIO page pinning (kvm guest init)
 - fallocating a hugetlb file (database shared memory init)

On a large-memory server, DRAM page init is ~23% of kernel boot (3.5s of 15.2s),
and it takes over a minute to start a VFIO-enabled kvm guest or to fallocate a
hugetlb file when either occupies a significant fraction of memory.  Multithreading
this work yields 7-20x speedups and is currently improving uptime on our
production kernels.
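
To make that concrete, here's a hypothetical consumer built on the sketch
above: handing a zone's pfn range to padata for struct page init.
page_init_chunk() and multithread_page_init() are made-up names, and the
loop glosses over pfn holes and other details of the real deferred init path.

    /* Hypothetical per-chunk worker: init struct pages in [start_pfn, end_pfn). */
    static void page_init_chunk(unsigned long start_pfn, unsigned long end_pfn,
                                void *arg)
    {
            struct zone *zone = arg;
            unsigned long pfn;

            for (pfn = start_pfn; pfn < end_pfn; pfn++)
                    __init_single_page(pfn_to_page(pfn), pfn, zone_idx(zone),
                                       zone_to_nid(zone));
    }

    static void multithread_page_init(struct zone *zone, unsigned long start_pfn,
                                      unsigned long nr_pages, int max_threads)
    {
            struct padata_mt_job job = {
                    .thread_fn   = page_init_chunk,
                    .fn_arg      = zone,
                    .start       = start_pfn,
                    .size        = nr_pages,
                    .min_chunk   = PAGES_PER_SECTION, /* don't split too finely */
                    .max_threads = max_threads,       /* policy is an open question */
            };

            padata_do_multithreaded(&job);
    }

How max_threads gets picked, and how the helpers are charged, are exactly the
capping and resource-control questions below.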

Future areas include munmap/exit, umount, and __ib_umem_release.  Some of these
need coarse locks broken up for multithreading (zone->lock, lru_lock).

Positive outcomes for the session would be...

 - Finding a strategy for capping the maximum number of threads in a job.

 - Agreeing on a way for the job's threads to respect resource controls.

   In the past few weeks I've been thinking about whether remote charging
   in the CPU controller is feasible (RFD to come).  I'm also considering
   creating workqueue workers directly in cgroup-specific pools instead, and
   I've previously proposed migrating workers in and out of cgroups[2].
   There's also memory policy and sched_setaffinity() to think about.

 - Checking the overall design of this thing with the mm community, given that
   current users are all mm-related.

 - Getting advice from others (hallway track) on why some pmem devices
   perform better than others under multithreading.

This work-in-progress branch shows what it looks like now.

    git://oss.oracle.com/git/linux-dmjordan.git padata-mt-wip-v0.2
    https://oss.oracle.com/git/gitweb.cgi?p=linux-dmjordan.git;a=shortlog;h=refs/heads/padata-mt-wip-v0.2

[0] https://lore.kernel.org/linux-crypto/?q=s%3Apadata+d%3A20190212..20200212
[1] https://lore.kernel.org/lkml/20181105165558.11698-1-daniel.m.jordan@xxxxxxxxxx/
[2] https://lore.kernel.org/lkml/20190605133650.28545-1-daniel.m.jordan@xxxxxxxxxx/



