Regression from "mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy()" in stable kernels

Hi,

I'd like to report a very severe performance regression due to

    mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy() in stable kernels

in v4.19.88. I believe the change has been included since v4.19.67. It is
also present in all the other LTS kernels, except 3.16.

So today I switched an x86_64 production server from v5.1.21 to
v4.19.88, because we kept hitting runaway kcompactd and kswapd.
There was also a significant increase in memory usage compared to
v5.1.5; I'm still bisecting that on another production server.

The service we run is one of the largest forums in Taiwan [1].
It is a terminal-based bulletin board system reachable over telnet,
SSH, or a custom WebSocket bridge. The service itself uses the
old-school one-process-per-user design, which means a burst of
forks whenever there is a user spike or a wave of reconnections.

(Reconnections happen because a lot of people use mobile apps that
 wrap the service, but they get disconnected as soon as they are
 backgrounded.)

With v4.19.88 and CONFIG_VMAP_STACK=y we saw heavy contention on
pgd_lock in the process fork path:

Samples: 937K of event 'cycles:ppp', Event count (approx.): 499112453614
  Children      Self  Command          Shared Object                 Symbol
+   31.15%     0.03%  mbbsd            [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
+   31.12%     0.02%  mbbsd            [kernel.kallsyms]  [k] do_syscall_64
+   28.12%     0.42%  mbbsd            [kernel.kallsyms]  [k] do_raw_spin_lock
-   27.70%    27.62%  mbbsd            [kernel.kallsyms]  [k] queued_spin_lock_slowpath
   - 18.73% __libc_fork
      - 18.33% entry_SYSCALL_64_after_hwframe
           do_syscall_64
         - _do_fork
            - 18.33% copy_process.part.64
               - 11.00% __vmalloc_node_range
                  - 10.93% sync_global_pgds_l4
                       do_raw_spin_lock
                       queued_spin_lock_slowpath
               - 7.27% mm_init.isra.59
                    pgd_alloc
                    do_raw_spin_lock
                    queued_spin_lock_slowpath
   - 8.68% 0x41fd89415541f689
      - __libc_start_main
         + 7.49% main
         + 0.90% main

This hit us pretty hard, with the service dropping below one-third
of its original capacity.

With CONFIG_VMAP_STACK=n the fork path skips this, but other vmalloc
users are still affected. Another such area is the tty layer, which
also causes problems for us since there can be as many as 15k users
over SSH, constantly coming and going; we saw a lot of hung sshd
processes as well. Unfortunately I don't have any perf reports or
kernel logs to go with it.

Now I understand that there is already a fix in -next:

    https://lore.kernel.org/patchwork/patch/1137341/

However, the code has changed a lot in mainline and I'm not sure how
to backport it. For now I have simply reverted the commit by hand by
removing the offending code. It seems to work OK, and based on the
commit logs I believe this is safe for us, as we're not running
x86-32 or PTI.
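
For reference, the hand revert amounts to dropping the sync call that
the commit added to __purge_vmap_area_lazy() (sketched from memory of
the v4.19 tree, so the exact context lines may differ):

```diff
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
-	/*
-	 * First make sure the mappings are removed from all page-tables
-	 * before they are freed.
-	 */
-	vmalloc_sync_all();
```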


Regards
ChenYu

[1] https://en.wikipedia.org/wiki/PTT_Bulletin_Board_System


