On Thu, Dec 12, 2019 at 07:31:54PM +0800, Chen-Yu Tsai wrote:
> On Thu, Dec 12, 2019 at 7:19 PM Greg Kroah-Hartman
> <gregkh@xxxxxxxxxxxxxxxxxxx> wrote:
> >
> > On Thu, Dec 12, 2019 at 06:54:12PM +0800, Chen-Yu Tsai wrote:
> > > Hi,
> > >
> > > I'd like to report a very severe performance regression in stable
> > > kernels due to
> > >
> > >     mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy()
> > >
> > > which we hit on v4.19.88. I believe it has been included since
> > > v4.19.67, and it is also in all the other LTS kernels, except 3.16.
> > >
> > > So today I switched an x86_64 production server from v5.1.21 to
> > > v4.19.88, because we kept hitting runaway kcompactd and kswapd.
> > > Plus there was a significant increase in memory usage compared to
> > > v5.1.5. I'm still bisecting that on another production server.
> > >
> > > The service we run is one of the largest forums in Taiwan [1].
> > > It is a terminal-based bulletin board system running over telnet,
> > > SSH or a custom WebSocket bridge. The service itself is the
> > > one-process-per-user type of design from the old days. This
> > > means a lot of forks when there are user spikes or reconnections.
> > >
> > > (Reconnections happen because a lot of people use mobile apps that
> > > wrap the service, but they get disconnected as soon as they are
> > > backgrounded.)
> > >
> > > With v4.19.88 we saw a lot of contention on pgd_lock in the process
> > > fork path with CONFIG_VMAP_STACK=y:
> > >
> > > Samples: 937K of event 'cycles:ppp', Event count (approx.): 499112453614
> > >   Children      Self  Command  Shared Object      Symbol
> > > +   31.15%     0.03%  mbbsd    [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
> > > +   31.12%     0.02%  mbbsd    [kernel.kallsyms]  [k] do_syscall_64
> > > +   28.12%     0.42%  mbbsd    [kernel.kallsyms]  [k] do_raw_spin_lock
> > > -   27.70%    27.62%  mbbsd    [kernel.kallsyms]  [k] queued_spin_lock_slowpath
> > >    - 18.73% __libc_fork
> > >       - 18.33% entry_SYSCALL_64_after_hwframe
> > >            do_syscall_64
> > >          - _do_fork
> > >             - 18.33% copy_process.part.64
> > >                - 11.00% __vmalloc_node_range
> > >                   - 10.93% sync_global_pgds_l4
> > >                        do_raw_spin_lock
> > >                        queued_spin_lock_slowpath
> > >                - 7.27% mm_init.isra.59
> > >                     pgd_alloc
> > >                     do_raw_spin_lock
> > >                     queued_spin_lock_slowpath
> > >    - 8.68% 0x41fd89415541f689
> > >       - __libc_start_main
> > >          + 7.49% main
> > >          + 0.90% main
> > >
> > > This hit us pretty hard, with the service dropping below one-third
> > > of its original capacity.
> > >
> > > With CONFIG_VMAP_STACK=n, the fork code path skips this, but other
> > > vmalloc users are still affected. One other area is the tty layer.
> > > This also causes problems for us, since there can be as many as 15k
> > > users over SSH, some coming and going. So we got a lot of hung sshd
> > > processes as well. Unfortunately I don't have any perf reports or
> > > kernel logs to go with that.
> > >
> > > Now I understand that there is already a fix in -next:
> > >
> > >     https://lore.kernel.org/patchwork/patch/1137341/
> > >
> > > However, the code has changed a lot in mainline and I'm not sure how
> > > to backport it. For now I just reverted the commit by hand by
> > > removing the offending code. It seems to work OK, and based on the
> > > commit logs I guess it's safe to do so, as we're not running X86-32
> > > or PTI.
> >
> > The above commit should resolve the issue for you; can you try it out
> > on 5.4? And is there any reason you have to stick with the old 4.19
> > kernel?
>
> We typically run new kernels on the other server (the one I'm currently
> doing git bisect on) for a couple weeks before running it on our main
> server. That one doesn't see nearly as much load though. Also, because
> of the increased memory usage I was seeing in 5.1.21, I wasn't
> particularly comfortable going directly to 5.4.
>
> I suppose the reason for being overly cautious is that the server is a
> pain to reboot. The service is monolithic, running on just the one server.
> And any significant downtime _always_ hits the local newspapers. Combined
> with the upcoming election, conspiracy theories start flying around. :(
> Now that it looks stable, we probably won't be testing anything new until
> mid-January.

Fair enough, good luck!

greg k-h
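
For anyone who wants to exercise the code path described in the report above: the contention is on the fork path itself, so a plain fork-storm microbenchmark on a CONFIG_VMAP_STACK=y kernel should be enough to make queued_spin_lock_slowpath show up under copy_process in a profile. The sketch below is illustrative only and is not from the original thread; the burst size and iteration count are arbitrary.

/*
 * Illustrative fork-stress sketch (not part of the original report):
 * forks and reaps bursts of short-lived children so that, on a
 * CONFIG_VMAP_STACK=y kernel, copy_process() keeps allocating
 * vmalloc'ed stacks and taking the pgd_lock path shown in the perf
 * output above. The burst size and iteration count are arbitrary.
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	for (int iter = 0; iter < 1000; iter++) {
		/* burst of forks, roughly what a reconnection spike does */
		for (int i = 0; i < 256; i++) {
			pid_t pid = fork();

			if (pid < 0) {
				perror("fork");
				exit(EXIT_FAILURE);
			}
			if (pid == 0)
				_exit(0);	/* child exits immediately */
		}
		/* reap the whole burst before starting the next one */
		while (wait(NULL) > 0)
			;
	}
	return 0;
}

Building it with something like "gcc -O2 -o fork-storm fork-storm.c" and running a few instances in parallel while recording with "perf record -a -g" should produce a profile of roughly the shape quoted above on an affected kernel.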