Re: [Bug 204165] New: 100% CPU usage in compact_zone_order

On Wed, Jul 17, 2019 at 10:00:18PM +0000, howaboutsynergy@xxxxxxxxxxxxxx wrote:
> tl;dr: patch seems to work, thank you very much!
> 

\o/

> ------- Original Message -------
> On Wednesday, July 17, 2019 7:53 PM, Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> wrote:
> 
> > Ok, great. From the trace, it was obvious that the scanner is making no
> > progress. I don't think zswap is involved as such but it may be making
> > it easier to trigger due to altering timing. At least, I see no reason
> > why zswap would materially affect the termination conditions.
>
> I don't know if it matters in this context, but I've been using the
> term `zswap` (somewhere else, I think) to wrongly refer to swap on zram
> (and even sometimes called it ext4, in this bug report too, without
> realizing at the time that ext4 is only used for /tmp and /var/tmp,
> which are ext4 in zram). In fact it isn't zswap that I have been using
> (even though I have CONFIG_ZSWAP=y in .config), but just CONFIG_ZRAM=y
> with CONFIG_SWAP=y (and probably a bunch of other options being needed
> too).

For other bugs, it would matter a lot. In this specific case, all that
was altered was the timing. With enough effort I was able to reproduce
the problem reliably within 30 minutes. With the patch, it ran without
hitting the problem for hours. I had tracing patches applied to log when
the specific problem occurred.

Without the patch, when the problem occurred, it hit 197009 times in a
tight loop, driving up CPU usage. With the patch, the same condition was
hit 4 times, and in each case the task exited immediately as expected,
so I think we're good.
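
If it helps to visualise why the unpatched kernel spins, the failure
mode can be modelled in plain userspace C. The sketch below is only an
illustration, not the kernel code and not the patch itself: it assumes
a scanner that polls for a fatal signal only on SWAP_CLUSTER_MAX-aligned
PFNs, and a caller that rewinds and retries a failed scan instead of
aborting. Every identifier in it is invented for the demo.

#include <stdbool.h>
#include <stdio.h>

#define SWAP_CLUSTER_MAX 32UL

static bool fatal_signal = true;	/* pretend the task was just killed */

/* Scan [low_pfn, end_pfn); returns 0 only on a hard abort. */
static unsigned long scan_block(unsigned long low_pfn, unsigned long end_pfn,
				bool abort_on_signal)
{
	for (; low_pfn < end_pfn; low_pfn++) {
		/* The signal is only polled on aligned PFNs. */
		if (!(low_pfn % SWAP_CLUSTER_MAX) && fatal_signal) {
			if (abort_on_signal)
				return 0;	/* patched: abort outright */
			break;			/* unpatched: bail out early */
		}
		/* the page at low_pfn would be isolated here */
	}
	return low_pfn;
}

int main(void)
{
	unsigned long start = 5;	/* deliberately unaligned start PFN */
	int hits = 0;

	/*
	 * Unpatched behaviour: the scan bails out early, the caller treats
	 * that as a failure and retries the same range, so the condition
	 * is hit over and over in a tight loop (capped here at the count
	 * seen in the trace; the real loop had no such cap).
	 */
	while (hits < 197009 &&
	       scan_block(start, start + 2 * SWAP_CLUSTER_MAX, false) != 0)
		hits++;
	printf("unpatched: condition hit %d times with no progress\n", hits);

	/* Patched behaviour: the abort propagates and the task can exit. */
	if (scan_block(start, start + 2 * SWAP_CLUSTER_MAX, true) == 0)
		printf("patched: aborted on the first scan, as expected\n");
	return 0;
}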

> > <SNIP> a proper abort. I think it ends up looping in compaction instead
> > of dying, without either aborting or progressing the scanner. It might
> > explain why stress-ng is hitting it, as it is probably sending fatal
> > signals on timeout (I didn't check the source).
> Ah, I didn't know there were multiple `stress` versions; here's what I used:
> 
> /usr/bin/stress is owned by stress 1.0.4-5
> 

Fortunately, they were all equivalent. The key was getting a task to
receive a fatal signal while compacting and while scanning a PFN that
was not aligned to SWAP_CLUSTER_MAX. That's tricky to hit, but your
test case was exactly what we needed.
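
To make the alignment detail concrete: the abort poll only fires when
the scanner's PFN is SWAP_CLUSTER_MAX-aligned, so a signal that lands
while it sits on an unaligned PFN cannot be noticed until the next
aligned one. A trivial standalone sketch of that gating (again an
illustration only; the starting PFN and names are made up):

#include <stdio.h>

#define SWAP_CLUSTER_MAX 32UL

int main(void)
{
	unsigned long low_pfn = 0x12345;	/* arbitrary unaligned start */
	unsigned long walked = 0;

	/* No abort poll happens until the PFN becomes aligned. */
	while (low_pfn % SWAP_CLUSTER_MAX) {
		low_pfn++;
		walked++;
	}
	printf("first chance to notice the signal: %lu pfns later, at %#lx\n",
	       walked, low_pfn);
	return 0;
}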

> > <SNIP>
> 
> Now the "problem" is that I can't tell if it would get stuck :D but it
> usually ends in no more than 17 sec:
> $ time stress -m 220 --vm-bytes 10000000000 --timeout 10
> stress: info: [7981] dispatching hogs: 0 cpu, 0 io, 220 vm, 0 hdd
> stress: FAIL: [7981] (415) <-- worker 8202 got signal 9
> stress: WARN: [7981] (417) now reaping child worker processes
> stress: FAIL: [7981] (415) <-- worker 8199 got signal 9
> stress: WARN: [7981] (417) now reaping child worker processes
> stress: FAIL: [7981] (451) failed run completed in 18s
> 

That's fine. Fortunately, my own debug tracing, added based on your
trace, gives me enough confidence.

> <SNIP>
>
> (Probably irrelevant.) Sometimes Xorg says it can't allocate any more
> memory, but the stack trace looks like it's inside some zram/i915 kernel
> stuff:
> 
> [ 1416.842931] [drm] Atomic update on pipe (A) took 188 us, max time under evasion is 100 us
> [ 1425.416979] Xorg: page allocation failure: order:0, mode:0x400d0(__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_RECLAIMABLE), nodemask=(null),cpuset=/,mems_allowed=0
> [ 1425.416984] CPU: 1 PID: 1024 Comm: Xorg Kdump: loaded Tainted: G     U            5.2.1-g527a3db363a3 #74
> [ 1425.416985] Hardware name: System manufacturer System Product Name/PRIME Z370-A, BIOS 2201 05/27/2019
> [ 1425.416986] Call Trace:

So this looks like the system is still under a lot of stress trying to
swap. It's unfortunate, but it's unrelated and relatively benign given
the load the system is under.

> <SNIP>
>
> But anyway, since last time I was able to trigger it with the normal
> (timeout 10) command on the third try, I've decided to keep trying that:
> after 170 more tries via `$ while true; do time stress -m 220 --vm-bytes 10000000000 --timeout 10; done`
> I saw no hangs or any runs taking more time than the usual 16-26 sec.
> So the patch must be working as intended. Thanks very much. Let me know
> if you want me to do anything else.
> 

Perfect. Thanks a million for your patience, testing and tracing (the
tracing pinpointed exactly where I needed to look -- of 3 potential
candidates, only one really made sense). I'll put a proper changelog on
this and send it out where it should get picked up for 5.3 and -stable.

-- 
Mel Gorman
SUSE Labs



