On 07/21/2018 10:39 AM, Daniel Jordan wrote:
On 07/20/2018 04:19 AM, john terragon wrote:
On Friday, July 20, 2018, 2:03:48 AM GMT+2, bugzilla-daemon@xxxxxxxxxxxxxxxxxxx wrote:
>
>https://bugzilla.kernel.org/show_bug.cgi?id=200105
>
>--- Comment #42 from Andrew Morton (akpm@xxxxxxxxxxxxxxxxxxxx) ---
>Sorry, but nobody reads bugzilla. I tried to switch this discussion to an
>email thread for a reason!
>
>Please resend all this (useful) info in reply to the email thread which I
>created for this purpose.
I'll resend the last message and attachments. Anyone interested in the previous "episodes" can read
https://bugzilla.kernel.org/show_bug.cgi?id=200105
The summary is that John has put together a reliable reproducer for a problem he's seeing: under high memory usage, any of his desktop systems with SSDs hangs for around a minute, completely unresponsive, and swaps out 2-3x more memory than the system is allocating.
John's issue only happens with a LUKS-encrypted swap partition; unencrypted swap, or swap encrypted without LUKS, works fine.
In one test (out5.txt), where most system memory is taken by anon pages beforehand, the heavy direct reclaim that Michal noticed lasts for 24 seconds. During that time, on average and if I've crunched my numbers right, John's test program was allocating at 4MiB/s, the system overall (pgalloc_normal) was allocating at 235MiB/s, and the system was swapping out (pswpout) at 673MiB/s. pgalloc_normal and pswpout stay roughly the same each second, with no big swings.
Is the disparity between allocation and swapout rate expected?
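For anyone who wants to check the arithmetic behind rates like these, here is a minimal sketch. It assumes that pgalloc_normal and pswpout are cumulative page counts, that pages are 4 KiB, and that you have a counter value from the start and end of the window. The function names are made up for illustration and are not from any existing tool.

```python
PAGE_SIZE = 4096  # bytes; assumption: 4 KiB pages


def pages_to_mib_per_sec(start_count, end_count, seconds, page_size=PAGE_SIZE):
    """Average rate in MiB/s of a cumulative page counter over an interval."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    pages = end_count - start_count
    return pages * page_size / (1024 * 1024) / seconds


def swapout_to_alloc_ratio(pswpout_rate, pgalloc_rate):
    """How many times faster the system swaps out than it allocates."""
    if pgalloc_rate <= 0:
        raise ValueError("pgalloc_rate must be positive")
    return pswpout_rate / pgalloc_rate


if __name__ == "__main__":
    # Rates quoted in this mail (MiB/s), not re-derived from out5.txt.
    print(round(swapout_to_alloc_ratio(673, 235), 2))
```

With the two averages quoted above, the ratio comes out a bit under 3x, which matches the 2-3x in the summary.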
John ran perf during another test right before the last test program was started (this doesn't include the initial large allocation bringing the system close to swapping). The top five allocators (kmem:mm_page_alloc):
# Overhead  Pid:Command
# ........  .......................
#
    48.45%  2005:memeater    # the test program
    32.08%  73:kswapd0
     3.16%  1957:perf_4.17
     1.41%  1748:watch
     1.16%  2043:free
So it seems to be just reclaim activity, but why so much when the test program only allocates at 4MiB/s?
Should add that during the 24 seconds, reclaim efficiency for both kswapd and direct (pgsteal/pgscan) hovered around 1%, which seems low.
The 24 seconds cover =S 1530092789 to =S 1530092812 in out5.txt from bugzilla.
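The reclaim efficiency mentioned above (pgsteal/pgscan) can be computed from two vmstat snapshots roughly as sketched below. This assumes the counters are named pgscan_kswapd, pgsteal_kswapd, pgscan_direct and pgsteal_direct, as /proc/vmstat reports them on recent kernels, and that each snapshot is plain "name value" text. The helper names and the sample numbers are invented for illustration and are not taken from out5.txt.

```python
def parse_vmstat(text):
    """Parse 'name value' lines (the /proc/vmstat layout) into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def reclaim_efficiency(before, after, kind):
    """Pages reclaimed per page scanned between two snapshots.

    kind is 'kswapd' or 'direct'. Returns None if nothing was scanned.
    """
    scanned = after["pgscan_" + kind] - before["pgscan_" + kind]
    stolen = after["pgsteal_" + kind] - before["pgsteal_" + kind]
    if scanned == 0:
        return None
    return stolen / scanned


if __name__ == "__main__":
    # Made-up snapshots, only to show the calculation.
    before = parse_vmstat("pgscan_kswapd 1000\npgsteal_kswapd 500\n")
    after = parse_vmstat("pgscan_kswapd 101000\npgsteal_kswapd 1500\n")
    print(reclaim_efficiency(before, after, "kswapd"))
```

A result near 0.01 corresponds to the roughly 1% efficiency described above: about 100 pages scanned for each page actually reclaimed.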