Re: [PATCH] mm/vmscan: don't scan adjust too much if current is not kswapd

Hongchen Zhang <zhanghongchen@xxxxxxxxxxx> · Fri, 16 Sep 2022 18:19:55 +0800

Hi Andrew and Matthew,

On 2022/9/16 pm 4:40, Matthew Wilcox wrote:
On Fri, Sep 16, 2022 at 08:57:50AM +0800, Hongchen Zhang wrote:
Hi Andrew ,

On 2022/9/15 pm 5:00, Matthew Wilcox wrote:
On Thu, Sep 15, 2022 at 04:02:41PM +0800, Hongchen Zhang wrote:
Hi Matthew,
On 2022/9/15 pm 3:28, Matthew Wilcox wrote:
On Thu, Sep 15, 2022 at 09:19:48AM +0800, Hongchen Zhang wrote:
[ 3748.453561] INFO: task float_bessel:77920 blocked for more than 120
seconds.
[ 3748.460839]       Not tainted 5.15.0-46-generic #49-Ubuntu
[ 3748.466490] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
this message.
[ 3748.474618] task:float_bessel    state:D stack:    0 pid:77920 ppid:
77327 flags:0x00004002
[ 3748.483358] Call Trace:
[ 3748.485964]  <TASK>
[ 3748.488150]  __schedule+0x23d/0x590
[ 3748.491804]  schedule+0x4e/0xc0
[ 3748.495038]  rwsem_down_read_slowpath+0x336/0x390
[ 3748.499886]  ? copy_user_enhanced_fast_string+0xe/0x40
[ 3748.505181]  down_read+0x43/0xa0
[ 3748.508518]  do_user_addr_fault+0x41c/0x670
[ 3748.512799]  exc_page_fault+0x77/0x170
[ 3748.516673]  asm_exc_page_fault+0x26/0x30
[ 3748.520824] RIP: 0010:copy_user_enhanced_fast_string+0xe/0x40
[ 3748.526764] Code: 89 d1 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 31 c0 0f
01 ca c3 cc cc cc cc 0f 1f 00 0f 01 cb 83 fa 40 0f 82 70 ff ff ff 89 d1 <f3>
a4 31 c0 0f 01 ca c3 cc cc cc cc 66 08
[ 3748.546120] RSP: 0018:ffffaa9248fffb90 EFLAGS: 00050206
[ 3748.551495] RAX: 00007f99faa1a010 RBX: ffffaa9248fffd88 RCX:
0000000000000010
[ 3748.558828] RDX: 0000000000001000 RSI: ffff9db397ab8ff0 RDI:
00007f99faa1a000
[ 3748.566160] RBP: ffffaa9248fffbf0 R08: ffffcc2fc2965d80 R09:
0000000000000014
[ 3748.573492] R10: 0000000000000000 R11: 0000000000000014 R12:
0000000000001000
[ 3748.580858] R13: 0000000000001000 R14: 0000000000000000 R15:
ffffaa9248fffd98
[ 3748.588196]  ? copy_page_to_iter+0x10e/0x400
[ 3748.592614]  filemap_read+0x174/0x3e0

Interesting; it wasn't the process itself which triggered the page
fault; the process called read() and the kernel took the page fault to
satisfy the read() call.

[ 3748.596354]  ? ima_file_check+0x6a/0xa0
[ 3748.600301]  generic_file_read_iter+0xe5/0x150
[ 3748.604884]  ext4_file_read_iter+0x5b/0x190
[ 3748.609164]  ? aa_file_perm+0x102/0x250
[ 3748.613125]  new_sync_read+0x10d/0x1a0
[ 3748.617009]  vfs_read+0x103/0x1a0
[ 3748.620423]  ksys_read+0x67/0xf0
[ 3748.623743]  __x64_sys_read+0x19/0x20
[ 3748.627511]  do_syscall_64+0x59/0xc0
[ 3748.631203]  ? syscall_exit_to_user_mode+0x27/0x50
[ 3748.636144]  ? do_syscall_64+0x69/0xc0
[ 3748.639992]  ? exit_to_user_mode_prepare+0x96/0xb0
[ 3748.644931]  ? irqentry_exit_to_user_mode+0x9/0x20
[ 3748.649872]  ? irqentry_exit+0x1d/0x30
[ 3748.653737]  ? exc_page_fault+0x89/0x170
[ 3748.657795]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
[ 3748.663030] RIP: 0033:0x7f9a852989cc
[ 3748.666713] RSP: 002b:00007f9a8497dc90 EFLAGS: 00000246 ORIG_RAX:
0000000000000000
[ 3748.674487] RAX: ffffffffffffffda RBX: 00007f9a8497f5c0 RCX:
00007f9a852989cc
[ 3748.681817] RDX: 0000000000027100 RSI: 00007f99faa18010 RDI:
0000000000000061
[ 3748.689150] RBP: 00007f9a8497dd60 R08: 0000000000000000 R09:
00007f99faa18010
[ 3748.696493] R10: 0000000000000000 R11: 0000000000000246 R12:
00007f99faa18010
[ 3748.703841] R13: 00005605e11c406f R14: 0000000000000001 R15:
0000000000027100

ORIG_RAX is 0, which matches sys_read.
RDI is file descriptor 0x61
RSI is plausibly a userspace pointer, 0x7f99faa18010
RDX is the length, 0x27100 or 160kB.

That all seems reasonable.

What I really want to know is who is _holding_ the lock.  We stash
a pointer to the task_struct in 'owner', so we could clearly find this
out in the 'blocked for too long' report, and print their stack trace.

As described in the comment for __rwsem_set_reader_owned,it is hard to track
read owners.So we could not clearly find out who blocked the process,it was
caused by multiple tasks.

Readers don't block readers.  You have a reader here, so it's being
blocked by a writer.  And that writer's task_struct is stashed in
rwsem->owner.  It would be nice if we dumped that information
automatically ... but we don't do that today.  Perhaps you could
grab that information from a crash dump if you have one.

You must have done something like this already in order to deduce that
it was the direct reclaim path that was the problem?

The method we used is to track the direct reclaim using the
trace_mm_vmscan_direct_reclaim_{begin,end} interface.When the problem
occurred,we could get a very large "nr_reclaimed" which is not a desirable
value for process except kswapd.

I disagree.  If a process needs to allocate memory then it should be
paying the cost of reclaiming that memory itself.  kswapd is a last
resort to reclaim memory when we have a workload (eg a network router)
that does its memory allocation primarily in interrupt context.

What's your opinion about this scan adjust issue? Is there a better way to
fix this issue?

Yes, but we need you to gather more information about what's causing
the issue before we can suggest what that is.

I think the following scenery triggers the scan adjust issue:
In function shrink_lruvec, we call get_scan_count and get the following 
values:
targets[LRU_INACTIVE_ANON]=50000
targets[LRU_ACTIVE_ANON]=50000
targets[LRU_INACTIVE_FILE]=128
targets[LRU_ACTIVE_FILE]=129

After the first scan, we get more than nr_to_reclaim pages, but the 
percentage of scanning nr[LRU_INACTIVE_FILE+LRU_ACTIVE_FILE] is 256/257, 

Then when we scan adjust, we must scan(possibly reclaim) 
256*(50000+50000)/257-256=99354 pages, which is too large and would 
waste too many time.
If it is not kswapd, it is unacceptable to reclaim so many pages.
So we should limit the number of pages of scan adjust.

Thanks
Hongchen Zhang