On Mon, Dec 21, 2020 at 08:33:44PM +0000, Matthew Wilcox wrote:
> On Mon, Dec 21, 2020 at 11:56:36AM -0800, Hugh Dickins wrote:
> > On Mon, 23 Nov 2020, Andrew Morton wrote:
> > > On Fri, 20 Nov 2020 17:55:22 -0800 syzbot <syzbot+e5a33e700b1dd0da20a2@xxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
> > >
> > > > Hello,
> > > >
> > > > syzbot found the following issue on:
> > > >
> > > > HEAD commit: 03430750 Add linux-next specific files for 20201116
> > > > git tree: linux-next
> > > > console output: https://syzkaller.appspot.com/x/log.txt?x=13f80e5e500000
> > > > kernel config: https://syzkaller.appspot.com/x/.config?x=a1c4c3f27041fdb8
> > > > dashboard link: https://syzkaller.appspot.com/bug?extid=e5a33e700b1dd0da20a2
> > > > compiler: gcc (GCC) 10.1.0-syz 20200507
> > > > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=12f7bc5a500000
> > > > C reproducer: https://syzkaller.appspot.com/x/repro.c?x=10934cf2500000
> > >
> > > Alex, your series "per memcg lru lock" changed the vmscan code rather a
> > > lot.  Could you please take a look at that reproducer?
> >
> > Andrew, I promised I'd take a look at this syzreport too (though I think
> > we're agreed by now that it has nothing to do with per-memcg lru_lock).
> >
> > I did try, but (unlike Alex) did not manage to get the reproducer to
> > reproduce it.  No doubt I did not try hard enough: I did rather lose
> > interest after seeing that it appears to involve someone with
> > CAP_SYS_ADMIN doing an absurdly large ioctl(BLKFRASET) on /dev/nullb0
> > ("Null test block driver" enabled via CONFIG_BLK_DEV_NULL_BLK=y: that I
> > did enable) and faulting from it: presumably triggering an absurd amount
> > of readahead.
> >
> > Cc'ing Matthew since he has a particular interest in readahead, and
> > might be inspired to make some small safe change that would fix this,
> > and benefit realistic cases too; but on the whole it didn't look worth
> > worrying about - or at least not by me.
>
> Oh, interesting.  Thanks for looping me in, I hadn't looked at this one
> at all.  Building on the debugging you did, this is the interesting
> part of the backtrace to me:
>
> > > >  try_to_free_pages+0x29f/0x720 mm/vmscan.c:3264
> > > >  __perform_reclaim mm/page_alloc.c:4360 [inline]
> > > >  __alloc_pages_direct_reclaim mm/page_alloc.c:4381 [inline]
> > > >  __alloc_pages_slowpath.constprop.0+0x917/0x2510 mm/page_alloc.c:4785
> > > >  __alloc_pages_nodemask+0x5f0/0x730 mm/page_alloc.c:4995
> > > >  alloc_pages_current+0x191/0x2a0 mm/mempolicy.c:2271
> > > >  alloc_pages include/linux/gfp.h:547 [inline]
> > > >  __page_cache_alloc mm/filemap.c:977 [inline]
> > > >  __page_cache_alloc+0x2ce/0x360 mm/filemap.c:962
> > > >  page_cache_ra_unbounded+0x3a1/0x920 mm/readahead.c:216
> > > >  do_page_cache_ra+0xf9/0x140 mm/readahead.c:267
> > > >  do_sync_mmap_readahead mm/filemap.c:2721 [inline]
> > > >  filemap_fault+0x19d0/0x2940 mm/filemap.c:2809
>
> So ra_pages has been set to something ridiculously large, and as
> a result, we call do_page_cache_ra() asking to read more memory than
> is available in the machine.  Funny thing, we actually have a function
> to prevent this kind of situation, and it's force_page_cache_ra().
>
> So this might fix the problem.  I only tested that it compiles.  I'll
> be happy to write up a proper changelog and sign-off for it if it
> works ... it'd be good to get it some soak testing on a variety of
> different workloads; changing this stuff is enormously subtle.
>
> As a testament to that, I think Fengguang got it wrong in commit
> 2cbea1d3ab11 -- async_size should have been 3 * ra_pages / 4, not
> ra_pages / 4 (because we read-behind by half the range, so we're
> looking for a page fault to happen a quarter of the way behind this
> fault ...)
>
> This is partially Roman's fault, see commit 600e19afc5f8.

Hi Matthew,

Lol, I had a hard time imagining how I had managed to break readahead
with my recent changes, until I looked into the commit: it's from 2015 :)

I wonder how a __GFP_NORETRY allocation is causing a 143-second stall.
The loop in page_cache_ra_unbounded() should in theory be easily broken
on the first allocation failure.  So it could be that, (partially)
because of the unrealistically high ra limit, memory is becoming
completely depleted and the memory pressure is insane.
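
For anyone else trying to reproduce it by hand, the trigger as Hugh
describes it above boils down to something like this (my untested
sketch, not the actual syzkaller reproducer; it assumes
CONFIG_BLK_DEV_NULL_BLK=y so that /dev/nullb0 exists):

#include <fcntl.h>
#include <linux/fs.h>	/* BLKFRASET */
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

int main(void)
{
	int fd = open("/dev/nullb0", O_RDONLY);

	if (fd < 0) { perror("open"); return 1; }

	/* BLKFRASET takes the new read-ahead size in 512-byte sectors
	 * directly as the ioctl argument and needs CAP_SYS_ADMIN; an
	 * absurdly large value inflates the bdi's ra_pages accordingly. */
	if (ioctl(fd, BLKFRASET, 1UL << 40) < 0) {
		perror("BLKFRASET");
		return 1;
	}

	/* Fault on a mapping of the device: filemap_fault() ->
	 * do_sync_mmap_readahead() then sizes the readahead window
	 * from the inflated ra_pages. */
	char *p = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);

	if (p == MAP_FAILED) { perror("mmap"); return 1; }

	return p[0];	/* this fault kicks off the huge readahead */
}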

Anyway, your change looks good to me.  I'll ack the full version.

Thanks!
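
P.S. To make sure I follow the 2cbea1d3ab11 remark, here is the
arithmetic as I understand it: the read-around window is centered on
the faulting index F, so start = F - ra_pages/2 and size = ra_pages,
and the PG_readahead marker goes on the page at
start + size - async_size.  With async_size = ra_pages / 4 the marker
lands at F + ra_pages/4, a quarter of the window ahead of the fault;
with async_size = 3 * ra_pages / 4 it would land at F - ra_pages/4,
a quarter of the way behind it, which matches your parenthetical.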