On 04/25/2015 11:56, Joshua Kinard wrote:
> I keep tripping up a BUG() in isolate_lru_pages in mm/vmscan.c:1345:
>
> 	switch (__isolate_lru_page(page, mode)) {
> 	case 0:
> 		nr_pages = hpage_nr_pages(page);
> 		mem_cgroup_update_lru_size(lruvec, lru, -nr_pages);
> 		list_move(&page->lru, dst);
> 		nr_taken += nr_pages;
> 		break;
>
> 	case -EBUSY:
> 		/* else it is being freed elsewhere */
> 		list_move(&page->lru, src);
> 		continue;
>
> 	default:
> 		BUG();
> 	}
>
> This is on an SGI Onyx2 platform (MIPS, IP27), two node boards (4x R14000
> CPUs), and 8G of RAM. The problem appears tied to heavy disk I/O, typically
> writes. I can reproduce sometimes with a long bonnie++ run, but I haven't
> gotten a recent panic() message under 4.0 yet. Most of the time, it silently
> hardlocks. I only have serial console access at 9600bps, so it may lock too
> fast before the serial driver can dump the panic.
>
> Is there any information behind the purpose or triggers of this BUG()? I went
> back in git all the way to the initial 2006 commit that added this function,
> but could not find any comments or explanation of just what it's protecting
> against. That makes it hard to know where to start debugging.
>
> I've already tried switching filesystems, first ext4, now XFS. Enabling
> CONFIG_NUMA seems to make it harder to trigger, but that's not an objective
> observation. An md RAID resync doesn't appear to trigger it either.

This patch seems to explain things a little bit (from 20070316):
http://marc.info/?l=linux-mm-commits&m=117401513810763&w=2

> Subject: lumpy: back out removal of active check in isolate_lru_pages
> From: Andy Whitcroft <apw@xxxxxxxxxxxx>
>
> As pointed out by Christop Lameter it should not be possible for a page to
> change its active/inactive state without taking the lru_lock. Reinstate this
> safety net.
>
> Signed-off-by: Andy Whitcroft <apw@xxxxxxxxxxxx>
> Acked-by: Mel Gorman <mel@xxxxxxxxx>
> Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> ---
>
>  mm/vmscan.c |    7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff -puN mm/vmscan.c~lumpy-back-out-removal-of-active-check-in-isolate_lru_pages mm/vmscan.c
> --- a/mm/vmscan.c~lumpy-back-out-removal-of-active-check-in-isolate_lru_pages
> +++ a/mm/vmscan.c
> @@ -686,10 +686,13 @@ static unsigned long isolate_lru_pages(u
>  			nr_taken++;
>  			break;
>
> -		default:
> -			/* page is being freed, or is a missmatch */
> +		case -EBUSY:
> +			/* else it is being freed elsewhere */
>  			list_move(&page->lru, src);
>  			continue;
> +
> +		default:
> +			BUG();
>  		}
>
>  		if (!order)

So if my reading is correct, the BUG() is being triggered because a page might
be changing its active/inactive state w/o taking the lru_lock. Given that the
SGI IP27 platform is an early NUMA machine and nodes can have a bit of physical
distance between them (thus some latency), could this be a sign of some kind of
SMP race condition specific to this platform?

--J
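
To make the three-way contract in that switch concrete, here is a rough
userspace sketch. It is not kernel code. The toy_page struct, the toy_*
functions, and the exact conditions that produce each return value are all
hypothetical and only model my reading of the patch above: 0 means the page
was taken, -EBUSY means it is being freed elsewhere, and anything else is the
"should never happen under lru_lock" case that hits BUG().

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the page state that matters in this sketch. */
struct toy_page {
	bool on_lru;   /* models "page is on an LRU list" */
	bool active;   /* models the active/inactive flag */
	int  refcount; /* 0 models "already being freed elsewhere" */
};

enum toy_outcome { TOY_TAKEN, TOY_PUT_BACK, TOY_BUG };

/*
 * Toy model of the isolate helper's return contract (an assumption, not
 * the kernel's implementation): -EINVAL for a page whose state does not
 * match the list being scanned, -EBUSY for a page on its way to being
 * freed, 0 otherwise.
 */
static int toy_isolate(const struct toy_page *page, bool want_active)
{
	if (!page->on_lru || page->active != want_active)
		return -EINVAL;
	if (page->refcount == 0)
		return -EBUSY;
	return 0;
}

/* Mirrors the shape of the switch: only 0 and -EBUSY are tolerated. */
static enum toy_outcome toy_classify(const struct toy_page *page,
				     bool want_active)
{
	switch (toy_isolate(page, want_active)) {
	case 0:
		return TOY_TAKEN;
	case -EBUSY:
		return TOY_PUT_BACK;
	default:
		return TOY_BUG; /* where the real code calls BUG() */
	}
}
```

In this model, the only way to reach TOY_BUG while scanning the active list is
for a page on that list to have had its flags changed behind the scanner's
back, which is the kind of race I am asking about.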