MIPS: BUG() in isolate_lru_pages in mm/vmscan.c?

Joshua Kinard <kumba@xxxxxxxxxx> · Sat, 25 Apr 2015 11:56:12 -0400

I keep tripping a BUG() in isolate_lru_pages() at mm/vmscan.c:1345:

	switch (__isolate_lru_page(page, mode)) {
	case 0:
		nr_pages = hpage_nr_pages(page);
		mem_cgroup_update_lru_size(lruvec, lru, -nr_pages);
		list_move(&page->lru, dst);
		nr_taken += nr_pages;
		break;

	case -EBUSY:
		/* else it is being freed elsewhere */
		list_move(&page->lru, src);
		continue;

	default:
		BUG();
	}

This is on an SGI Onyx2 (MIPS, IP27) with two node boards (4x R14000 CPUs)
and 8G of RAM.  The problem appears tied to heavy disk I/O, typically writes.
I can sometimes reproduce it with a long bonnie++ run, but I haven't yet
gotten a panic() message under 4.0.  Most of the time, the machine silently
hardlocks.  I only have serial console access at 9600bps, so it may lock up
before the serial driver can dump the panic.

Is there any information on the purpose of this BUG(), or on what triggers
it?  I went back in git all the way to the initial 2006 commit that added this
function, but could not find any comments or explanation of just what it's
protecting against.  That makes it hard to know where to start debugging.
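
Going only by the excerpt above, the default arm is reached for any return
value from __isolate_lru_page() other than 0 or -EBUSY.  To make that concrete,
here is a minimal user-space sketch of the dispatch.  It is not kernel code:
the enum and the classify_isolate_result() helper are names I made up for
illustration, and I'm only assuming that -EINVAL is one of the values the real
function can return.

```c
#include <errno.h>

/* Hypothetical labels for the three arms of the switch quoted above. */
enum outcome {
	OUTCOME_TAKEN,		/* case 0: page moved to dst and counted */
	OUTCOME_PUT_BACK,	/* case -EBUSY: page moved back to src */
	OUTCOME_BUG		/* default: the kernel would hit BUG() */
};

/*
 * Model of the dispatch only.  'ret' stands in for whatever
 * __isolate_lru_page(page, mode) returned; nothing here touches
 * real pages or LRU lists.
 */
static enum outcome classify_isolate_result(int ret)
{
	switch (ret) {
	case 0:
		return OUTCOME_TAKEN;
	case -EBUSY:
		return OUTCOME_PUT_BACK;
	default:
		/* Any other value, e.g. -EINVAL (assumed), lands here. */
		return OUTCOME_BUG;
	}
}
```

So classify_isolate_result(-EINVAL) gives OUTCOME_BUG, as does any other
unexpected value.  What I can't tell from the source is which page state makes
the real function return such a value.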

I've already tried switching filesystems, first ext4, now XFS.  Enabling
CONFIG_NUMA seems to make it harder to trigger, but that's only an impression,
not something I've measured.  An md RAID resync doesn't appear to trigger it
either.

Help?