Re: hunting an IO hang

Johannes Weiner <hannes@xxxxxxxxxxx> · Mon, 17 Jan 2011 17:32:22 +0100

On Mon, Jan 17, 2011 at 10:02:47AM -0500, Chris Mason wrote:
> Excerpts from Chris Mason's message of 2011-01-17 09:07:40 -0500:
> 
> [ various crashes under load with current git ]
> 
> > 
> > I did have CONFIG_COMPACTION off for my latest reproduce.  The last two
> > have been corruption on the page->lru lists, maybe that'll help narrow
> > our bisect pool down.
> 
> I've reverted 744ed1442757767ffede5008bb13e0805085902e, and
> d8505dee1a87b8d41b9c4ee1325cd72258226fbc and the run has lasted longer
> than any runs in the past.
> 
> I'll give this a few hours but they seem the most related to my various
> crashes so far.

I went through the new batched activation code.  Shaohua, can you
explain to me why the following sequence is not possible?

1. CPU A and B schedule activation of a page (PG_lru && !PG_active)
2. CPU A flushes the page to the active list (PG_lru && PG_active)
3. CPU A isolates the page for scanning/migration and
   puts it on private list (!PG_lru && PG_active)
4. CPU B flushes its pending activation of the same page, which is
   now (!PG_lru && PG_active).  The deferred activation code assumes
   putback mode and adds the page to the active list, thus corrupting
   the link to the private list of CPU A
5. CPU A does list_del() from the private list (like unmap_and_move() does)
   and trips up on the corruption
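
To make the sequence above easier to poke at, here is a rough user-space
sketch of it.  Nothing in it is kernel code: the toy_* helpers, the
struct layout and the require_lru switch are all made up for
illustration, and the "putback mode" branch only models my reading of
the deferred activation path.  The two CPUs are simulated by two
sequential flush calls on the same page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for struct list_head and the page flags (hypothetical). */
struct list_node { struct list_node *prev, *next; };

struct toy_page {
	struct list_node lru;
	bool pg_lru;
	bool pg_active;
};

static void toy_list_init(struct list_node *head)
{
	head->prev = head->next = head;
}

static void toy_list_add(struct list_node *node, struct list_node *head)
{
	node->next = head->next;
	node->prev = head;
	head->next->prev = node;
	head->next = node;
}

static void toy_list_del(struct list_node *node)
{
	node->prev->next = node->next;
	node->next->prev = node->prev;
	node->prev = node->next = node;
}

/* Intact means every forward link is mirrored by a back link (bounded walk). */
static bool toy_list_intact(const struct list_node *head)
{
	const struct list_node *pos = head;
	int steps;

	for (steps = 0; steps < 16; steps++) {
		if (pos->next->prev != pos)
			return false;
		pos = pos->next;
		if (pos == head)
			return true;
	}
	return false;
}

/* Modeled flush of one pending activation; require_lru is an assumed guard. */
static void toy_flush_activation(struct toy_page *page,
				 struct list_node *active, bool require_lru)
{
	if (page->pg_lru && !page->pg_active) {
		/* Normal case: move the page from the inactive to the active list. */
		toy_list_del(&page->lru);
		toy_list_add(&page->lru, active);
		page->pg_active = true;
	} else if (!page->pg_lru && !require_lru) {
		/* Modeled "putback mode": link the page without unlinking it first. */
		toy_list_add(&page->lru, active);
		page->pg_lru = true;
	}
}

/* Replays steps 1-5; returns whether CPU A's private list survived. */
bool toy_race_keeps_private_list_intact(bool require_lru)
{
	struct list_node inactive, active, private_list;
	struct toy_page page = { .pg_lru = true, .pg_active = false };

	toy_list_init(&inactive);
	toy_list_init(&active);
	toy_list_init(&private_list);
	toy_list_add(&page.lru, &inactive);

	/* Step 1: both CPUs have the page queued (modeled by the two flushes). */
	/* Step 2: CPU A flushes, page becomes PG_lru && PG_active. */
	toy_flush_activation(&page, &active, require_lru);
	assert(page.pg_lru && page.pg_active);

	/* Step 3: CPU A isolates the page onto its private list. */
	toy_list_del(&page.lru);
	page.pg_lru = false;
	toy_list_add(&page.lru, &private_list);
	assert(!page.pg_lru && page.pg_active);

	/* Step 4: CPU B flushes its stale entry for the same page. */
	toy_flush_activation(&page, &active, require_lru);

	/* Step 5: CPU A unlinks the page, believing it is on the private list. */
	toy_list_del(&page.lru);

	return toy_list_intact(&private_list);
}
```

Calling toy_race_keeps_private_list_intact(false) from a small main()
replays the unguarded sequence and reports the private list as broken:
its head still points at a page whose links were rewritten in step 4.
With require_lru set to true, the second flush skips the isolated page
and the private list stays intact.  Note that in this toy model the
list_del() in step 5 itself goes through, because the page's own
neighbours are consistent on the active list; the damage is on the
private list head that CPU A still holds.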

	Hannes
