On Fri, Jan 07, 2011 at 02:52:59PM +0000, Mel Gorman wrote: > On Thu, Jan 06, 2011 at 05:20:25PM -0800, Hugh Dickins wrote: > > Hi Mel, > > > > Here are the traces of two concurrent "cp -a kerneltree elsewhere"s > > which have hung on mmotm: in limited RAM, on a PowerPC > > How limited in RAM and how many CPUs? > > > - I don't have > > an explanation for why I can reproduce it in minutes on that box but > > never yet on the x86s. > > > > Strongest bet is simply that compaction is not triggering for you on the > x86 boxes. Monitor "grep compact /proc/vmstat" on the two machines and > see if the counters are growing on powerpc and not on x86. I'm trying to > reproduce the problem locally but no luck yet. > > > Perhaps we can get it to happen with just one cp: the second cp here > > seemed to be deadlocking itself, unmap_and_move()'s force lock_page > > waiting on a page which its page cache readahead already holds locked. > > > > cp D 000000000fea3110 0 18874 18873 0x00008010 > > Call Trace: > > .__switch_to+0xcc/0x110 > > .schedule+0x670/0x7b0 > > .io_schedule+0x50/0x8c > > .sync_page+0x84/0xa0 > > .sync_page_killable+0x10/0x48 > > .__wait_on_bit_lock+0x9c/0x140 > > .__lock_page_killable+0x74/0x98 > > .do_generic_file_read+0x2b0/0x504 > > .generic_file_aio_read+0x214/0x29c > > .do_sync_read+0xac/0x10c > > .vfs_read+0xd0/0x1a0 > > .SyS_read+0x58/0xa0 > > syscall_exit+0x0/0x40 > > syscall_exit+0x0/0x40 > > > > cp D 000000000fea3110 0 18876 18875 0x00008010 > > Call Trace: > > 0xc000000001343b68 (unreliable) > > .__switch_to+0xcc/0x110 > > .schedule+0x670/0x7b0 > > .io_schedule+0x50/0x8c > > .sync_page+0x84/0xa0 > > .__wait_on_bit_lock+0x9c/0x140 > > .__lock_page+0x74/0x98 > > .unmap_and_move+0xfc/0x380 > > .migrate_pages+0xbc/0x18c > > .compact_zone+0xbc/0x400 > > .compact_zone_order+0xc8/0xf4 > > .try_to_compact_pages+0x104/0x1b8 > > .__alloc_pages_direct_compact+0xa8/0x228 > > .__alloc_pages_nodemask+0x42c/0x730 > > .allocate_slab+0x84/0x168 > > .new_slab+0x58/0x198 > > .__slab_alloc+0x1ec/0x430 > > .kmem_cache_alloc+0x7c/0xe0 > > .radix_tree_preload+0x94/0x140 > > .add_to_page_cache_locked+0x70/0x1f0 > > .add_to_page_cache_lru+0x50/0xac > > .mpage_readpages+0xcc/0x198 > > .ext3_readpages+0x28/0x40 > > .__do_page_cache_readahead+0x1ac/0x2ac > > .ra_submit+0x28/0x38 > > Something is odd right here. I would have expected entries in the > calli stack containing > > .ondemand_readahead > .page_cache_sync_readahead > > I am going to have to assume these functions were really called > otherwise the ra_submit is a mystery :( > > > .do_generic_file_read+0xe8/0x504 > > .generic_file_aio_read+0x214/0x29c > > .do_sync_read+0xac/0x10c > > .vfs_read+0xd0/0x1a0 > > .SyS_read+0x58/0xa0 > > syscall_exit+0x0/0x40 > > > > I haven't made a patch for it, just hacked unmap_and_move() to say > > "if (!0)" instead of "if (!force)" to get on with my testing. I expect > > you'll want to pass another arg down to migrate_pages() to prevent > > setting force, or give it some flags, or do something with PF_MEMALLOC. > > > > I tend to agree but I'm failing to see how it might be happening right now. > The callchain looks something like > > do_generic_file_read > # page is not found > page_cache_sync_readahead > ondemand_readahead > ra_submit > __do_page_cache_readahead > # Allocates a bunch of pages > # Sets PageReadahead. Otherwise the pages initialised > # and they are not on the LRU yet > read_pages > # Calls mapping->readpages which calls mpage_readpages > mpage_readpages > # For the list of pages (index initialised), add > # each of them to the LRU. Adding to the LRU > # locks the page and should return the page > # locked. > add_to_page_cache_lru > # sets PageSwapBacked > add_to_page_cache > # locks page > add_to_page_cache_locked > # preloads radix tree > radix_tree_preload > # DEADLOCK HERE. This is what does not > # make sense. Compaction could not be > # be finding the page on the LRU as > # lru_cache_add_file() has not been > # called yet for this page > > So I don't think we are locking on the same page. > > Here is a possibility. mpage_readpages() is reading ahead so there are obviously > pages that are not Uptodate. It queues these for asynchronous read with > block_read_full_page(), returns and adds the page to the LRU (so compaction > is now able to find it). IO starts at some time in the future with the page > still locked and gets unlocked at the end of IO by end_buffer_async_read(). > > Between when IO is queued and it completes, a new page is being added to > the LRU, the radix tree is loaded and compaction kicks off trying to > lock the same page that is not up to date yet. Something is preventing > the IO completing and the page being unlocked but I'm missing what that > might be. > > Does this sound plausible? I'll keep looking but I wanted to see if > someone spotted quickly a major flaw in the reasoning or have a quick > guess as to why the page might not be getting unlocked at the end of IO > properly. > I've still not figured out why the pages are not getting unlocked again but can you try the following bodge patch to confirm it's something to do with !Uptodate pages please? ==== CUT HERE ==== diff --git a/mm/migrate.c b/mm/migrate.c index e12f717..66a2aac 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -612,6 +612,23 @@ static int move_to_new_page(struct page *newpage, struct page *page, } /* + * HACK + * Do not allow direct compaction to migrate !Uptodate pages. It is possible + * the caller has locked the page for IO to complete and for some reason the + * IO is not completing and unlocking the page leading to deadlock + */ +static inline bool safe_to_lock_page(struct page *page) +{ + if (!(current->flags & PF_MEMALLOC)) + return true; + + if (!PageUptodate(page)) + return false; + + return true; +} + +/* * Obtain the lock on page, remove all ptes and migrate the page * to the newly allocated page in newpage. */ @@ -642,7 +659,7 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private, rc = -EAGAIN; if (!trylock_page(page)) { - if (!force) + if (!force || !safe_to_lock_page(page)) goto move_newpage; lock_page(page); } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxxx For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/ Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>