On Thu, 28 Jun 2012 15:31:57 +0100 Mel Gorman <mgorman@xxxxxxx> wrote: > I am not on the linux-nfs list so I lack context about this issue so > take all this with a grain of salt. > > On Wed, Jun 20, 2012 at 07:14:22AM -0700, Jeff Layton wrote: > > On Fri, 15 Jun 2012 21:25:10 +0000 > > "Myklebust, Trond" <Trond.Myklebust@xxxxxxxxxx> wrote: > > > > > On Fri, 2012-06-15 at 09:21 -0400, Jeff Layton wrote: > > > > On Fri, 15 Jun 2012 22:54:10 +1000 > > > > Harshula <harshula@xxxxxxxxxx> wrote: > > > > > > > > > Hi All, > > > > > > > > > > Can I please get your feedback on the following? > > > > > > > > > > rpciod is blocked: > > > > > ------------------------------------------------------------ > > > > > crash> bt 2507 > > > > > PID: 2507 TASK: ffff88103691ab40 CPU: 14 COMMAND: "rpciod/14" > > > > > #0 [ffff8810343bf2f0] schedule at ffffffff814dabd9 > > > > > #1 [ffff8810343bf3b8] nfs_wait_bit_killable at ffffffffa038fc04 [nfs] > > > > > #2 [ffff8810343bf3c8] __wait_on_bit at ffffffff814dbc2f > > > > > #3 [ffff8810343bf418] out_of_line_wait_on_bit at ffffffff814dbcd8 > > > > > #4 [ffff8810343bf488] nfs_commit_inode at ffffffffa039e0c1 [nfs] > > > > > #5 [ffff8810343bf4f8] nfs_release_page at ffffffffa038bef6 [nfs] > > > > > #6 [ffff8810343bf528] try_to_release_page at ffffffff8110c670 > > > > > #7 [ffff8810343bf538] shrink_page_list.clone.0 at ffffffff81126271 > > > > > #8 [ffff8810343bf668] shrink_inactive_list at ffffffff81126638 > > > > > #9 [ffff8810343bf818] shrink_zone at ffffffff8112788f > > > > > #10 [ffff8810343bf8c8] do_try_to_free_pages at ffffffff81127b1e > > > > > #11 [ffff8810343bf958] try_to_free_pages at ffffffff8112812f > > > > > #12 [ffff8810343bfa08] __alloc_pages_nodemask at ffffffff8111fdad > > > > > #13 [ffff8810343bfb28] kmem_getpages at ffffffff81159942 > > > > > #14 [ffff8810343bfb58] fallback_alloc at ffffffff8115a55a > > > > > #15 [ffff8810343bfbd8] ____cache_alloc_node at ffffffff8115a2d9 > > > > > #16 [ffff8810343bfc38] kmem_cache_alloc at ffffffff8115b09b > > > > > #17 [ffff8810343bfc78] sk_prot_alloc at ffffffff81411808 > > > > > #18 [ffff8810343bfcb8] sk_alloc at ffffffff8141197c > > > > > #19 [ffff8810343bfce8] inet_create at ffffffff81483ba6 > > > > > #20 [ffff8810343bfd38] __sock_create at ffffffff8140b4a7 > > > > > #21 [ffff8810343bfd98] xs_create_sock at ffffffffa01f649b [sunrpc] > > > > > #22 [ffff8810343bfdd8] xs_tcp_setup_socket at ffffffffa01f6965 [sunrpc] > > > > > #23 [ffff8810343bfe38] worker_thread at ffffffff810887d0 > > > > > #24 [ffff8810343bfee8] kthread at ffffffff8108dd96 > > > > > #25 [ffff8810343bff48] kernel_thread at ffffffff8100c1ca > > > > > ------------------------------------------------------------ > > > > > > > > > > nfs_release_page() gives kswapd process an exemption from being blocked. > > > > > Should we do the same for rpciod. Otherwise rpciod could end up in a > > > > > deadlock where it can not continue without freeing memory that will only > > > > > become available when rpciod does its work: > > Typically processes don't get an exception as such. Filesystems and > callers involved in IO are meant to use GFP_NOFS and GFP_NOIO as > appropriate to stop reclaim causing them to block on themselves. > > > > > > ------------------------------------------------------------ > > > > > 479 /* > > > > > 480 * Attempt to release the private state associated with a page > > > > > 481 * - Called if either PG_private or PG_fscache is set on the page > > > > > 482 * - Caller holds page lock > > > > > 483 * - Return true (may release page) or false (may not) > > > > > 484 */ > > > > > 485 static int nfs_release_page(struct page *page, gfp_t gfp) > > > > > 486 { > > > > > 487 struct address_space *mapping = page->mapping; > > > > > 488 > > > > > 489 dfprintk(PAGECACHE, "NFS: release_page(%p)\n", page); > > > > > 490 > > > > > 491 /* Only do I/O if gfp is a superset of GFP_KERNEL */ > > > > > 492 if (mapping && (gfp & GFP_KERNEL) == GFP_KERNEL) { > > > > > 493 int how = FLUSH_SYNC; > > > > > 494 > > > > > 495 /* Don't let kswapd deadlock waiting for OOM RPC calls */ > > > > > 496 if (current_is_kswapd()) > > > > > 497 how = 0; > > > > > 498 nfs_commit_inode(mapping->host, how); > > > > > 499 } > > > > > 500 /* If PagePrivate() is set, then the page is not freeable */ > > > > > 501 if (PagePrivate(page)) > > > > > 502 return 0; > > > > > 503 return nfs_fscache_release_page(page, gfp); > > > > > 504 } > > > > > ------------------------------------------------------------ > > > > > > > > > > Another option is to change the priority of the memory allocation: > > > > > net/ipv4/af_inet.c > > > > > ------------------------------------------------------------ > > > > > 265 static int inet_create(struct net *net, struct socket *sock, int > > > > > protocol, > > > > > 266 int kern) > > > > > 267 { > > > > > ... > > > > > 345 sk = sk_alloc(net, PF_INET, GFP_KERNEL, answer_prot); > > > > > ------------------------------------------------------------ > > > > > Considering this is generic net code, I assume the GFP_KERNEL will not > > > > > be replaced with GFP_ATOMIC. > > > > > > > > > > NOTE, this is on RHEL 6.1 kernel 2.6.32-131.6.1 and apparently uses > > > > > 'legacy' workqueue code. > > > > > > > > > > cya, > > > > > # > > > > > > > > > > > > > I suspect this is also a problem in mainline, but maybe some of the > > > > recent writeback changes prevent it... > > > > > > > > I think the right solution here is to make nfs_release_page treat rpciod > > > > similarly to kswapd. Easier said than done though -- you'll need to > > > > come up with a way to determine if you're running in rpciod context... > > > > > > No. The _right_ solution is to ensure that rpciod doesn't do allocations > > > that result in a page reclaim... try_to_release_page() is just the tip > > > of the iceberg of crazy deadlocks that this socket allocation can get us > > > into. > > > > > Fully agreed. Callers that cannot dispatch IO or do filesystem based > operations should not specify __GFP_FS or __GFP_IO. > > > > Unfortunately, selinux & co. prevent us from allocating the sockets in > > > user contexts, and anyway, having to wait for another thread to do the > > > same allocation isn't doing to help prevent the deadlock... > > > > > > > My thinking was that we could return without releasing the page, and > > eventually kmalloc would get to one that was able to be reclaimed via > > other means. That's probably too optimistic though... > > > > This is what is meant to happen if you use GFP flags appropriately. The > pages that cannot be reclaimed are skipped over until a clean page can > be discarded for example. In some cases it might still be possible to > swap out a page. > > > > I know that Mel Gorman's NFS swap patches had some protections against > > > this sort of deadlock. Perhaps we can look at how he was doing this? > > > > > > > (cc'ing Mel in case he has thoughts) > > > > This is not what those patches are for. They are specific to the case > where the network traffic is related to swapping. It allows temporary > overriding of the watermarks to receive and send packets related to swap > and discards all others. It is not a general solution for the sort of > problem you are seeing. > > > I looked over Mel's patches and they mainly seem to affect how memory > > for the socket is handled after it has been created. In this case, we're > > hanging on the allocation of a new socket altogether. > > > > It does do this before calling xs_create_sock however: > > > > + if (xprt->swapper) > > + current->flags |= PF_MEMALLOC; > > + > > > > Perhaps we should just set PF_MEMALLOC unconditionally there? > > No, that will cause the system to deadlock under very heavy memory > pressure. It's not always easy to trigger such a situation but it is always > the risk with PF_MEMALLOC is abused. > > > We might > > still need Mel's other patches that add SOCK_MEMALLOC to ensure that it > > can actually be connected, but that's a bit of a different problem. > > > > Do not do this. The SOCK_MEMALLOC stuff is specific to swapping. It is > not a substitute to avoid using the correct flags. Abusing PF_MEMALLOC > leads to deadlock. > Ok, thanks for having a look. Sorry to add you in the middle of the thread, that does make it a bit difficult to understand what's happening. It would be nice if we could pass GFP flags appropriately here, but the problem is that the GFP flags for the socket allocation are buried way down deep inside of inet_create(). See the sk_alloc() call in there. Now, we could fix it up so that __sock_create passes a set of GFP flags all the way down to that point, but even that won't really save us here in all cases. There's a sock_alloc() call in there which will need an inode. __sock_create also does a request_module() and that can certainly require memory too. Good luck getting that to take GFP flags. The right approach I think is to do what Trond suggested an an earlier mail and leverage the PF_FSTRANS flag... -- Jeff Layton <jlayton@xxxxxxxxxx> -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html