On Wed, 27 Jun 2012 18:43:56 +0000
"Myklebust, Trond" <Trond.Myklebust@xxxxxxxxxx> wrote:

> On Wed, 2012-06-27 at 11:54 -0400, Jeff Layton wrote:
> > On Fri, 15 Jun 2012 21:25:10 +0000
> > "Myklebust, Trond" <Trond.Myklebust@xxxxxxxxxx> wrote:
> > 
> > > On Fri, 2012-06-15 at 09:21 -0400, Jeff Layton wrote:
> > > > On Fri, 15 Jun 2012 22:54:10 +1000
> > > > Harshula <harshula@xxxxxxxxxx> wrote:
> > > > 
> > > > > Hi All,
> > > > > 
> > > > > Can I please get your feedback on the following?
> > > > > 
> > > > > rpciod is blocked:
> > > > > ------------------------------------------------------------
> > > > > crash> bt 2507
> > > > > PID: 2507  TASK: ffff88103691ab40  CPU: 14  COMMAND: "rpciod/14"
> > > > >  #0 [ffff8810343bf2f0] schedule at ffffffff814dabd9
> > > > >  #1 [ffff8810343bf3b8] nfs_wait_bit_killable at ffffffffa038fc04 [nfs]
> > > > >  #2 [ffff8810343bf3c8] __wait_on_bit at ffffffff814dbc2f
> > > > >  #3 [ffff8810343bf418] out_of_line_wait_on_bit at ffffffff814dbcd8
> > > > >  #4 [ffff8810343bf488] nfs_commit_inode at ffffffffa039e0c1 [nfs]
> > > > >  #5 [ffff8810343bf4f8] nfs_release_page at ffffffffa038bef6 [nfs]
> > > > >  #6 [ffff8810343bf528] try_to_release_page at ffffffff8110c670
> > > > >  #7 [ffff8810343bf538] shrink_page_list.clone.0 at ffffffff81126271
> > > > >  #8 [ffff8810343bf668] shrink_inactive_list at ffffffff81126638
> > > > >  #9 [ffff8810343bf818] shrink_zone at ffffffff8112788f
> > > > > #10 [ffff8810343bf8c8] do_try_to_free_pages at ffffffff81127b1e
> > > > > #11 [ffff8810343bf958] try_to_free_pages at ffffffff8112812f
> > > > > #12 [ffff8810343bfa08] __alloc_pages_nodemask at ffffffff8111fdad
> > > > > #13 [ffff8810343bfb28] kmem_getpages at ffffffff81159942
> > > > > #14 [ffff8810343bfb58] fallback_alloc at ffffffff8115a55a
> > > > > #15 [ffff8810343bfbd8] ____cache_alloc_node at ffffffff8115a2d9
> > > > > #16 [ffff8810343bfc38] kmem_cache_alloc at ffffffff8115b09b
> > > > > #17 [ffff8810343bfc78] sk_prot_alloc at ffffffff81411808
> > > > > #18 [ffff8810343bfcb8] sk_alloc at ffffffff8141197c
> > > > > #19 [ffff8810343bfce8] inet_create at ffffffff81483ba6
> > > > > #20 [ffff8810343bfd38] __sock_create at ffffffff8140b4a7
> > > > > #21 [ffff8810343bfd98] xs_create_sock at ffffffffa01f649b [sunrpc]
> > > > > #22 [ffff8810343bfdd8] xs_tcp_setup_socket at ffffffffa01f6965 [sunrpc]
> > > > > #23 [ffff8810343bfe38] worker_thread at ffffffff810887d0
> > > > > #24 [ffff8810343bfee8] kthread at ffffffff8108dd96
> > > > > #25 [ffff8810343bff48] kernel_thread at ffffffff8100c1ca
> > > > > ------------------------------------------------------------
> > > > > 
> > > > > nfs_release_page() gives the kswapd process an exemption from
> > > > > being blocked. Should we do the same for rpciod?
> > > > > Otherwise rpciod could end up in a deadlock where it cannot
> > > > > continue without freeing memory that will only become available
> > > > > when rpciod does its work:
> > > > > ------------------------------------------------------------
> > > > > 479 /*
> > > > > 480  * Attempt to release the private state associated with a page
> > > > > 481  * - Called if either PG_private or PG_fscache is set on the page
> > > > > 482  * - Caller holds page lock
> > > > > 483  * - Return true (may release page) or false (may not)
> > > > > 484  */
> > > > > 485 static int nfs_release_page(struct page *page, gfp_t gfp)
> > > > > 486 {
> > > > > 487         struct address_space *mapping = page->mapping;
> > > > > 488 
> > > > > 489         dfprintk(PAGECACHE, "NFS: release_page(%p)\n", page);
> > > > > 490 
> > > > > 491         /* Only do I/O if gfp is a superset of GFP_KERNEL */
> > > > > 492         if (mapping && (gfp & GFP_KERNEL) == GFP_KERNEL) {
> > > > > 493                 int how = FLUSH_SYNC;
> > > > > 494 
> > > > > 495                 /* Don't let kswapd deadlock waiting for OOM RPC calls */
> > > > > 496                 if (current_is_kswapd())
> > > > > 497                         how = 0;
> > > > > 498                 nfs_commit_inode(mapping->host, how);
> > > > > 499         }
> > > > > 500         /* If PagePrivate() is set, then the page is not freeable */
> > > > > 501         if (PagePrivate(page))
> > > > > 502                 return 0;
> > > > > 503         return nfs_fscache_release_page(page, gfp);
> > > > > 504 }
> > > > > ------------------------------------------------------------
> > > > > 
> > > > > Another option is to change the priority of the memory allocation:
> > > > > net/ipv4/af_inet.c
> > > > > ------------------------------------------------------------
> > > > > 265 static int inet_create(struct net *net, struct socket *sock, int protocol,
> > > > > 266                        int kern)
> > > > > 267 {
> > > > > ...
> > > > > 345         sk = sk_alloc(net, PF_INET, GFP_KERNEL, answer_prot);
> > > > > ------------------------------------------------------------
> > > > > Considering this is generic net code, I assume the GFP_KERNEL will
> > > > > not be replaced with GFP_ATOMIC.
> > > > > 
> > > > > NOTE, this is on RHEL 6.1 kernel 2.6.32-131.6.1 and apparently uses
> > > > > 'legacy' workqueue code.
> > > > > 
> > > > > cya,
> > > > > #
> > > > 
> > > > I suspect this is also a problem in mainline, but maybe some of the
> > > > recent writeback changes prevent it...
> > > > 
> > > > I think the right solution here is to make nfs_release_page treat
> > > > rpciod similarly to kswapd. Easier said than done though -- you'll
> > > > need to come up with a way to determine whether you're running in
> > > > rpciod context...
> > > 
> > > No. The _right_ solution is to ensure that rpciod doesn't do
> > > allocations that result in a page reclaim... try_to_release_page()
> > > is just the tip of the iceberg of crazy deadlocks that this socket
> > > allocation can get us into.
> > > 
> > > Unfortunately, selinux & co. prevent us from allocating the sockets
> > > in user contexts, and anyway, having to wait for another thread to
> > > do the same allocation isn't going to help prevent the deadlock...
> > > 
> > > I know that Mel Gorman's NFS swap patches had some protections
> > > against this sort of deadlock. Perhaps we can look at how he was
> > > doing this?
> > 
> > Actually... now that I've looked at this, I think the right solution
> > here is to stop calling sock_release() on these sockets until we're
> > in xs_destroy. IOW, don't free the socket altogether unless we're
> > truly tearing down the connection for good.
> > That means though that we need to convert the places that currently
> > call rpc_xprt_ops->close to call something like a (new) ops->shutdown
> > routine instead, with the assumption of course that we can somehow
> > reset and reuse the socket afterward.
> > 
> > The question is how best to emulate the effect of ->close without
> > actually freeing the socket. Is it sufficient to call something like
> > kernel_sock_shutdown() on it, and twiddle all of the bits to ensure
> > that the connect_worker will get called afterward?
> 
> The reason why we close these sockets is that if the attempt at
> aborting the connection fails, then they typically end up in a
> TIME_WAIT state.

I'm still trying to wade through the xprtsock.c socket handling code,
but it looks like we currently tear down the connection in three
different ways:

xs_close: basically calls sock_release and gets rid of our reference to
an existing socket. Most of the places where we disconnect the socket
use this. After this, we end up with srcport == 0, which makes it pick
a new port.

xs_tcp_shutdown: calls ->shutdown on the socket, but doesn't free
anything. This also preserves the existing srcport.

xs_abort_connection: calls kernel_connect to reconnect the socket to an
AF_UNSPEC address (effectively disconnecting it?). This also preserves
the srcport. I guess we use this just before reconnecting when the
remote end drops the connection, since we don't need to be graceful
about tearing anything down at that point.

The last one actually does reuse the same socket, so my thinking was
that we could extend that scheme to the other cases. If we called
->shutdown on it and then reconnected it to AF_UNSPEC, would that
"reset" it back to a usable state? I've sketched out what I mean below.

If there really is no alternative to freeing the socket, then the only
real fix I can see is to set PF_MEMALLOC when we go to create it and
then reset it afterward (also sketched below). That's a pretty ugly
fix though...
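To make the first idea concrete, here's a completely untested sketch
(xs_reset_sock is a made-up name, purely for illustration). It just
combines what xs_tcp_shutdown and xs_abort_connection already do:

------------------------------------------------------------
/* Untested sketch: shut down and disconnect a socket so that it can
 * be reused by the next connect_worker run, instead of freeing it. */
static int xs_reset_sock(struct socket *sock)
{
	struct sockaddr any;

	/* Abort the connection without freeing the socket, as
	 * xs_tcp_shutdown does today. */
	kernel_sock_shutdown(sock, SHUT_RDWR);

	/* "Connect" to AF_UNSPEC to detach from the peer -- the same
	 * trick xs_abort_connection uses -- which should leave the
	 * socket reusable with its srcport intact. */
	memset(&any, 0, sizeof(any));
	any.sa_family = AF_UNSPEC;
	return kernel_connect(sock, &any, sizeof(any), 0);
}
------------------------------------------------------------

Whether the TCP state machine always comes out of that in a sane,
reconnectable state is exactly the part I'm not sure about, and as you
point out, a failed abort may still leave us stuck in TIME_WAIT.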
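And for reference, the ugly PF_MEMALLOC variant would look something
like the following (again untested, and the helper name is made up).
The point is just that PF_MEMALLOC lets the allocation dip into the
emergency reserves rather than recursing into NFS page reclaim:

------------------------------------------------------------
/* Untested sketch: create a socket with PF_MEMALLOC set so that the
 * allocator pulls from reserves instead of kicking off a reclaim
 * that can deadlock back into NFS. */
static int xs_create_sock_memalloc(int family, int type, int protocol,
				   struct socket **sockp)
{
	unsigned long pflags = current->flags;
	int err;

	current->flags |= PF_MEMALLOC;
	err = sock_create_kern(family, type, protocol, sockp);
	/* Restore only the PF_MEMALLOC bit, in case our caller
	 * already had it set. */
	current->flags = (current->flags & ~PF_MEMALLOC) |
			 (pflags & PF_MEMALLOC);
	return err;
}
------------------------------------------------------------

The downside is that rpciod would then eat into the memory reserves on
every reconnect, which is why I'd rather find a way to keep the socket
around instead.

-- 
Jeff Layton <jlayton@xxxxxxxxxx>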