Re: Optimal NFS mount options to safely allow interrupts and timeouts on newer kernels

Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx> · Thu, 6 Mar 2014 14:52:35 -0500

On Mar 6, 2014, at 14:46, Andrew Martin <amartin@xxxxxxxxxxx> wrote:

>> From: "Trond Myklebust" <trond.myklebust@xxxxxxxxxxxxxxx>
>> On Mar 6, 2014, at 13:35, Andrew Martin <amartin@xxxxxxxxxxx> wrote:
>> 
>>>> From: "Jim Rees" <rees@xxxxxxxxx>
>>>> Why would a bunch of blocked apaches cause high load and reboot?
>>> What I believe happens is the apache child processes go to serve
>>> these requests and then block in uninterruptible sleep. Thus, there
>>> are fewer and fewer child processes to handle new incoming requests.
>>> Eventually, apache would normally kill said children (e.g. after a
>>> child handles a certain number of requests), but it cannot kill them
>>> because they are in uninterruptible sleep. As more and more incoming
>>> requests are queued (and fewer and fewer child processes are available
>>> to serve the requests), the load climbs.
>> 
>> Does ‘top’ support this theory? Presumably you should see a handful of
>> non-sleeping apache threads dominating the load when it happens.
> Yes, it looks like the root apache process is still running:
> root      1773  0.0  0.1 244176 16588 ?        Ss   Feb18   0:42 /usr/sbin/apache2 -k start
> 
> All of the others, the children (running as the www-data user), are marked as D.
> 
>> Why is the server becoming ‘unavailable’ in the first place? Are you taking
>> it down?
> I do not know the answer to this. A single NFS server has an export that is
> mounted on multiple servers, including this web server. The web server is
> running Ubuntu 10.04 LTS (kernel 2.6.32-57) with nfs-common 1.2.0.
> Intermittently, the NFS mountpoint will become inaccessible on this web
> server; processes that attempt to access it will block in uninterruptible
> sleep. While this is occurring, the NFS export is still accessible normally
> from other clients, so it appears to be related to this particular machine
> (probably since it is the last machine running Ubuntu 10.04 and not 12.04).
> I do not know if this is a bug in 2.6.32 or another package on the system,
> but at this time I cannot upgrade it to 12.04, so I need to find a solution
> on 10.04.
> 
> I attempted to get a backtrace from one of the uninterruptible apache processes:
> echo w > /proc/sysrq-trigger
> 
> Here's one example:
> [1227348.003904] apache2       D 0000000000000000     0 10175   1773 0x00000004
> [1227348.003906]  ffff8802813178c8 0000000000000082 0000000000015e00 0000000000015e00
> [1227348.003908]  ffff8801d88f03d0 ffff880281317fd8 0000000000015e00 ffff8801d88f0000
> [1227348.003910]  0000000000015e00 ffff880281317fd8 0000000000015e00 ffff8801d88f03d0
> [1227348.003912] Call Trace:
> [1227348.003918]  [<ffffffffa00a5ca0>] ? rpc_wait_bit_killable+0x0/0x40 [sunrpc]
> [1227348.003923]  [<ffffffffa00a5cc4>] rpc_wait_bit_killable+0x24/0x40 [sunrpc]
> [1227348.003925]  [<ffffffff8156a41f>] __wait_on_bit+0x5f/0x90
> [1227348.003930]  [<ffffffffa00a5ca0>] ? rpc_wait_bit_killable+0x0/0x40 [sunrpc]
> [1227348.003932]  [<ffffffff8156a4c8>] out_of_line_wait_on_bit+0x78/0x90
> [1227348.003934]  [<ffffffff81086790>] ? wake_bit_function+0x0/0x40
> [1227348.003939]  [<ffffffffa00a6611>] __rpc_execute+0x191/0x2a0 [sunrpc]
> [1227348.003945]  [<ffffffffa00a6746>] rpc_execute+0x26/0x30 [sunrpc]

That basically means that the process is hanging in the RPC layer, somewhere in the state machine. ‘echo 0 >/proc/sys/sunrpc/rpc_debug’ as the ‘root’ user should give us a dump of which state these RPC calls are in. Can you please try that?
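
For reference, a minimal sketch of the two diagnostic steps discussed in this thread: listing the tasks stuck in state D, and requesting the RPC task dump. The function names list_d_state and dump_rpc_tasks are hypothetical, and the sketch assumes the dump is written to the kernel log and can be read back with dmesg.

```shell
#!/bin/sh
# Sketch only: the function names below are hypothetical helpers,
# not something taken from the thread or from nfs-utils.

# Print PID, state, and command for tasks in uninterruptible sleep (state D).
# Reads `ps -eo pid,stat,comm`-style rows on stdin, so it also works on
# saved output. The header row is skipped because "STAT" does not start with D.
list_d_state() {
    awk '$2 ~ /^D/ { print $1, $2, $3 }'
}

# Ask the sunrpc layer to dump its pending RPC tasks, then show the tail of
# the kernel log. Must be run as root. Assumption: the dump lands in the
# kernel log; the optional first argument is the number of lines to show.
dump_rpc_tasks() {
    echo 0 > /proc/sys/sunrpc/rpc_debug && dmesg | tail -n "${1:-50}"
}
```

Possible usage, as root on the affected client: run `ps -eo pid,stat,comm | list_d_state` to confirm which apache2 children are blocked, then `dump_rpc_tasks 100` to capture the state of the pending RPC calls.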

_________________________________
Trond Myklebust
Linux NFS client maintainer, PrimaryData
trond.myklebust@xxxxxxxxxxxxxxx
