On Mar 6, 2014, at 14:46, Andrew Martin <amartin@xxxxxxxxxxx> wrote:

>> From: "Trond Myklebust" <trond.myklebust@xxxxxxxxxxxxxxx>
>> On Mar 6, 2014, at 13:35, Andrew Martin <amartin@xxxxxxxxxxx> wrote:
>>
>>>> From: "Jim Rees" <rees@xxxxxxxxx>
>>>> Why would a bunch of blocked apaches cause high load and reboot?
>>> What I believe happens is the apache child processes go to serve
>>> these requests and then block in uninterruptible sleep. Thus, there
>>> are fewer and fewer child processes to handle new incoming requests.
>>> Eventually, apache would normally kill said children (e.g. after a
>>> child handles a certain number of requests), but it cannot kill them
>>> because they are in uninterruptible sleep. As more and more incoming
>>> requests are queued (and fewer and fewer child processes are available
>>> to serve the requests), the load climbs.
>>
>> Does ‘top’ support this theory? Presumably you should see a handful of
>> non-sleeping apache threads dominating the load when it happens.
> Yes, it looks like the root apache process is still running:
> root 1773 0.0 0.1 244176 16588 ? Ss Feb18 0:42 /usr/sbin/apache2 -k start
>
> All of the others, the children (running as the www-data user), are marked as D.
>
>> Why is the server becoming ‘unavailable’ in the first place? Are you taking
>> it down?
> I do not know the answer to this. A single NFS server has an export that is
> mounted on multiple servers, including this web server. The web server is
> running Ubuntu 10.04 LTS 2.6.32-57 with nfs-common 1.2.0. Intermittently, the
> NFS mountpoint will become inaccessible on this web server; processes that
> attempt to access it will block in uninterruptible sleep. While this is
> occurring, the NFS export is still accessible normally from other clients,
> so it appears to be related to this particular machine (probably since it is
> the last machine running Ubuntu 10.04 and not 12.04).
> I do not know if this is a bug in 2.6.32 or another package on the system,
> but at this time I cannot upgrade it to 12.04, so I need to find a solution
> on 10.04.
>
> I attempted to get a backtrace from one of the uninterruptible apache processes:
> echo w > /proc/sysrq-trigger
>
> Here's one example:
> [1227348.003904] apache2 D 0000000000000000 0 10175 1773 0x00000004
> [1227348.003906] ffff8802813178c8 0000000000000082 0000000000015e00 0000000000015e00
> [1227348.003908] ffff8801d88f03d0 ffff880281317fd8 0000000000015e00 ffff8801d88f0000
> [1227348.003910] 0000000000015e00 ffff880281317fd8 0000000000015e00 ffff8801d88f03d0
> [1227348.003912] Call Trace:
> [1227348.003918] [<ffffffffa00a5ca0>] ? rpc_wait_bit_killable+0x0/0x40 [sunrpc]
> [1227348.003923] [<ffffffffa00a5cc4>] rpc_wait_bit_killable+0x24/0x40 [sunrpc]
> [1227348.003925] [<ffffffff8156a41f>] __wait_on_bit+0x5f/0x90
> [1227348.003930] [<ffffffffa00a5ca0>] ? rpc_wait_bit_killable+0x0/0x40 [sunrpc]
> [1227348.003932] [<ffffffff8156a4c8>] out_of_line_wait_on_bit+0x78/0x90
> [1227348.003934] [<ffffffff81086790>] ? wake_bit_function+0x0/0x40
> [1227348.003939] [<ffffffffa00a6611>] __rpc_execute+0x191/0x2a0 [sunrpc]
> [1227348.003945] [<ffffffffa00a6746>] rpc_execute+0x26/0x30 [sunrpc]

That basically means that the process is hanging in the RPC layer, somewhere
in the state machine. ‘echo 0 >/proc/sys/sunrpc/rpc_debug’ as the ‘root’ user
should give us a dump of which state these RPC calls are in. Can you please
try that?

_________________________________
Trond Myklebust
Linux NFS client maintainer, PrimaryData
trond.myklebust@xxxxxxxxxxxxxxx