On Thu, Jun 15, 2006 at 02:09:59PM -0500, David Teigland wrote:
> On Thu, Jun 15, 2006 at 01:43:25AM +0300, Anton Kornev wrote:
>
> > Is there any ideas of how to fix this? I mean either the reason ('D'
> > state of killed httpd-s) or consequences (the GFS filesystem fully or
> > partially become unavailable after this).
> >
> > I also appreciate any help with debugging the problem.
> >
> > I tried gfs_tool lockdump with decipher_lockstate_dump tool.
>
> I don't see anything wrong in the lockdumps you gave, although I'm not an
> expert at interpreting gfs lockdumps. Could you do a ps showing the wchan
> for those processes? Using sysrq to get a stack dump would also be useful.
> You might also do a dlm lock dump and pick out those locks:
>   echo "lockspace name" >> /proc/cluster/dlm_locks
>   cat /proc/cluster/dlm_locks
>
> I/O stuck in gnbd could also be a problem, I'm not sure what the signs of
> that might be apart from possibly the wchan.

To check for GNBD lockups, there are a couple of useful places to look.

Are there any messages in the logs of any of the nodes (particularly the
hanging gnbd client and the gnbd server node) that provide any clues?

Do a

# gnbd_import -l

on all the gnbd client machines. The 'State:' line is the important one.
For all the devices you are using, the first two values should be "Open"
and "Connected". If it doesn't say "Connected", you've lost the connection
to the server for some reason. The log messages should provide a clue.

If the last value says "Clear", then there is no outstanding IO to the
server. If it says "Pending", do a

# cat /sys/class/gnbd/gnbd<minor_nr>/waittime

Run the command a couple of times. This is the time since the server last
fulfilled an outstanding request. If there are no outstanding requests, it
will be -1. If the value keeps getting larger, then there is pending IO to
the server.

Run

# gnbd_export -L

on the server machine. You should see a process for each exported device
for each client.
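
If you'd rather not eyeball the waittime readings by hand, the "run it a
couple of times" check above could be scripted roughly like this. It is
only a sketch: the function names and the three labels it prints are made
up for illustration, and it assumes each read of the waittime file yields
a single integer.

```shell
#!/bin/sh
# Sketch only: helper names and output labels are hypothetical, not part
# of gnbd. Assumes each read of the waittime file gives one integer.

# sample_waittime FILE COUNT INTERVAL
# Print the contents of FILE COUNT times, sleeping INTERVAL seconds
# between reads.
sample_waittime() {
    file=$1
    count=$2
    interval=$3
    i=0
    while [ "$i" -lt "$count" ]; do
        cat "$file" || return 1
        i=$((i + 1))
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}

# classify_waittime SAMPLE...
# Print "idle" if every sample is -1 (no outstanding requests),
# "growing" if the samples never decrease and the last one is larger
# than the first (pending IO to the server), "not-growing" otherwise.
classify_waittime() {
    if [ "$#" -eq 0 ]; then
        echo "usage: classify_waittime SAMPLE..." >&2
        return 1
    fi
    first=$1
    prev=$1
    all_idle=1
    growing=1
    for s in "$@"; do
        [ "$s" -eq -1 ] || all_idle=0
        [ "$s" -ge "$prev" ] || growing=0
        prev=$s
    done
    if [ "$all_idle" -eq 1 ]; then
        echo idle
    elif [ "$growing" -eq 1 ] && [ "$prev" -gt "$first" ]; then
        echo growing
    else
        echo not-growing
    fi
}

# Example (gnbd0 is just an illustrative minor number):
#   classify_waittime $(sample_waittime /sys/class/gnbd/gnbd0/waittime 3 2)
```

A result of "growing" corresponds to the pending-IO case described above;
"not-growing" only means the readings did not climb steadily during the
sampling window, so it is worth rerunning before drawing conclusions.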

If there is pending IO to the server, a stack trace of the server process
will show where it's stuck.

The other place GNBD could be stuck is waiting on some internal lock. A
stack should point that out.

-Ben

> Dave
>
> --
> Linux-cluster@xxxxxxxxxx
> https://www.redhat.com/mailman/listinfo/linux-cluster