On Mon, 3 Mar 2008, gordan@xxxxxxxxxx wrote:
I have a 2-node cluster with Open Shared Root on GFS on DRBD. A single node
mounts GFS OK and works, but after a while seems to just block for disk.
[...]
This usually happens after a period of idleness. If the node is used, this
doesn't seem to happen, but leaving it alone for half an hour causes it
to block for disk I/O.
I've done a bit more digging, and the processes that hang seem to do so,
as expected, in disk sleep state.
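For reference, this is how I'm finding them (a quick sketch; the column widths are arbitrary, and "wchan" shows the kernel function each task is blocked in):

```shell
# List all tasks in uninterruptible (D) sleep, plus the kernel
# wait channel they are blocked in.
ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/ {print}'
```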
For example, when trying to log in, sshd hangs. Its status (from /proc) is:
Name: sshd
State: D (disk sleep)
SleepAVG: 97%
[...]
The only open file handles it has are:
# ls -la /proc/9643/fd/
total 0
dr-x------ 2 root root 0 Mar 3 16:41 .
dr-xr-xr-x 5 root root 0 Mar 3 16:41 ..
lrwx------ 1 root root 64 Mar 3 16:42 0 -> /dev/null
lrwx------ 1 root root 64 Mar 3 16:42 1 -> /dev/null
lrwx------ 1 root root 64 Mar 3 16:42 2 -> /dev/null
lrwx------ 1 root root 64 Mar 3 16:42 3 -> socket:[118904]
lrwx------ 1 root root 64 Mar 3 16:42 4 -> /cdsl.local/var/run/utmp
I am guessing that it's the utmp file that is blocking things, but I'm not
sure. I can read and write /var/run/utmp just fine (/var/run is
symlinked to /cdsl.local/var/run).
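One way to check the flock theory (a rough sketch: lines in /proc/locks end with the MAJOR:MINOR:INODE of the locked file, so matching on utmp's inode number should show any lock held on it):

```shell
# Get utmp's inode, then look for a lock entry on that inode.
inode=$(stat -c %i /var/run/utmp)
grep ":$inode " /proc/locks || echo "no lock held on utmp (inode $inode)"
```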
The socket is a TCP socket, so I cannot see that being a disk block issue.
As for /dev/null, I didn't think that could be flock-ed...
Looking at cman_tool status and /proc/drbd, both seem to be in order and
saying everything is working.
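If it helps, I can also grab kernel stack traces of the blocked tasks via sysrq next time it happens (needs root; 'w' dumps only blocked/D-state tasks, and the output goes to the kernel log):

```shell
# Dump kernel stacks of all blocked (D-state) tasks to the kernel log,
# then pull the tail of the log to see where they are stuck.
echo w > /proc/sysrq-trigger
dmesg | tail -n 80
```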
Any ideas as to what could be causing these bogus disk-sleep lock-ups?
Gordan
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster