I have a two-node setup on a dual-port SCSI SAN. Note this is just for test purposes. Part of the SAN is a GFS filesystem shared between the two nodes.
When we fetch content to the GFS filesystem via an rsync pull (well, several rsync pulls) on node 1, it runs for a while and then node 1 hard-locks: nothing on the console, the network dies, the console dies, it's frozen solid. Node 2 notices this and marks node 1 down (/proc/cluster/nodes shows an "X" for node 1 under "Sts"), so the cluster behaviour is OK. If I run "fence_ack_manual -n node1" on node 2, it carries on happily, and I can then reboot node 1 and everything returns to normal.
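For reference, the pulls look roughly like this (the hostname and paths below are made up, and several of these run concurrently on node 1):

    # one of several concurrent pulls into the shared GFS mount
    rsync -a --partial sourcehost:/export/content/ /mnt/gfs/content/

And this is roughly what node 2 reports once node 1 freezes (column layout quoted from memory, so treat it as approximate):

    # on node 2, after node 1 hard-locks
    $ cat /proc/cluster/nodes
    Node  Votes Exp Sts  Name
       1     1    2   X  node1
       2     1    2   M  node2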
The problem is: why is node 1 dying like this? It's important that we get this sorted out, as we have a LOT of data to synchronize (rsync is just the test case--we'll probably use a different scheme on deployment), and I suspect it's the heavy write activity on that node that's causing the crash.
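One quick way I can think of to test the heavy-write theory is to throttle the pulls and see whether node 1 survives longer (the limit below is an arbitrary example; rsync's --bwlimit is in KB/s):

    # cap each pull at roughly 5 MB/s to reduce write pressure on GFS
    rsync -a --partial --bwlimit=5000 sourcehost:/export/content/ /mnt/gfs/content/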
Oh, both nodes have the GFS filesystem mounted with "-o rw,noatime".
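In other words, the equivalent of this on each node (device and mountpoint names are hypothetical):

    # GFS uses filesystem type "gfs"; same options on both nodes
    mount -t gfs -o rw,noatime /dev/vg0/gfslv /mnt/gfs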
Any ideas would be GREATLY appreciated!

----------------------------------------------------------------------
- Rick Stevens, Senior Systems Engineer     rstevens@xxxxxxxxxxxxxxx -
- VitalStream, Inc.                       http://www.vitalstream.com -
-                                                                    -
-    Do you know how to save five drowning lawyers?  No?  GOOD!      -
----------------------------------------------------------------------