Hello,
we are trying to install GFS (cluster-1.02 on vanilla 2.6.16.16) on a
CentOS cluster of 70 "diskless" nodes.
The structure is something like this:
+---+   GNBD-SERVERS   GNBD CLIENTS
|   |-----[node63]-----[node64 node65 node66 node67 node68 node69]
| S |.....
| A |.....
| N |-----[node07]-----[node08 node09 node10 node11 node12 node13]
|   |-----[node00]-----[node01 node02 node03 node04 node05 node06]
+---+
All the nodes have a gigabit NIC and all the nodes see each other.
Only the gnbd-servers have a fiber adapter to connect to the SAN.
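
To make the layout above concrete, here is a minimal Python sketch that
enumerates the server-to-client mapping. It assumes the rows elided in the
diagram follow the same pattern as the three shown (one gnbd-server followed
by six gnbd clients, i.e. groups of 7 across node00..node69). The function
name gnbd_layout and its parameters are hypothetical, for illustration only.

```python
def gnbd_layout(total_nodes=70, group_size=7):
    """Map each gnbd-server to its gnbd clients.

    Assumption: every group of `group_size` consecutive nodes consists of
    one server (the first node) followed by its clients, as in the rows
    of the diagram that are actually shown.
    """
    layout = {}
    for first in range(0, total_nodes, group_size):
        server = f"node{first:02d}"
        last = min(first + group_size, total_nodes)
        layout[server] = [f"node{i:02d}" for i in range(first + 1, last)]
    return layout


if __name__ == "__main__":
    for server, clients in sorted(gnbd_layout().items(), reverse=True):
        print(f"{server} -> {' '.join(clients)}")
```

Under that assumption the 70 nodes split into 10 gnbd-servers and 60 gnbd
clients.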
Everything works fine when we test on 33 nodes: 9 nodes with the
fiber adapter (acting as both GFS nodes and gnbd-servers) and 24 gnbd
clients (connected to 4 of the gnbd-servers). "Fine" means that we have
been able to mount and use the GFS filesystem.
When we try to start cman on 39 nodes (or worse, when we try with 63
nodes), roughly half of the nodes soon get this:
"kernel panic - not syncing: membership stopped responding"
We tried increasing CMAN_CLUSTER_TIMEOUT and CMAN_QUORUM_TIMEOUT
(in /etc/init.d/cman), but the problem persists.
We tried to boot the nodes 10 at a time, with a 2-minute delay between
groups. As soon as we reach quorum (or one of the timeouts?), the nodes
start collapsing with "Inconsistent cluster view", "Shutdown", "No
response to messages".
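
For reference, here is a rough Python sketch of when quorum would be reached
during the staggered boot described above. It assumes one vote per node,
expected votes equal to the total node count, and a simple-majority quorum
(expected_votes // 2 + 1); the actual values depend on cluster.conf. The
helper names are hypothetical.

```python
def quorum(expected_votes):
    # Assumption: simple majority with one vote per node.
    return expected_votes // 2 + 1


def batches_until_quorum(total_nodes, batch_size):
    # Count how many boot batches are needed before the number of
    # started nodes reaches the quorum threshold.
    needed = quorum(total_nodes)
    started = 0
    batches = 0
    while started < needed:
        started += min(batch_size, total_nodes - started)
        batches += 1
    return batches


if __name__ == "__main__":
    delay_minutes = 2  # delay between batches, as in the test above
    for nodes in (33, 39, 63):
        b = batches_until_quorum(nodes, 10)
        print(f"{nodes} nodes: quorum={quorum(nodes)}, "
              f"reached in batch {b} (~{(b - 1) * delay_minutes} min after first batch)")
```

With these assumptions, a 63-node cluster needs 32 votes and reaches quorum
when the fourth batch of 10 comes up, while a 39-node cluster needs 20 votes
and reaches it with the second batch.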
We also tried the patch supplied as the solution for bug report 187777,
but nothing changed.
Is there a limit on the number of nodes, a timeout, or any other issue
that we didn't consider?
Here you can find the cluster.conf, logs from surviving and dead nodes,
a tcpdump of UDP port 6809, and the nodes' /proc/cluster/{status,nodes,services}:
http://www.democritos.it/~baro/gfs-test/
There's a lot of stuff; let me know if you need something more specific.
RTFMs are welcome.
Thanks in advance
Ciao
Moreno
--
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster