Moreno Baricevic wrote:
>
> Hello,
>
> we are trying to install GFS (cluster-1.02 on vanilla 2.6.16.16) on a
> CentOS cluster of 70 "diskless" nodes.
>
> The structure is something like this:
>
> +---+   GNBD-SERVERS           GNBD CLIENTS
> |   |-----[node63]-----[node64 node65 node66 node67 node68 node69]
> | S |.....
> | A |.....
> | N |-----[node07]-----[node08 node09 node10 node11 node12 node13]
> |   |-----[node00]-----[node01 node02 node03 node04 node05 node06]
> +---+
>
> All the nodes have a gigabit NIC and all the nodes see each other.
> Only the gnbd-servers have a fiber adapter to connect to the SAN.
>
> Everything works fine as far as we test on 33 nodes: 9 nodes with the
> fiber adapter (acting as both GFS nodes and gnbd-servers) and 24 gnbd
> clients (connected to 4 of the gnbd-servers). "Fine" means that we have
> been able to mount and use the GFS filesystem.
>
> When we try to start cman on 39 nodes (or worse, when we try with 63
> nodes), roughly half of the nodes soon get this:
>
> "kernel panic - not syncing: membership stopped responding"
>
> We tried to increase CMAN_CLUSTER_TIMEOUT and CMAN_QUORUM_TIMEOUT
> (/etc/init.d/cman), but the problem persists.
>
> We tried to boot the nodes 10 at a time, with a 2-minute delay between
> groups. As soon as we reach quorum (or one of the timeouts?) the
> nodes start collapsing due to "Inconsistent cluster view", "Shutdown",
> "No response to messages".
>
> We also tried the patch supplied as the solution for bug report 187777,
> but nothing changes.
>
> Is there a limit on the number of nodes, a timeout, or any other issue
> that we didn't consider?

To be honest, to my knowledge cman has never been tested beyond 32 nodes.
For large clusters you may well be better off using gulm, at least in the
short term.
> Here you can find the cluster.conf, logs from surviving and dead nodes,
> tcpdump for UDP:6809, and the nodes' /proc/cluster/{status,nodes,services}:
>
> http://www.democritos.it/~baro/gfs-test/
>
> There's a lot of stuff, let me know if you need something more specific.

I'll have a look through those logs, but it may take me some time!

Thanks,

Patrick

--

patrick

--
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster