Mathieu Avila wrote:
> Hello GFS team,
>
> I'm trying to run a GFS filesystem on ~32 nodes, but there is a problem
> when I start the daemons. My nodes are called sam38 -> sam70.
>
> - I run ccsd on all nodes (using the init script) at the same time, and
>   it's OK.
> - I run cman on all nodes (using the init script) at the same time, and
>   it's OK; "cman_tool nodes" tells me all nodes have rejoined the cluster.
> - I run fenced on all nodes, one by one every second, and it fails from
>   sam57 to sam70.
>
> From the last one that succeeds (sam56), I see:
>
> [root@sam56 ~]# cat /proc/cluster/services
> Service          Name                              GID LID State     Code
> Fence Domain:    "default"                           1   2 recover 4 -
> [21 19 9 8 7 1 2 3 4 6 11 13 16 23 26 27 28 32 33 25 20 14]
>
> Then, when trying to run fenced on sam57, I get this in /var/log/messages:
>
> Jul 20 18:38:17 sam57 kernel: CMAN: got WAIT barrier not in phase 1
> TRANSITION.44 (2)
>
> Then I run /etc/init.d/fenced stop, and I get:
>
> Jul 20 18:47:44 sam57 fenced[28722]: process_events: service leave failed
> Jul 20 18:47:44 sam57 fenced: shutdown succeeded
>
> When I start it again:
>
> Jul 20 18:47:45 sam57 fenced[28964]: fence_domain_add: service set level
> failed
>
> After this step, I stopped everything on sam38 (fenced/ccsd/cman) to see
> whether getting one node out would let me get another one in, but I got
> this strange message on sam39:
>
> Jul 20 19:14:04 sam39 kernel: CMAN: node sam38 has been removed from the
> cluster : No response to messages
> Jul 20 19:14:12 sam39 kernel: SM: 00000001 process_recovery_barrier
> status=-104
>
> In a previous attempt, running fenced on all nodes at the same time led
> to a global cluster failure (nothing responded any more); this is why I
> tried to run them one by one.
>
> All nodes are 64-bit, and I use the latest "cluster" code from the STABLE
> branch of CVS. My configuration file is standard and works for a small
> number of nodes (tested many times on 5 nodes with no problem), except
> for this:
>
> <fence_daemon post_join_delay="30"></fence_daemon>
>
> Why does the fenced daemon fail to start where cman succeeded? I thought
> it was just a service like any other, built on top of CMAN. Also, what
> are the known limits of the cluster infrastructure in terms of nodes?
>
> Do you have any advice on how to get around this problem? In particular,
> at this point, would it change anything to choose DLM instead of GULM?
> (I don't think so, but I'd rather be sure.)

You do seem to have hit a limit. Personally I've only tested up to 31
nodes, and recently someone posted to this list with a similar problem on
38 nodes - however, when I looked at the logs it actually seemed to fall
over at 32!

There's nothing hard-coded in cman to limit the number of nodes to that
amount, and I can't find anything obvious that should cause it to happen.
In the meantime, all I can suggest is that you use gulm for clusters with
>= 32 nodes.

-- 
patrick
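
To make the gulm suggestion concrete, here is a rough cluster.conf sketch
of the two variants: the CMAN/DLM layout Mathieu is using now, and gulm
lock servers as Patrick suggests. Only the post_join_delay line comes from
the mail above; the cluster name, votes, fence device and its parameters
are invented placeholders, and the gulm section is written from memory, so
check it against the cluster documentation before using it:

  <?xml version="1.0"?>
  <!-- Sketch only: "samcluster", "apc1" and all fence parameters are
       placeholders, not values from Mathieu's real configuration. -->
  <cluster name="samcluster" config_version="1">

    <!-- Current CMAN/DLM setup; the post_join_delay value is the one
         quoted in the mail above. -->
    <cman/>
    <fence_daemon post_join_delay="30"/>

    <!-- gulm variant for >= 32 nodes: remove <cman/> and <fence_daemon/>
         above and declare gulm lock servers instead. -->
    <!--
    <gulm>
      <lockserver name="sam38"/>
      <lockserver name="sam39"/>
      <lockserver name="sam40"/>
    </gulm>
    -->

    <clusternodes>
      <clusternode name="sam38" votes="1">
        <fence>
          <method name="power">
            <device name="apc1" port="38"/>
          </method>
        </fence>
      </clusternode>
      <!-- ... same pattern for sam39 through sam70 ... -->
    </clusternodes>

    <fencedevices>
      <fencedevice name="apc1" agent="fence_apc" ipaddr="10.0.0.1"
                   login="apc" passwd="apc"/>
    </fencedevices>
  </cluster>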