On Wed, Jul 09, 2008 at 09:51:02AM +0100, Christine Caulfield wrote:
> Steven Whitehouse wrote:
>> Hi,
>>
>> On Tue, 2008-07-08 at 18:15 -0400, J. Bruce Fields wrote:
>>> On Mon, Jul 07, 2008 at 02:49:28PM -0400, bfields wrote:
>>>> On Mon, Jul 07, 2008 at 10:48:28AM -0500, David Teigland wrote:
>>>>> On Sun, Jul 06, 2008 at 05:51:05PM -0400, J. Bruce Fields wrote:
>>>>>> - write(control_fd, in, sizeof(struct gdlm_plock_info));
>>>>>> + write(control_fd, in, sizeof(struct dlm_plock_info));
>>>>> Gah, sorry, I keep fixing that and it keeps reappearing.
>>>>>
>>>>>
>>>>>> Jul 1 14:06:42 piglet2 kernel: dlm: connect from non cluster node
>>>>>> It looks like dlm_new_workspace() is waiting on dlm_recoverd, which is
>>>>>> in "D" state in dlm_rcom_status(), so I guess the second node isn't
>>>>>> getting some dlm reply it expects?
>>>>> dlm inter-node communication is not working here for some reason. There
>>>>> must be something unusual with the way the network is configured on the
>>>>> nodes, and/or a problem with the way the cluster code is applying the
>>>>> network config to the dlm.
>>>>>
>>>>> Ah, I just remembered what this sounds like; we see this kind of thing
>>>>> when a network interface has multiple IP addresses, and/or routing is
>>>>> configured strangely. Others cc'ed could offer better details on exactly
>>>>> what to look for.
>>>> OK, thanks! I'm trying to run gfs2 on 4 kvm machines, I'm an expert on
>>>> neither, and it's entirely likely there's some obvious misconfiguration.
>>>> On the kvm host there are 4 virtual interfaces bridged together:
>>> I ran wireshark on vnet0 while doing the second mount; what I saw was
>>> the second machine opened a tcp connection to port 21064 on the first
>>> (which had already completed the mount), and sent it a single message
>>> identified by wireshark as "DLM3" protocol, type recovery command:
>>> status command. It got back an ACK then a RST.
>>>
>>> Then the same happened in the other direction, with the first machine
>>> sending a similar message to port 21064 on the second, which then reset
>>> the connection.
>>>
>
> That's a symptom of the "connect from non-cluster node" error in the
> DLM.

I think I am getting a message to that effect in my logs.

> It's got a connection from an IP address that is not known to cman.
> So it closes it as a spoofer

OK. Is there an easy way to see the list of ip addresses known to cman?

> You'll need to check the routing of the interfaces. The most common
> cause of this sort of error is having two interfaces on the same
> physical (or internal) network.

Thanks, that's helpful.

--b.
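
For anyone hitting the same "connect from non cluster node" message, the two checks suggested above (is the peer address one the cluster knows about, and do two interfaces sit on the same network) can be roughed out offline. This is only a minimal sketch, not how the DLM or cman performs the check: the function names, interface names, and addresses are all hypothetical, and the address lists have to be filled in by hand from each node's configuration.

```python
import ipaddress
from itertools import combinations


def overlapping_interfaces(ifaces):
    """Return sorted pairs of interface names whose IP networks overlap.

    `ifaces` maps an interface name to an address in CIDR form,
    e.g. {"eth0": "192.168.122.10/24"} (hypothetical values).
    """
    nets = {name: ipaddress.ip_interface(cidr).network
            for name, cidr in ifaces.items()}
    pairs = []
    for a, b in combinations(sorted(nets), 2):
        # Only compare networks of the same IP version.
        if nets[a].version == nets[b].version and nets[a].overlaps(nets[b]):
            pairs.append((a, b))
    return pairs


def is_known_node(peer, known_addrs):
    """Return True if `peer` is one of the addresses listed for the cluster.

    `known_addrs` is a hand-maintained list here; this sketch does not
    query cman.
    """
    known = {ipaddress.ip_address(a) for a in known_addrs}
    return ipaddress.ip_address(peer) in known


if __name__ == "__main__":
    # Hypothetical node with two interfaces on the same /24.
    node = {
        "eth0": "192.168.122.10/24",
        "eth1": "192.168.122.11/24",
        "eth2": "10.0.0.5/8",
    }
    print(overlapping_interfaces(node))
    print(is_known_node("192.168.122.11", ["192.168.122.10"]))
```

With the hypothetical values above, eth0 and eth1 are reported as overlapping, and a connection sourced from 192.168.122.11 would not match the single address listed for that node, which is the kind of mismatch described in the quoted replies.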