Re: Error "Failed to find host nfs1.lightspeed.ca" when adding a new node to the cluster.

On 2016-04-07 09:16, Atin Mukherjee wrote:
On 07-Apr-2016 9:32 pm, "Ernie Dunbar" <maillist@xxxxxxxxxxxxx> wrote:

On 2016-04-06 21:20, Atin Mukherjee wrote:

On 04/07/2016 04:04 AM, Ernie Dunbar wrote:

On 2016-04-06 11:42, Ernie Dunbar wrote:

I've already successfully created a Gluster cluster, but when I try to
add a new node, gluster on the new node claims it can't find the
hostname of the first node in the cluster.

I've added the hostname nfs1.lightspeed.ca to /etc/hosts like this:

root@nfs3:/home/ernied# cat /etc/hosts
127.0.0.1    localhost
192.168.1.31    nfs1.lightspeed.ca      nfs1
192.168.1.32    nfs2.lightspeed.ca      nfs2
127.0.1.1    nfs3.lightspeed.ca    nfs3


# The following lines are desirable for IPv6 capable hosts
::1     localhost ip6-localhost ip6-loopback
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

I can ping the hostname:

root@nfs3:/home/ernied# ping -c 3 nfs1
PING nfs1.lightspeed.ca (192.168.1.31) 56(84) bytes of data.
64 bytes from nfs1.lightspeed.ca (192.168.1.31): icmp_seq=1 ttl=64 time=0.148 ms
64 bytes from nfs1.lightspeed.ca (192.168.1.31): icmp_seq=2 ttl=64 time=0.126 ms
64 bytes from nfs1.lightspeed.ca (192.168.1.31): icmp_seq=3 ttl=64 time=0.133 ms

--- nfs1.lightspeed.ca ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1998ms
rtt min/avg/max/mdev = 0.126/0.135/0.148/0.016 ms

I can get gluster to probe the hostname:

root@nfs3:/home/ernied# gluster peer probe nfs1
peer probe: success. Host nfs1 port 24007 already in peer list

But if I try to create the brick on the new node, it says that the
host can't be found? Um...

root@nfs3:/home/ernied# gluster volume create gv2 replica 3 \
    nfs1.lightspeed.ca:/brick1/gv2/ nfs2.lightspeed.ca:/brick1/gv2/ \
    nfs3.lightspeed.ca:/brick1/gv2
volume create: gv2: failed: Failed to find host nfs1.lightspeed.ca

Our logs from /var/log/glusterfs/etc-glusterfs-glusterd.vol.log:

[2016-04-06 18:19:18.107459] E [MSGID: 106452] [glusterd-utils.c:5825:glusterd_new_brick_validate] 0-management: Failed to find host nfs1.lightspeed.ca
[2016-04-06 18:19:18.107496] E [MSGID: 106536] [glusterd-volume-ops.c:1364:glusterd_op_stage_create_volume] 0-management: Failed to find host nfs1.lightspeed.ca
[2016-04-06 18:19:18.107516] E [MSGID: 106301] [glusterd-syncop.c:1281:gd_stage_op_phase] 0-management: Staging of operation 'Volume Create' failed on localhost : Failed to find host nfs1.lightspeed.ca
[2016-04-06 18:19:18.231864] E [MSGID: 106170] [glusterd-handshake.c:1051:gd_validate_mgmt_hndsk_req] 0-management: Request from peer 192.168.1.31:65530 has an entry in peerinfo, but uuid does not match

We have introduced a new check that rejects a peer if the request comes
from a node whose hostname matches but whose UUID is different. This can
happen if a node goes through a re-installation and its
/var/lib/glusterd/* content is wiped. Look at [1] for more details.

[1] http://review.gluster.org/13519

Do confirm if that's the case.



I couldn't say if that's *exactly* the case, but it's pretty close.
I don't recall ever removing /var/lib/glusterd/* or any of its
contents, but the operating system isn't exactly the way it was when I
first tried to add this node to the cluster.

What should I do to *fix* the problem though, so I can add this node
to the cluster? This bug report doesn't appear to provide a solution.
I've tried removing the node from the cluster, and that failed too.
Things seem to be in a very screwy state right now.

I should have given the workaround earlier. Find the peer file for
the faulty node in /var/lib/glusterd/peers/ and delete it from all the
nodes except the faulty node. Restart the glusterd instance on all
those nodes. On the faulty node, ensure /var/lib/glusterd/ is empty and
restart glusterd, then peer probe this node from any node in the
existing cluster. You should also bump up the op-version once the
cluster is stable.
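
A minimal sketch of the first step of that workaround, locating the peer file for the faulty node. The function name find_peer_files is hypothetical, and the sketch assumes each file under /var/lib/glusterd/peers/ carries a "hostnameN=<name>" line; check a real peer file before relying on it. It only prints matching paths and deletes nothing.

```shell
#!/bin/sh
# Hypothetical helper: print the peer files that mention a given hostname.
# Assumption: peer files contain lines of the form "hostname1=<name>".
# The directory defaults to /var/lib/glusterd/peers, as named in the thread.
find_peer_files() {
    host="$1"
    peers_dir="${2:-/var/lib/glusterd/peers}"
    for f in "$peers_dir"/*; do
        # Skip the unexpanded glob (empty directory) and non-regular files.
        [ -f "$f" ] || continue
        # Exact string comparison on the value, so dots in the hostname
        # are not treated as regex wildcards.
        if awk -F= -v h="$host" \
            '$1 ~ /^hostname[0-9]+$/ && $2 == h { found = 1 } END { exit !found }' "$f"
        then
            printf '%s\n' "$f"
        fi
    done
}
```

On each healthy node, something like "find_peer_files nfs3" would list the candidate file to review before removing it by hand and restarting glusterd. The name may be recorded as an IP address or a short hostname, so it can take more than one attempt to find the entry.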


This mostly solved the problem, but it seems you were missing one step:

# gluster peer detach <wonky node>

After probing the new node again, I was able to add it to the cluster. Without doing this step, attempting to add the new node to the cluster just resulted in this error message:

volume create: gv0: failed: Host 192.168.1.33 is not in 'Peer in Cluster' state
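
One way to spot this condition before running volume create is to scan the peer list for any node that is not in the 'Peer in Cluster' state. The sketch below is a hypothetical filter, peers_not_in_cluster, that reads text on stdin; it assumes the "Hostname:" / "State:" layout that gluster peer status usually prints, which may differ between versions.

```shell
#!/bin/sh
# Hypothetical filter: read `gluster peer status`-style text on stdin and
# print the hostname of every peer whose State line does not start with
# "Peer in Cluster".
# Assumption: each peer is reported as a "Hostname: <name>" line followed
# later by a "State: <state>" line.
peers_not_in_cluster() {
    awk '
        /^Hostname:/ { host = $2 }
        /^State:/ {
            if ($0 !~ /^State: Peer in Cluster/) print host
        }
    '
}
```

Usage would be along the lines of "gluster peer status | peers_not_in_cluster"; empty output suggests every listed peer is in the expected state.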





[2016-04-06 18:19:18.231919] E [MSGID: 106170] [glusterd-handshake.c:1060:gd_validate_mgmt_hndsk_req] 0-management: Rejecting management handshake request from unknown peer 192.168.1.31:65530

That error about the entry in peerinfo doesn't match anything in
Google besides the source code for Gluster. My guess is that my
earlier unsuccessful attempts to add this node before v3.7.10 have
created a conflict that needs to be cleared.



More interesting is what happens when I try to add the third server to
the brick from the first gluster server:

root@nfs1:/home/ernied# gluster volume add-brick gv2 replica 3 nfs3:/brick1/gv2
volume add-brick: failed: One or more nodes do not support the required op-version. Cluster op-version must atleast be 30600.

Yet, when I view the operating version in
/var/lib/glusterd/glusterd.info:

root@nfs1:/home/ernied# cat /var/lib/glusterd/glusterd.info
UUID=1207917a-23bc-4bae-8238-cd691b7082c7
operating-version=30501

root@nfs2:/home/ernied# cat /var/lib/glusterd/glusterd.info
UUID=e394fcec-41da-482a-9b30-089f717c5c06
operating-version=30501

root@nfs3:/home/ernied# cat /var/lib/glusterd/glusterd.info
UUID=ae191e96-9cd6-4e2b-acae-18f2cc45e6ed
operating-version=30501

I see that the operating version is the same on all nodes!

Here the cluster op-version is pretty old. You need to bump up the
op-version with 'gluster volume set all cluster.op-version 30710'. The
add-brick code path checks that your cluster op-version is at least
30600 if you are running gluster version >= 3.6, which is the case
here.
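
As a rough illustration of that check, the sketch below reads the operating-version line from a glusterd.info file and compares it with a required minimum. The function names read_op_version and op_version_at_least are hypothetical; the file path and the 30600 threshold come from the messages above.

```shell
#!/bin/sh
# Hypothetical helper: print the value of the operating-version= line.
# The path defaults to the glusterd.info location shown in the thread.
read_op_version() {
    info_file="${1:-/var/lib/glusterd/glusterd.info}"
    sed -n 's/^operating-version=//p' "$info_file"
}

# Hypothetical helper: succeed if the current op-version (first argument)
# is numerically greater than or equal to the required one (second argument).
op_version_at_least() {
    current="$1"
    required="$2"
    [ "$current" -ge "$required" ]
}
```

With the values shown earlier, op_version_at_least "$(read_op_version)" 30600 would fail on all three nodes, since 30501 is below 30600. That fits the add-brick error even though the nodes agree with each other.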

_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-users





