What does this error mean?

Late update to this thread, but just so you don't go down the wrong road on this: it was almost definitely not an iptables problem. Iptables is never turned on here, and even if it were, absolutely no custom rules would have been running.

James Burnash, Unix Engineering

-----Original Message-----
From: Pranith Kumar. Karampuri [mailto:pranithk at gluster.com] 
Sent: Monday, March 21, 2011 10:37 PM
To: Mohit Anchlia
Cc: Burnash, James; gluster-users at gluster.org
Subject: Re: What does this error mean?

Hi,
    Whenever a peer goes down, all the other machines in the cluster keep trying to reconnect to it, and when the peer comes back up the reconnection will succeed. The only times we have seen problems are a change in IP address and an issue with iptables. We will have to investigate what might have happened. Since the restart fixed the problem, it is not a change in IP address. We shall try reproducing it with an iptables issue.

Pranith.
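
One way to tell a network-level failure (such as the "No route to host" in the original log) from a glusterd-level one is to probe the management port directly from each node. What follows is only a minimal sketch, assuming a reasonably recent Python 3 is available on the servers. Port 24007 is taken from the log line in this thread; the function name can_connect and the 3-second timeout are illustrative choices, not anything GlusterFS provides.

```python
import socket


def can_connect(host, port=24007, timeout=3.0):
    """Return (True, "") if a TCP connection succeeds, else (False, reason)."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True, ""
    except OSError as exc:
        # Covers refused connections, timeouts, "No route to host",
        # and name-resolution failures.
        return False, str(exc)
```

Calling can_connect("jc1letgfs6") from jc1letgfs5, and the reverse from jc1letgfs6, would show whether the port is reachable in both directions. A success here while "gluster peer status" still reports "Disconnected" would point away from the network and toward glusterd itself.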

----- Original Message -----
From: "Mohit Anchlia" <mohitanchlia at gmail.com>
To: "James Burnash" <jburnash at knight.com>, gluster-users at gluster.org
Sent: Tuesday, March 22, 2011 12:54:52 AM
Subject: Re: What does this error mean?

I also think there might be a bug where gluster continues to use a bad
socket instead of trying to re-establish the connection. Not sure why
that is, or how that works when one machine fails and comes back up.
Can someone from the gluster developer team look at this and provide
some insight?
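
Since each node in this thread reported a different view of the cluster, it can help to collect "gluster peer status" from every node and flag anything that is not "Peer in Cluster (Connected)". The following is a rough sketch, assuming the output format shown in the listings quoted below; the function names are hypothetical, and the sample text is the output posted from jc1letgfs8.

```python
def parse_peer_status(text):
    """Parse `gluster peer status` output into a list of dicts."""
    peers = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Hostname:"):
            # Each "Hostname:" line starts a new peer record.
            current = {"hostname": line.split(":", 1)[1].strip()}
            peers.append(current)
        elif current is not None and line.startswith("Uuid:"):
            current["uuid"] = line.split(":", 1)[1].strip()
        elif current is not None and line.startswith("State:"):
            current["state"] = line.split(":", 1)[1].strip()
    return peers


def unhealthy(peers):
    """Return peers whose state is not 'Peer in Cluster (Connected)'."""
    return [p for p in peers if p.get("state") != "Peer in Cluster (Connected)"]


# Output as posted from jc1letgfs8 in this thread.
SAMPLE = """\
Number of Peers: 3

Hostname: jc1letgfs6
Uuid: cd590fad-022c-4b9a-97f5-3262080d772d
State: Peer Rejected (Connected)

Hostname: jc1letgfs7
Uuid: c5f40de4-9bb1-47ad-93b6-d52c6689ee29
State: Peer in Cluster (Connected)

Hostname: 10.20.72.156
Uuid: 95e1d79a-632a-4774-9d7e-a7234cb084ca
State: Peer in Cluster (Connected)
"""

for peer in unhealthy(parse_peer_status(SAMPLE)):
    print(peer["hostname"], "->", peer["state"])
```

Run against the sample, this prints only the jc1letgfs6 entry with its "Peer Rejected (Connected)" state. In practice the text would come from running the command on each node rather than from a pasted string.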

On Mon, Mar 21, 2011 at 12:20 PM, Mohit Anchlia <mohitanchlia at gmail.com> wrote:
> Is node 5 still showing "Disconnected" for node 6?
>
> On Mon, Mar 21, 2011 at 12:08 PM, Burnash, James <jburnash at knight.com> wrote:
>> After 'service glusterd restart' on node 6:
>>
>> root at jc1letgfs5:/etc/glusterd/vols# gluster peer status
>> Number of Peers: 3
>>
>> Hostname: jc1letgfs6
>> Uuid: cd590fad-022c-4b9a-97f5-3262080d772d
>> State: Peer in Cluster (Disconnected)
>>
>> Hostname: jc1letgfs7
>> Uuid: c5f40de4-9bb1-47ad-93b6-d52c6689ee29
>> State: Peer in Cluster (Connected)
>>
>> Hostname: jc1letgfs8
>> Uuid: 13f4ce3f-042e-4144-a76c-d2b1b91676bd
>> State: Peer in Cluster (Connected)
>>
>> BUT ... after 'service glusterd restart' on node 5:
>>
>> root at jc1letgfs5:/etc/glusterd/vols# gluster peer status
>> Number of Peers: 3
>>
>> Hostname: jc1letgfs7
>> Uuid: c5f40de4-9bb1-47ad-93b6-d52c6689ee29
>> State: Peer in Cluster (Connected)
>>
>> Hostname: jc1letgfs8
>> Uuid: 13f4ce3f-042e-4144-a76c-d2b1b91676bd
>> State: Peer in Cluster (Connected)
>>
>> Hostname: jc1letgfs6
>> Uuid: cd590fad-022c-4b9a-97f5-3262080d772d
>> State: Peer in Cluster (Connected)
>>
>> Works now. Thanks so much. I suspect a race condition of some sort, though exactly what I'll leave up to the devs.
>>
>> -----Original Message-----
>> From: Mohit Anchlia [mailto:mohitanchlia at gmail.com]
>> Sent: Monday, March 21, 2011 2:57 PM
>> To: Burnash, James; gluster-users at gluster.org
>> Subject: Re: What does this error mean?
>>
>> At this point can you do /etc/init.d/gluster stop and then start and
>> see if this changes anything? Or do you see same behaviour? I am
>> thinking gluster might have tried to start too soon on reboot.
>>
>> On Mon, Mar 21, 2011 at 11:43 AM, Burnash, James <jburnash at knight.com> wrote:
>>> Short answers - yes all on the same subnet.
>>> Every host can ping the others
>>> Iptables shows empty entries for all filters
>>>
>>> Details are here - http://pastebin.com/eKtRMbGE
>>>
>>> I did explicitly turn the iptables off again, and then checked again:
>>>
>>> jc1letgfs5
>>> Firewall is stopped.
>>>
>>> jc1letgfs6
>>> Firewall is stopped.
>>>
>>> jc1letgfs7
>>> Firewall is stopped.
>>>
>>> jc1letgfs8
>>> Firewall is stopped.
>>>
>>> Thanks,
>>>
>>> James
>>>
>>> -----Original Message-----
>>> From: Mohit Anchlia [mailto:mohitanchlia at gmail.com]
>>> Sent: Monday, March 21, 2011 2:25 PM
>>> To: Burnash, James
>>> Cc: gluster-users at gluster.org
>>> Subject: Re: What does this error mean?
>>>
>>> Are they in the same subnet? What happens if you ping these hosts
>>> individually? Do they ping?
>>>
>>> I looked closely at the error you posted, and "connection to
>>> 10.20.72.157:24007 failed (No route to host)" points to either a
>>> firewall issue or a switch issue on the network. A ping test from each
>>> host to each of the others would be helpful.
>>>
>>> Can you post the results of ping and also "service iptables status" from each node?
>>>
>>> On Mon, Mar 21, 2011 at 11:16 AM, Burnash, James <jburnash at knight.com> wrote:
>>>> A little more information:
>>>>
>>>> From the original (first peer node):
>>>> root at jc1letgfs5:/etc/glusterd/vols# gluster peer status
>>>> Number of Peers: 3
>>>>
>>>> Hostname: jc1letgfs6
>>>> Uuid: cd590fad-022c-4b9a-97f5-3262080d772d
>>>> State: Peer in Cluster (Disconnected)
>>>>
>>>> Hostname: jc1letgfs7
>>>> Uuid: c5f40de4-9bb1-47ad-93b6-d52c6689ee29
>>>> State: Peer in Cluster (Connected)
>>>>
>>>> Hostname: jc1letgfs8
>>>> Uuid: 13f4ce3f-042e-4144-a76c-d2b1b91676bd
>>>> State: Peer in Cluster (Connected)
>>>>
>>>>
>>>> From the problem node:
>>>> *** NOTE - only one Peer seen
>>>> root at jc1letgfs6:~# gluster peer status
>>>> Number of Peers: 1
>>>>
>>>> Hostname: 10.20.72.156
>>>> Uuid: 95e1d79a-632a-4774-9d7e-a7234cb084ca
>>>> State: Peer in Cluster (Connected)
>>>>
>>>>
>>>> From a different peer node:
>>>> root at jc1letgfs8:~# gluster peer status
>>>> Number of Peers: 3
>>>>
>>>> Hostname: jc1letgfs6
>>>> Uuid: cd590fad-022c-4b9a-97f5-3262080d772d
>>>> State: Peer Rejected (Connected)
>>>>
>>>> Hostname: jc1letgfs7
>>>> Uuid: c5f40de4-9bb1-47ad-93b6-d52c6689ee29
>>>> State: Peer in Cluster (Connected)
>>>>
>>>> Hostname: 10.20.72.156
>>>> Uuid: 95e1d79a-632a-4774-9d7e-a7234cb084ca
>>>> State: Peer in Cluster (Connected)
>>>>
>>>> -----Original Message-----
>>>> From: gluster-users-bounces at gluster.org [mailto:gluster-users-bounces at gluster.org] On Behalf Of Burnash, James
>>>> Sent: Monday, March 21, 2011 2:05 PM
>>>> To: Mohit Anchlia
>>>> Cc: gluster-users at gluster.org
>>>> Subject: Re: What does this error mean?
>>>>
>>>> I did do this, and nothing in particular stands out.
>>>>
>>>> I'll exercise it some more, and see if we can get something that will at least point in the proper direction.
>>>>
>>>> I suspect that another reboot of the affected machine will fix this condition - but it won't help me understand the root problem the next time this happens.
>>>>
>>>> Thanks,
>>>>
>>>> James
>>>>
>>>> -----Original Message-----
>>>> From: Mohit Anchlia [mailto:mohitanchlia at gmail.com]
>>>> Sent: Monday, March 21, 2011 12:40 PM
>>>> To: Burnash, James
>>>> Cc: gluster-users at gluster.org
>>>> Subject: Re: What does this error mean?
>>>>
>>>> Can you turn on DEBUG and see if there is something that stands out?
>>>>
>>>> On Mon, Mar 21, 2011 at 9:34 AM, Burnash, James <jburnash at knight.com> wrote:
>>>>> Does anybody have any clue as to why this is happening? The problem has persisted for several days now, but I can't find anything at all in the logs to possibly explain why this is so.
>>>>>
>>>>> -----Original Message-----
>>>>> From: gluster-users-bounces at gluster.org [mailto:gluster-users-bounces at gluster.org] On Behalf Of Burnash, James
>>>>> Sent: Wednesday, March 16, 2011 9:10 AM
>>>>> To: gluster-users at gluster.org
>>>>> Subject: [SPAM?] What does this error mean?
>>>>> Importance: Low
>>>>>
>>>>> Hello.
>>>>>
>>>>> After purposely crashing node jc1letgfs6 (via 'echo b > /proc/sysrq-trigger') to test mirroring, I am still seeing "Disconnected" for that node, even after it has rebooted and is back online, when I execute the following command on the first storage node:
>>>>>
>>>>> root at jc1letgfs5:/etc/glusterd/vols# gluster peer status
>>>>> Number of Peers: 3
>>>>>
>>>>> Hostname: jc1letgfs6
>>>>> Uuid: cd590fad-022c-4b9a-97f5-3262080d772d
>>>>> State: Peer in Cluster (Disconnected)
>>>>>
>>>>> Hostname: jc1letgfs7
>>>>> Uuid: c5f40de4-9bb1-47ad-93b6-d52c6689ee29
>>>>> State: Peer in Cluster (Disconnected)
>>>>>
>>>>> Hostname: jc1letgfs8
>>>>> Uuid: 13f4ce3f-042e-4144-a76c-d2b1b91676bd
>>>>> State: Peer in Cluster (Connected)
>>>>>
>>>>> This is running on 4 servers with CentOS 5.5 (x86_64), GlusterFS 3.1.1
>>>>>
>>>>> Here is the volume info:
>>>>>
>>>>> # gluster volume info
>>>>>
>>>>> Volume Name: test-pfs-ro1
>>>>> Type: Distributed-Replicate
>>>>> Status: Started
>>>>> Number of Bricks: 4 x 2 = 8
>>>>> Transport-type: tcp
>>>>> Bricks:
>>>>> Brick1: jc1letgfs5:/export/read-only/g01
>>>>> Brick2: jc1letgfs6:/export/read-only/g01
>>>>> Brick3: jc1letgfs5:/export/read-only/g02
>>>>> Brick4: jc1letgfs6:/export/read-only/g02
>>>>> Brick5: jc1letgfs7:/export/read-only/g01
>>>>> Brick6: jc1letgfs8:/export/read-only/g01
>>>>> Brick7: jc1letgfs7:/export/read-only/g02
>>>>> Brick8: jc1letgfs8:/export/read-only/g02
>>>>> Options Reconfigured:
>>>>> performance.stat-prefetch: on
>>>>> performance.cache-size: 2GB
>>>>> network.ping-timeout: 10
>>>>>
>>>>> Even with this error, mirroring functions as expected, and the node is recognized and utilized, as can be seen in this log fragment from jc1letgfs5: /var/log/glusterfs/etc-glusterfs-glusterd.vol.log
>>>>>
>>>>> [2011-03-13 23:51:31.458329] E [socket.c:1656:socket_connect_finish] management: connection to 10.20.72.157:24007 failed (No route to host)
>>>>> [2011-03-13 23:53:49.42170] I [glusterd3_1-mops.c:172:glusterd3_1_friend_add_cbk] glusterd: Received ACC from uuid: cd590fad-022c-4b9a-97f5-3262080d772d, host: jc1letgfs6, port: 0
>>>>> [2011-03-13 23:53:49.42204] I [glusterd-utils.c:2062:glusterd_friend_find_by_uuid] glusterd: Friend found.. state: Peer in Cluster
>>>>> [2011-03-13 23:53:49.42320] I [glusterd-utils.c:2062:glusterd_friend_find_by_uuid] glusterd: Friend found.. state: Peer in Cluster
>>>>> [2011-03-13 23:53:49.42336] I [glusterd-handler.c:2267:glusterd_handle_friend_update] glusterd: Received friend update from uuid: cd590fad-022c-4b9a-97f5-3262080d772d
>>>>> [2011-03-13 23:53:49.42359] I [glusterd-handler.c:2312:glusterd_handle_friend_update] : Received uuid: 95e1d79a-632a-4774-9d7e-a7234cb084ca, hostname:10.20.72.156
>>>>> [2011-03-13 23:53:49.42412] I [glusterd-handler.c:2315:glusterd_handle_friend_update] : Received my uuid as Friend
>>>>>
>>>>>
>>>>> Any pointers or help would be appreciated.
>>>>>
>>>>> James Burnash, Unix Engineering
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Gluster-users mailing list
>>>>> Gluster-users at gluster.org
>>>>> http://gluster.org/cgi-bin/mailman/listinfo/gluster-users