df causes hang

phil at cryer.us (phil cryer) · Thu, 3 Feb 2011 15:08:28 -0600

This wasn't originally my issue, but I'm still having it. Today I purged
glusterfs 3.1.1 and installed 3.1.2 fresh from the deb. I recreated my
volume and started it, and everything was going fine. I mounted the share, then
ran df -h to check it, and now every few seconds my logs post this:

==> /var/log/glusterfs/nfs.log <==
[2011-02-03 15:55:57.145626] E [client-handshake.c:1079:client_query_portmap_cbk] bhl-volume-client-98: failed to get the port number for remote subvolume
[2011-02-03 15:55:57.145694] I [client.c:1590:client_rpc_notify] bhl-volume-client-98: disconnected

==> /var/log/glusterfs/mnt-glusterfs.log <==
[2011-02-03 15:55:57.605802] E [common-utils.c:124:gf_resolve_ip6] resolver: getaddrinfo failed (Name or service not known)
[2011-02-03 15:55:57.605834] E [name.c:251:af_inet_client_get_remote_sockaddr] glusterfs: DNS resolution failed on host /etc/glusterfs/glusterfs.vol

over and over. Any clues as to how I can fix this? This one issue has
made our entire 100TB store unusable.
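
One detail in the second log is that the string handed to the resolver is a file path, not a hostname. The Python sketch below pulls the "host" out of such messages and flags any that contain a slash. The regular expression is an assumption based only on the sample line above, and the helper names (failed_hosts, looks_like_path) are made up for illustration. It does not diagnose the cause; it only makes it easier to check whether a volfile path has ended up somewhere a server name was expected.

```python
import re

# Pattern assumed from the sample log line in this message; other
# GlusterFS versions may word the error differently.
DNS_FAIL = re.compile(r"DNS\s+resolution\s+failed\s+on\s+host\s+(\S+)")


def failed_hosts(log_text):
    """Return the distinct 'host' strings from DNS-failure messages,
    in order of first appearance."""
    # Log lines may be wrapped (as they are in email), so collapse
    # all whitespace to single spaces before matching.
    flat = " ".join(log_text.split())
    seen = []
    for host in DNS_FAIL.findall(flat):
        if host not in seen:
            seen.append(host)
    return seen


def looks_like_path(host):
    """Heuristic: a valid hostname never contains a slash."""
    return "/" in host


sample = """
[2011-02-03 15:55:57.605834] E
[name.c:251:af_inet_client_get_remote_sockaddr] glusterfs: DNS
resolution failed on host /etc/glusterfs/glusterfs.vol
"""

for host in failed_hosts(sample):
    kind = "file path, not a hostname" if looks_like_path(host) else "hostname"
    print(f"{host}: {kind}")
```

To run it against a real log, read the file contents and pass them to failed_hosts() in place of the sample string.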

And again, gluster volume info shows all the bricks are OK, including 98:

gluster> volume info

Volume Name: bhl-volume
Type: Distributed-Replicate
Status: Started
Number of Bricks: 72 x 2 = 144
Transport-type: tcp
Bricks:
[...]
Brick92: clustr-02:/mnt/data16
Brick93: clustr-03:/mnt/data16
Brick94: clustr-04:/mnt/data16
Brick95: clustr-05:/mnt/data16
Brick96: clustr-06:/mnt/data16
Brick97: clustr-01:/mnt/data17
Brick98: clustr-02:/mnt/data17
Brick99: clustr-03:/mnt/data17
Brick100: clustr-04:/mnt/data17
Brick101: clustr-05:/mnt/data17
Brick102: clustr-06:/mnt/data17
Brick103: clustr-01:/mnt/data18
Brick104: clustr-02:/mnt/data18
Brick105: clustr-03:/mnt/data18
[...]
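
A note on matching the log name bhl-volume-client-98 to a brick: the sketch below assumes (this is not confirmed anywhere in this thread) that client translator names are numbered from 0 in the same order as the brick list, and that replica sets are built from consecutive bricks in that order. Under those assumptions client-98 would be Brick99 (clustr-03:/mnt/data17), paired with Brick100 (clustr-04:/mnt/data17), rather than Brick98. The function names are hypothetical; pass zero_based=False to see the other reading.

```python
REPLICA = 2          # from "Number of Bricks: 72 x 2 = 144"
TOTAL_BRICKS = 144


def brick_for_client(client_index, zero_based=True):
    """Map a client translator index (the N in volname-client-N) to the
    1-based Brick number shown by 'volume info'.

    zero_based=True is an assumption about how client names are numbered.
    """
    brick = client_index + 1 if zero_based else client_index
    if not 1 <= brick <= TOTAL_BRICKS:
        raise ValueError(f"brick {brick} is outside 1..{TOTAL_BRICKS}")
    return brick


def replica_set(brick, replica=REPLICA):
    """Return the 1-based brick numbers sharing a replica set with `brick`,
    assuming sets are formed from consecutive bricks in listed order."""
    first = ((brick - 1) // replica) * replica + 1
    return list(range(first, first + replica))


brick = brick_for_client(98)
print(f"client-98 -> Brick{brick}, replica set {replica_set(brick)}")
```

Either way, the bricks worth checking by hand are the ones on /mnt/data17, since both readings land there.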

P

On Mon, Jan 31, 2011 at 4:26 PM, Anand Avati <anand.avati at gmail.com> wrote:
> Can you post your server logs? What happens if you run 'df -k' on your
> backend export filesystems?
>
> Thanks
> Avati
>
> On Mon, Jan 17, 2011 at 5:27 AM, Joe Warren-Meeks
> <joe at encoretickets.co.uk> wrote:
>
>>
>> (Sorry about top-posting.)
>>
>> Just changing the timeout would only mask the problem. The real issue is
>> that running 'df' on either node causes a hang.
>>
>> All other operations seem fine, files can be created and deleted as
>> normal with the results showing up on both.
>>
>> I'd like to work out why it's hanging on df so I can fix it and get my
>> monitoring and cron scripts running again :)
>>
>> -- joe.
>>
>> -----Original Message-----
>> From: gluster-users-bounces at gluster.org
>> [mailto:gluster-users-bounces at gluster.org] On Behalf Of Daniel Maher
>> Sent: 17 January 2011 12:48
>> To: gluster-users at gluster.org
>> Subject: Re: df causes hang
>>
>> On 01/17/2011 10:47 AM, Joe Warren-Meeks wrote:
>> > Hey chaps,
>> >
>> > Anyone got any pointers as to what this might be? This is still causing
>> > a lot of problems for us whenever we attempt to do df.
>> >
>> > -- joe.
>> >
>> > -----Original Message-----
>>
>> > However, for some reason, they've got into a bit of a state such that
>> > typing 'df -k' causes both to hang, resulting in a loss of service for 42
>> > seconds. I see the following messages in the log files:
>> >
>> >
>>
>> 42 seconds is the default tcp timeout time for any given node - you
>> could try tuning that down and seeing how it works for you.
>>
>> http://www.gluster.com/community/documentation/index.php/Gluster_3.1:_Setting_Volume_Options
>>
>>
>> --
>> Daniel Maher <dma+gluster AT witbe DOT net>
>> _______________________________________________
>> Gluster-users mailing list
>> Gluster-users at gluster.org
>> http://gluster.org/cgi-bin/mailman/listinfo/gluster-users

-- 
http://philcryer.com