df causes hang

phil at cryer.us (phil cryer) · Thu, 3 Feb 2011 21:07:22 -0600

Avati, thanks for your reply. My comments are below.

>> [name.c:251:af_inet_client_get_remote_sockaddr] glusterfs: DNS
>> resolution failed on host /etc/glusterfs/glusterfs.vol

> Please make sure you are able to resolve hostnames as given in volume info
> in all of your servers via 'dig'. The logs clearly show that host resolution
> seems to be failing.

Agreed; however, that doesn't seem to be the issue, because I can dig
the host named clustr-02 (all of the hosts are also defined in my hosts
file, so it doesn't have to look them up), and in fact there are 23
other bricks on that host that are working fine:

# gluster volume info | grep clustr-02
Brick2: clustr-02:/mnt/data01
Brick8: clustr-02:/mnt/data02
Brick14: clustr-02:/mnt/data03
Brick20: clustr-02:/mnt/data04
Brick26: clustr-02:/mnt/data05
Brick32: clustr-02:/mnt/data06
Brick38: clustr-02:/mnt/data07
Brick44: clustr-02:/mnt/data08
Brick50: clustr-02:/mnt/data09
Brick56: clustr-02:/mnt/data10
Brick62: clustr-02:/mnt/data11
Brick68: clustr-02:/mnt/data12
Brick74: clustr-02:/mnt/data13
Brick80: clustr-02:/mnt/data14
Brick86: clustr-02:/mnt/data15
Brick92: clustr-02:/mnt/data16
Brick98: clustr-02:/mnt/data17
Brick104: clustr-02:/mnt/data18
Brick110: clustr-02:/mnt/data19
Brick116: clustr-02:/mnt/data20
Brick122: clustr-02:/mnt/data21
Brick128: clustr-02:/mnt/data22
Brick134: clustr-02:/mnt/data23
Brick140: clustr-02:/mnt/data24
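
A minimal sketch of the check Avati suggests, assuming a Linux server with glibc's `getent` and the `BrickN: host:/path` lines printed by `gluster volume info` above. It resolves names with `getent ahosts`, which goes through getaddrinfo(), the same call that fails in the log (`gf_resolve_ip6 ... getaddrinfo failed`), so it follows /etc/nsswitch.conf (normally /etc/hosts, then DNS). `dig`, by contrast, queries DNS servers directly and ignores /etc/hosts. `brick_hosts` and `check_hosts` are hypothetical helper names, not part of the gluster CLI.

```shell
# Sketch only: assumes glibc's getent and the "BrickN: host:/path" lines
# printed by `gluster volume info`. brick_hosts and check_hosts are
# hypothetical helper names, not part of the gluster CLI.

# Read `gluster volume info` output on stdin and print each unique brick
# host, e.g. "Brick98: clustr-02:/mnt/data17" -> "clustr-02".
brick_hosts() {
    sed -n 's/^Brick[0-9][0-9]*: *\([^:]*\):.*/\1/p' | sort -u
}

# Resolve every host given as an argument via getaddrinfo()
# (getent ahosts), which honours /etc/nsswitch.conf and /etc/hosts.
# Prints one line per host; returns 1 if any host fails to resolve.
check_hosts() {
    status=0
    for host in "$@"; do
        if getent ahosts "$host" >/dev/null 2>&1; then
            echo "ok:     $host"
        else
            echo "FAILED: $host"
            status=1
        fi
    done
    return "$status"
}
```

On each server, something like `check_hosts $(gluster volume info | brick_hosts)` prints one ok/FAILED line per brick host and returns non-zero if any name fails to resolve.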

I logged into that host, unmounted the brick's filesystem, and ran
fsck.ext4 on it, but it came back clean.

Another thing: the log says "glusterfs: DNS resolution failed on host
/etc/glusterfs/glusterfs.vol", but there is obviously no host named
/etc/glusterfs/glusterfs.vol. Does this point to an issue?

And lastly, I don't even have a file named /etc/glusterfs/glusterfs.vol:

ls -ls /etc/glusterfs
-rw-r--r-- 1 root root  229 Jan 16 21:15 glusterd.vol
-rw-r--r-- 1 root root 1908 Jan 16 21:15 glusterfsd.vol.sample
-rw-r--r-- 1 root root 2005 Jan 16 21:15 glusterfs.vol.sample

I created all of the configs via the gluster> command-line tool.

Thanks

P

On Thu, Feb 3, 2011 at 6:39 PM, Anand Avati <anand.avati at gmail.com> wrote:
> Please make sure you are able to resolve hostnames as given in volume info
> in all of your servers via 'dig'. The logs clearly show that host resolution
> seems to be failing.
> Avati
>
> On Thu, Feb 3, 2011 at 1:08 PM, phil cryer <phil at cryer.us> wrote:
>>
>> This wasn't my issue, but I'm still having the problem. Today I purged
>> glusterfs 3.1.1 and installed 3.1.2 fresh from the .deb package. I
>> recreated my volume and started it, and everything was going fine. I
>> mounted the share and ran df -h to see it, and now every few seconds
>> my logs post this:
>>
>> ==> /var/log/glusterfs/nfs.log <==
>> [2011-02-03 15:55:57.145626] E
>> [client-handshake.c:1079:client_query_portmap_cbk]
>> bhl-volume-client-98: failed to get the port number for remote
>> subvolume
>> [2011-02-03 15:55:57.145694] I [client.c:1590:client_rpc_notify]
>> bhl-volume-client-98: disconnected
>>
>> ==> /var/log/glusterfs/mnt-glusterfs.log <==
>> [2011-02-03 15:55:57.605802] E [common-utils.c:124:gf_resolve_ip6]
>> resolver: getaddrinfo failed (Name or service not known)
>> [2011-02-03 15:55:57.605834] E
>> [name.c:251:af_inet_client_get_remote_sockaddr] glusterfs: DNS
>> resolution failed on host /etc/glusterfs/glusterfs.vol
>>
>> over and over. Any clues as to how I can fix this? This one issue has
>> made our entire 100TB store unusable.
>>
>> And again, gluster volume info shows all the bricks are OK, including 98:
>>
>> gluster> volume info
>>
>> Volume Name: bhl-volume
>> Type: Distributed-Replicate
>> Status: Started
>> Number of Bricks: 72 x 2 = 144
>> Transport-type: tcp
>> Bricks:
>> [...]
>> Brick92: clustr-02:/mnt/data16
>> Brick93: clustr-03:/mnt/data16
>> Brick94: clustr-04:/mnt/data16
>> Brick95: clustr-05:/mnt/data16
>> Brick96: clustr-06:/mnt/data16
>> Brick97: clustr-01:/mnt/data17
>> Brick98: clustr-02:/mnt/data17
>> Brick99: clustr-03:/mnt/data17
>> Brick100: clustr-04:/mnt/data17
>> Brick101: clustr-05:/mnt/data17
>> Brick102: clustr-06:/mnt/data17
>> Brick103: clustr-01:/mnt/data18
>> Brick104: clustr-02:/mnt/data18
>> Brick105: clustr-03:/mnt/data18
>> [...]
>>
>>
>> P
>>
>>
>> On Mon, Jan 31, 2011 at 4:26 PM, Anand Avati <anand.avati at gmail.com>
>> wrote:
>> > Can you post your server logs? What happens if you run 'df -k' on your
>> > backend export filesystems?
>> >
>> > Thanks
>> > Avati
>> >
>> > On Mon, Jan 17, 2011 at 5:27 AM, Joe Warren-Meeks
>> > <joe at encoretickets.co.uk> wrote:
>> >
>> >>
>> >> (Sorry about top-posting.)
>> >>
>> >> Just changing the timeout would only mask the problem. The real issue
>> >> is that running 'df' on either node causes a hang.
>> >>
>> >> All other operations seem fine, files can be created and deleted as
>> >> normal with the results showing up on both.
>> >>
>> >> I'd like to work out why it's hanging on df so I can fix it and get my
>> >> monitoring and cron scripts running again :)
>> >>
>> >> -- joe.
>> >>
>> >> -----Original Message-----
>> >> From: gluster-users-bounces at gluster.org
>> >> [mailto:gluster-users-bounces at gluster.org] On Behalf Of Daniel Maher
>> >> Sent: 17 January 2011 12:48
>> >> To: gluster-users at gluster.org
>> >> Subject: Re: df causes hang
>> >>
>> >> On 01/17/2011 10:47 AM, Joe Warren-Meeks wrote:
>> >> > Hey chaps,
>> >> >
>> >> > Anyone got any pointers as to what this might be? This is still
>> >> > causing a lot of problems for us whenever we attempt to do df.
>> >> >
>> >> > -- joe.
>> >> >
>> >> > -----Original Message-----
>> >>
>> >> > However, for some reason, they've got into a bit of a state such that
>> >> > typing 'df -k' causes both to hang, resulting in a loss of service for
>> >> > 42 seconds. I see the following messages in the log files:
>> >> >
>> >> >
>> >>
>> >> 42 seconds is the default tcp timeout time for any given node - you
>> >> could try tuning that down and seeing how it works for you.
>> >>
>> >>
>> >> http://www.gluster.com/community/documentation/index.php/Gluster_3.1:_Setting_Volume_Options
>> >>
>> >>
>> >> --
>> >> Daniel Maher <dma+gluster AT witbe DOT net>
>> >> _______________________________________________
>> >> Gluster-users mailing list
>> >> Gluster-users at gluster.org
>> >> http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
>> >>
>> >>
>> >
>> >
>>
>>
>>
>> --
>> http://philcryer.com
>
>

-- 
http://philcryer.com