Re: pre6 hanging problems

"August R. Wohlt" <glusterfs@xxxxxxxxxxx> · Thu, 26 Jul 2007 16:00:49 -0400

Hi avati,

When I run it without gdb, it still has the same behavior. It'll run fine
for a few hours under load and then freeze. When it does, the client spews
these to the logs forever. When I kill glusterfs and remount the directory,
everything's fine again:

2007-07-26 12:21:31 D [fuse-bridge.c:344:fuse_entry_cbk] glusterfs-fuse: ERR => -1 (107)
2007-07-26 12:21:31 D [inode.c:285:__destroy_inode] fuse/inode: destroy inode(0)
2007-07-26 12:23:34 W [client-protocol.c:4158:client_protocol_reconnect] brick: attempting reconnect
2007-07-26 12:23:34 D [tcp-client.c:178:tcp_connect] brick: connection on 4 still in progress - try later
2007-07-26 12:29:51 W [client-protocol.c:4158:client_protocol_reconnect] brick: attempting reconnect
2007-07-26 12:29:51 E [tcp-client.c:170:tcp_connect] brick: non-blocking connect() returned: 110 (Connection timed out)
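
For anyone reading these logs later: the numbers in parentheses (107 and 110) look like plain errno values. On Linux, 107 is ENOTCONN and 110 is ETIMEDOUT, which matches the "Connection timed out" text in the last line. Here is a small Python sketch for looking them up on whatever machine you are on. The helper name describe_errno is made up for this example and is not part of glusterfs.

```python
import errno
import os


def describe_errno(code: int) -> str:
    """Return 'NAME: message' for an errno value on the current platform."""
    # errno.errorcode maps numeric values to symbolic names for this platform;
    # values differ between operating systems, so run this on the client host.
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"


if __name__ == "__main__":
    # 107 and 110 are the values that appear in the client log above.
    for code in (107, 110):
        print(code, describe_errno(code))
```

Both codes point at the TCP connection to the brick rather than at the filesystem underneath it.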

:g

On 7/26/07, Anand Avati <avati@xxxxxxxxxxxxx> wrote:

August,
It seems to me that you were running the client in gdb, and for some
reason that particular client bailed out. While bailing out, the client
raises SIGCONT, which was caught by gdb (gdb catches all signals before
letting the signal handlers take over). The backtrace you attached is
NOT a crash; you just had to type 'c' (continue) at the gdb prompt. Most
likely this is what gave the 'hung' effect as well.
Is this reproducible for you?
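
To illustrate the point about signals, here is a minimal Python sketch (not glusterfs code) of a process that raises SIGCONT on itself while a handler is installed. The names on_sigcont and raise_and_continue are invented for this example. Outside a debugger the handler runs and the process simply carries on. Under gdb, the debugger would stop at the raise first and wait for 'c'.

```python
import os
import signal

received = []


def on_sigcont(signum, frame):
    # Record the signal; returning from the handler lets the program go on.
    received.append(signum)


def raise_and_continue() -> list:
    """Send SIGCONT to this process and return the signals the handler saw."""
    signal.signal(signal.SIGCONT, on_sigcont)
    # Comparable in spirit to the raise() frame in the backtrace: the process
    # signals itself. A debugger attached here would intercept the signal.
    os.kill(os.getpid(), signal.SIGCONT)
    return list(received)


if __name__ == "__main__":
    print("handler saw:", raise_and_continue())
    print("still running after the signal")
```

If stopping on this signal gets in the way, gdb's handle command (for example "handle SIGCONT nostop noprint pass") can tell the debugger to pass it straight to the program.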

thanks,
avati

2007/7/26, August R. Wohlt <glusterfs@xxxxxxxxxxx>:
>
> Hi all -
>
> I have a client and server set up with the pre6 version of glusterfs.
> Several times a day the client mount will freeze up, as does any command
> that tries to read from the mountpoint. I have to kill the glusterfs
> process, unmount the directory and remount it to get it working again.
>
> When this happens, there is another glusterfs client on another machine
> connected to the same server that does not get disconnected, so the
> timeout message in the logs is confusing to me. If it's really timing
> out, wouldn't the other client be disconnected, too?
>
> This is on CentOS 5 with fuse 2.7.0-glfs.
>
> When it happens, here's what shows up in the client:
>
> ...
> 2007-07-25 09:45:59 D [inode.c:327:__active_inode] fuse/inode: activating inode(4210807), lru=0/1024
> 2007-07-25 09:45:59 D [inode.c:285:__destroy_inode] fuse/inode: destroy inode(4210807)
> 2007-07-25 12:37:26 W [client-protocol.c:211:call_bail] brick: activating bail-out. pending frames = 1. last sent = 2007-07-25 12:33:42. last received = 2007-07-25 11:42:59 transport-timeout = 120
> 2007-07-25 12:37:26 C [client-protocol.c:219:call_bail] brick: bailing transport
> 2007-07-25 12:37:26 W [client-protocol.c:4189:client_protocol_cleanup] brick: cleaning up state in transport object 0x80a03d0
> 2007-07-25 12:37:26 W [client-protocol.c:4238:client_protocol_cleanup] brick: forced unwinding frame type(0) op(15)
> 2007-07-25 12:37:26 C [tcp.c:81:tcp_disconnect] brick: connection disconnected
>
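
A quick check of the numbers in the call_bail line above, as a Python sketch. The timestamps and the 120 second transport-timeout come straight from the log. The should_bail rule (a frame is pending and both gaps exceed the timeout) is only an assumption about how the bail-out decision might work, not the actual glusterfs logic, and both function names are invented for this example.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def idle_seconds(now: str, earlier: str) -> float:
    """Seconds elapsed between two log timestamps."""
    delta = datetime.strptime(now, FMT) - datetime.strptime(earlier, FMT)
    return delta.total_seconds()


def should_bail(now, last_sent, last_received, timeout, pending_frames):
    # ASSUMPTION: bail out only when a request is still pending and nothing
    # has been sent or received within the timeout window.
    return (
        pending_frames > 0
        and idle_seconds(now, last_sent) > timeout
        and idle_seconds(now, last_received) > timeout
    )


if __name__ == "__main__":
    now = "2007-07-25 12:37:26"            # time of the bail-out message
    last_sent = "2007-07-25 12:33:42"      # from the log
    last_received = "2007-07-25 11:42:59"  # from the log
    print("since last sent:", idle_seconds(now, last_sent))
    print("since last received:", idle_seconds(now, last_received))
    print("bail:", should_bail(now, last_sent, last_received, 120, 1))
```

By the log's own timestamps the request had been outstanding for 224 seconds and nothing had come back from the server for 3267 seconds, both well past the 120 second timeout.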
> When it happens, here's what shows up in the server:
>
> 2007-07-25 15:37:40 E [protocol.c:346:gf_block_unserialize_transport] libglusterfs/protocol: full_read of block failed: peer (192.168.2.3:1023)
> 2007-07-25 15:37:40 C [tcp.c:81:tcp_disconnect] server: connection disconnected
> 2007-07-25 15:37:40 E [protocol.c:251:gf_block_unserialize_transport] libglusterfs/protocol: EOF from peer (192.168.2.4:1023)
> 2007-07-25 15:37:40 C [tcp.c:81:tcp_disconnect] server: connection disconnected
>
> And here's the client backtrace:
>
> (gdb) bt
> #0  0x0032e7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
> #1  0x005a3824 in raise () from /lib/tls/libpthread.so.0
> #2  0x00655b0c in tcp_bail (this=0x80a03d0) at ../../../../transport/tcp/tcp.c:146
> #3  0x00695bbc in transport_bail (this=0x80a03d0) at transport.c:192
> #4  0x00603a16 in call_bail (trans=0x80a03d0) at client-protocol.c:220
> #5  0x00696870 in gf_timer_proc (ctx=0xbffeec30) at timer.c:119
> #6  0x0059d3cc in start_thread () from /lib/tls/libpthread.so.0
> #7  0x00414c3e in clone () from /lib/tls/libc.so.6
>
>
> client config:
>
> ### Add client feature and attach to remote subvolume
> volume brick
>    type protocol/client
>    option transport-type tcp/client     # for TCP/IP transport
>    option remote-host 192.168.2.5       # IP address of the remote brick
>    option remote-subvolume brick_1  # name of the remote volume
> end-volume
>
> # #### Add writeback feature
>   volume brick-wb
>     type performance/write-behind
>     option aggregate-size 131072 # unit in bytes
>     subvolumes brick
>   end-volume
>
> server config:
>
> ### Export volume "brick" with the contents of "/home/export" directory.
>
> volume brick_1
>    type storage/posix
>    option directory /home/vg_3ware1/vivalog/brick_1
> end-volume
>
> volume brick_2
>    type storage/posix
>    option directory /home/vg_3ware1/vivalog/brick_2
> end-volume
>
> ### Add network serving capability to above brick.
> volume server
>    type protocol/server
>    option transport-type tcp/server     # For TCP/IP transport
>    option bind-address 192.168.2.5     # Default is to listen on all interfaces
>    subvolumes brick_1
>    option auth.ip.brick_2.allow * # Allow access to "brick" volume
>    option auth.ip.brick_1.allow * # Allow access to "brick" volume
> end-volume
>
> P.S. I have one server serving two volume bricks to two physically
> distinct clients. I assume this is okay, and that I don't need two
> separate server declarations.

--
Anand V. Avati