Hello,

I've got a basic GlusterFS setup: one volume with a single brick on a single server (which I'll call "server"), accessed by two clients (which I'll call "client1" and "client2"), all connected over ordinary Ethernet. All the systems run CentOS 6.3 and Gluster 3.3 (from the RPMs provided on gluster.org). Both clients mount the volume with the FUSE client via fstab, using only default settings, and the volume transport is tcp.

One of the two clients, client1, works perfectly: it can mount the volume and access all the data reliably and without any issues. The second client, client2, has a lot of problems. Sometimes the volume mounts correctly and everything works well until you reboot, at which point it hangs on startup and eventually times out trying to mount the volume. I also see "interrupted system call" during boot on client2, whether the mount succeeds or fails. I run a separate test setup with a very similar configuration, but with more clients, and they all work fine.

The server filters hosts via iptables, and I have opened the necessary ports specifically for both clients. Here's the relevant excerpt from the iptables configuration:

    ACCEPT  tcp  --  [client1ip]  0.0.0.0/0  state NEW tcp dpts:24007:24010
    ACCEPT  tcp  --  [client1ip]  0.0.0.0/0  state NEW tcp dpt:111
    ACCEPT  udp  --  [client1ip]  0.0.0.0/0  state NEW udp dpt:111
    ACCEPT  tcp  --  [client2ip]  0.0.0.0/0  state NEW tcp dpts:24007:24010
    ACCEPT  tcp  --  [client2ip]  0.0.0.0/0  state NEW tcp dpt:111
    ACCEPT  udp  --  [client2ip]  0.0.0.0/0  state NEW udp dpt:111

I have also tried disabling the firewall completely, and it appears to make no difference.
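For reference, the clients use a plain fstab entry along these lines (the mount point matches the /data path that appears in the logs; the exact line here is illustrative, with nothing beyond the defaults):

```
# /etc/fstab on both clients -- default FUSE mount, no extra options
server:data-volume  /data  glusterfs  defaults  0 0
```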
I have the Gluster IP access filter explicitly set to allow all:

    Volume Name: data-volume
    Type: Distribute
    Volume ID: c4140398-393d-414d-9062-d4ce26a90db6
    Status: Started
    Number of Bricks: 1
    Transport-type: tcp
    Bricks:
    Brick1: server:/data
    Options Reconfigured:
    auth.allow: *

As I mentioned, client1 can access everything without any issues, but client2 will, about 99% of the time, fail to mount the volume at all. Very occasionally it will succeed, and everything will work fine until the machine is rebooted, at which point it goes back to failing. Here is what I see in the log, with full debug, when I attempt a manual mount on client2 (edited to redact addresses):

    # mount.glusterfs server:data-volume -o log-level=TRACE /data
    Mount failed. Please check the log file for more details.
    # cat /var/log/glusterfs/data.log
    [2012-08-28 16:18:05.679911] I [glusterfsd.c:1666:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.3.0
    [2012-08-28 16:18:05.679971] T [xlator.c:198:xlator_dynload] 0-xlator: attempt to load file /usr/lib64/glusterfs/3.3.0/xlator/mount/fuse.so
    [2012-08-28 16:18:05.680242] T [xlator.c:250:xlator_dynload] 0-xlator: dlsym(reconfigure) on /usr/lib64/glusterfs/3.3.0/xlator/mount/fuse.so: undefined symbol: reconfigure -- neglecting
    [2012-08-28 16:18:05.680274] D [glusterfsd.c:395:create_fuse_mount] 0-: fuse direct io type 2
    [2012-08-28 16:18:05.681122] D [rpc-clnt.c:973:rpc_clnt_connection_init] 0-glusterfs: defaulting frame-timeout to 30mins
    [2012-08-28 16:18:05.681200] D [rpc-transport.c:248:rpc_transport_load] 0-rpc-transport: attempt to load file /usr/lib64/glusterfs/3.3.0/rpc-transport/socket.so
    [2012-08-28 16:18:05.681517] T [options.c:77:xlator_option_validate_int] 0-glusterfs: no range check required for 'option remote-port 24007'
    [2012-08-28 16:18:05.681555] D [rpc-clnt.c:1379:rpcclnt_cbk_program_register] 0-glusterfs: New program registered: GlusterFS Callback, Num: 52743234, Ver: 1
    [2012-08-28 16:18:05.681572] T [rpc-clnt.c:429:rpc_clnt_reconnect] 0-glusterfs: attempting reconnect
    [2012-08-28 16:18:05.681598] T [common-utils.c:111:gf_resolve_ip6] 0-resolver: DNS cache not present, freshly probing hostname: server
    [2012-08-28 16:18:05.763206] D [common-utils.c:151:gf_resolve_ip6] 0-resolver: returning ip-XXX.XXX.XXX.XXX (port-24007) for hostname: server and port: 24007
    [2012-08-28 16:18:05.768441] T [socket.c:370:__socket_nodelay] 0-glusterfs: NODELAY enabled for socket 8
    [2012-08-28 16:18:05.768473] T [socket.c:424:__socket_keepalive] 0-glusterfs: Keep-alive enabled for socket 8, interval 2, idle: 20
    [2012-08-28 16:18:08.769439] T [rpc-clnt.c:429:rpc_clnt_reconnect] 0-glusterfs: attempting reconnect
    [2012-08-28 16:18:08.769737] T [socket.c:2003:socket_connect] (-->/lib64/libpthread.so.0() [0x3817e07851] (-->/usr/lib64/libglusterfs.so.0(gf_timer_proc+0xd0) [0x3c9e22a880] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_reconnect+0x99) [0x3c9de0e999]))) 0-glusterfs: connect () called on transport already connected
    [2012-08-28 16:18:11.770136] T [rpc-clnt.c:429:rpc_clnt_reconnect] 0-glusterfs: attempting reconnect
    ... (reconnects repeat a total of 22 times, all with the same "transport already connected" message) ...
    [2012-08-28 16:19:08.778012] T [rpc-clnt.c:429:rpc_clnt_reconnect] 0-glusterfs: attempting reconnect
    [2012-08-28 16:19:08.778295] T [socket.c:2003:socket_connect] (-->/lib64/libpthread.so.0() [0x3817e07851] (-->/usr/lib64/libglusterfs.so.0(gf_timer_proc+0xd0) [0x3c9e22a880] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_reconnect+0x99) [0x3c9de0e999]))) 0-glusterfs: connect () called on transport already connected
    [2012-08-28 16:19:08.881543] E [socket.c:1715:socket_connect_finish] 0-glusterfs: connection to  failed (Connection timed out)
    [2012-08-28 16:19:08.881602] D [socket.c:280:__socket_disconnect] 0-glusterfs: shutdown() returned -1. Transport endpoint is not connected
    [2012-08-28 16:19:08.881629] T [rpc-clnt.c:535:rpc_clnt_connection_cleanup] 0-glusterfs: cleaning up state in transport object 0x182b760
    [2012-08-28 16:19:08.881653] E [glusterfsd-mgmt.c:1783:mgmt_rpc_notify] 0-glusterfsd-mgmt: failed to connect with remote-host: Transport endpoint is not connected
    [2012-08-28 16:19:08.881672] I [glusterfsd-mgmt.c:1786:mgmt_rpc_notify] 0-glusterfsd-mgmt: -1 connect attempts left
    [2012-08-28 16:19:08.881739] W [glusterfsd.c:831:cleanup_and_exit] (-->/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x28) [0x3c9de0b018] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xc0) [0x3c9de0f830] (-->/usr/sbin/glusterfs() [0x40c026]))) 0-: received signum (1), shutting down
    [2012-08-28 16:19:08.881776] D [glusterfsd-mgmt.c:2154:glusterfs_mgmt_pmap_signout] 0-fsd-mgmt: portmapper signout arguments not given
    [2012-08-28 16:19:08.881803] I [fuse-bridge.c:4643:fini] 0-fuse: Unmounting '/data'.

The server typically logs nothing during these attempts, but occasionally logs something like this:

    # cat /var/log/glusterfs/bricks/data.log
    [2012-08-26 03:17:01.378270] I [glusterfsd-mgmt.c:1565:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
    [2012-08-28 13:25:08.231095] I [server-handshake.c:571:server_setvolume] 0-data-volume-server: accepted client from client2-2896-2012/08/28-13:25:04:222124-data-volume-client-0-0 (version: 3.3.0)
    [2012-08-28 13:27:22.860474] W [socket.c:195:__socket_rwv] 0-tcp.data-volume-server: readv failed (Connection timed out)
    [2012-08-28 13:27:22.860555] I [server.c:685:server_rpc_notify] 0-data-volume-server: disconnecting connectionfrom client2-2896-2012/08/28-13:25:04:222124-data-volume-client-0-0
    [2012-08-28 13:27:22.860577] I [server-helpers.c:741:server_connection_put] 0-data-volume-server: Shutting down connection client2-2896-2012/08/28-13:25:04:222124-data-volume-client-0-0
    [2012-08-28 13:27:22.860602] I [server-helpers.c:629:server_connection_destroy] 0-data-volume-server: destroyed connection of client2-2896-2012/08/28-13:25:04:222124-data-volume-client-0-0
    [2012-08-28 13:27:35.079569] I [server-handshake.c:571:server_setvolume] 0-data-volume-server: accepted client from client2-2867-2012/08/28-13:27:31:164208-data-volume-client-0-0 (version: 3.3.0)
    [2012-08-28 13:35:02.540383] W [socket.c:195:__socket_rwv] 0-tcp.data-volume-server: readv failed (Connection timed out)
    [2012-08-28 13:35:02.540462] I [server.c:685:server_rpc_notify] 0-data-volume-server: disconnecting connectionfrom client2-2867-2012/08/28-13:27:31:164208-data-volume-client-0-0
    [2012-08-28 13:35:02.540486] I [server-helpers.c:741:server_connection_put] 0-data-volume-server: Shutting down connection client2-2867-2012/08/28-13:27:31:164208-data-volume-client-0-0
    [2012-08-28 13:35:02.540511] I [server-helpers.c:629:server_connection_destroy] 0-data-volume-server: destroyed connection of client2-2867-2012/08/28-13:27:31:164208-data-volume-client-0-0
    [2012-08-28 13:35:19.171448] I [server-handshake.c:571:server_setvolume] 0-data-volume-server: accepted client from client2-2857-2012/08/28-13:35:10:386705-data-volume-client-0-0 (version: 3.3.0)
    [2012-08-28 13:41:36.513626] W [socket.c:195:__socket_rwv] 0-tcp.data-volume-server: readv failed (Connection timed out)
    [2012-08-28 13:41:36.513707] I [server.c:685:server_rpc_notify] 0-data-volume-server: disconnecting connectionfrom client2-2857-2012/08/28-13:35:10:386705-data-volume-client-0-0
    [2012-08-28 13:41:36.513726] I [server-helpers.c:741:server_connection_put] 0-data-volume-server: Shutting down connection client2-2857-2012/08/28-13:35:10:386705-data-volume-client-0-0
    [2012-08-28 13:41:36.513752] I [server-helpers.c:629:server_connection_destroy] 0-data-volume-server: destroyed connection of client2-2857-2012/08/28-13:35:10:386705-data-volume-client-0-0

I don't know whether these readv failures are actually related to the problem, though.
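To rule out basic TCP reachability, this is the kind of quick probe I can run from client2 (the hostname "server" and the port list simply mirror the iptables rules above; the script itself is just an illustrative check, not part of the gluster tooling):

```python
#!/usr/bin/env python
# Quick TCP reachability probe for the GlusterFS ports opened in iptables.
# "server" is a placeholder for the real brick host.
import socket

def port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        s.connect((host, port))
        return True
    except socket.error:
        # Covers refused, timed out, and unresolvable-host errors alike.
        return False
    finally:
        s.close()

if __name__ == "__main__":
    # Portmapper plus the glusterd/brick port range from the firewall rules.
    for port in [111, 24007, 24008, 24009, 24010]:
        state = "open" if port_open("server", port) else "CLOSED/unreachable"
        print("port %d: %s" % (port, state))
```

If any of these report closed only from client2, that would point at the network path rather than gluster itself.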
Thanks,
Leo