Re: AFR: machine crash hangs other mounts or transport endpoint not connected


 



Krishna Srinivas wrote:
On Thu, May 8, 2008 at 9:19 PM, Gerry Reno <greno@xxxxxxxxxxx> wrote:
Krishna Srinivas wrote:

Gerry,

In your client spec, "client-local" does not serve any purpose, right?

This is your setup:
server1 and server2 have /home/vmail/mailbrick as storage exports.
On the client you have an AFR that connects to server1 and server2.
The client mounts it on /home/vmail/mailstore.

Can you try mounting from the command line instead of fstab?
When you kill one of the servers, can you check whether anything
shows up in the log files?

Also add "option transport-timeout 5" to the two "client/protocol"
subvolumes (so the timeout will be 5 secs).
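
For reference, a client spec with that option might look roughly like the
fragment below. This is only a sketch: the volume names client1, client2 and
afr are taken from the log messages in this thread, but the remote-host and
remote-subvolume values are assumptions, since the actual client.vol was not
posted here.

```
# Sketch of a client spec; remote-host and remote-subvolume are assumed values.
volume client1
  type protocol/client
  option transport-type tcp/client
  option remote-host server1          # assumed host name
  option remote-subvolume mailbrick   # assumed export name
  option transport-timeout 5          # give up on an unresponsive server after 5 secs
end-volume

volume client2
  type protocol/client
  option transport-type tcp/client
  option remote-host server2          # assumed host name
  option remote-subvolume mailbrick   # assumed export name
  option transport-timeout 5
end-volume

volume afr
  type cluster/afr
  subvolumes client1 client2
end-volume
```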

Thanks
Krishna



 Two machines.
 Each machine has a server storage brick (/home/vmail/mailbrick)
 Each machine also has a client (/home/vmail/mailstore)
 If one of the machines either crashes or needs to be rebooted, it hangs
the client mount on the other machine.

 I'll umount the mount from fstab and remount from command line and let you
know.

 Regards,
 Gerry




Ok, I ran some tests:
First, when I started, I noticed that 'df' showed two client mounts on one machine and one client mount on the other. I unmounted the clients that had been mounted from fstab, changed client.vol to include "option transport-timeout 5", and then started the clients from the command line. With that, I see one client mount on each machine. I killed one machine and the other machine still functioned. I did this a couple of times.

Then I left the timeout in the vol file and just rebooted both machines. They both came back up; df shows two client mounts on both machines and ps shows two client processes on both machines. I killed one machine again and the other machine still functioned. So I was not able to recreate the hang.

I checked the logs and I can see that there are thousands of lines like the following over the past weeks in both logs:

2008-04-26 00:27:55 E [client-protocol.c:4405:client_lookup_cbk] client2: no proper reply from server, returning ENOTCONN
2008-04-26 00:27:55 E [tcp-client.c:190:tcp_connect] client2: non-blocking connect() returned: 111 (Connection refused)
2008-04-26 00:27:55 W [client-protocol.c:331:client_protocol_xfer] client2: not connected at the moment to submit frame type(1) op(22)
2008-04-26 00:27:55 E [client-protocol.c:3742:client_opendir_cbk] client2: no proper reply from server, returning ENOTCONN
2008-04-26 00:27:55 E [afr_self_heal.c:290:afr_lds_opendir_cbk] afr: op_ret=-1 op_errno=107
2008-04-26 00:27:55 E [afr_self_heal.c:290:afr_lds_opendir_cbk] afr: op_ret=-1 op_errno=24
2008-04-26 00:27:55 E [fuse-bridge.c:459:fuse_entry_cbk] glusterfs-fuse: 11084: (34) /example.com/john => -1 (5)
2008-04-26 00:27:55 E [tcp-client.c:190:tcp_connect] client2: non-blocking connect() returned: 111 (Connection refused)
2008-04-26 00:27:55 W [client-protocol.c:331:client_protocol_xfer] client2: not connected at the moment to submit frame type(1) op(34)
2008-04-26 00:27:55 E [client-protocol.c:4405:client_lookup_cbk] client2: no proper reply from server, returning ENOTCONN
2008-04-26 00:27:55 E [tcp-client.c:190:tcp_connect] client2: non-blocking connect() returned: 111 (Connection refused)
2008-04-26 00:27:55 W [client-protocol.c:331:client_protocol_xfer] client2: not connected at the moment to submit frame type(1) op(34)
2008-04-26 00:27:55 E [client-protocol.c:4405:client_lookup_cbk] client2: no proper reply from server, returning ENOTCONN
2008-04-26 00:27:55 E [tcp-client.c:190:tcp_connect] client2: non-blocking connect() returned: 111 (Connection refused)
2008-04-26 00:27:55 W [client-protocol.c:331:client_protocol_xfer] client2: not connected at the moment to submit frame type(1) op(34)


2008-04-25 19:47:47 E [afr.c:2018:afr_open_cbk] afr: (path=/example.com/john/dovecot-uidlist.lock child=client2) op_ret=-1 op_errno=2
2008-04-25 19:47:47 E [afr.c:2018:afr_open_cbk] afr: (path=/example.com/john/dovecot-uidlist.lock child=client1) op_ret=-1 op_errno=2
2008-04-25 19:47:47 E [fuse-bridge.c:692:fuse_fd_cbk] glusterfs-fuse: 5775: (12) /example.com/john/dovecot-uidlist.lock => -1 (2)

2008-04-25 13:09:02 W [fuse-bridge.c:402:fuse_entry_cbk] glusterfs-fuse: 3883: (34) /example.com/gerryreno/dovecot-keywords => 566935 Rehashing because st_nlink less than dentry maps
2008-04-25 13:09:02 E [fuse-bridge.c:1140:fuse_unlink] glusterfs-fuse: 3894: UNLINK /example.com/gerryreno/dovecot-uidlist (fuse_loc_fill() returned NULL inode)
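
With thousands of lines like these, it may help to tally them by level and function instead of reading them one by one. The following is a minimal sketch, assuming every entry starts with a "YYYY-MM-DD HH:MM:SS <level> [file:line:function]" prefix as in the excerpts above; the helper names split_entries and count_by_function are made up for illustration and are not part of GlusterFS.

```python
import re
from collections import Counter

# Zero-width match at the start of each entry:
# "YYYY-MM-DD HH:MM:SS <level> [" (assumed from the excerpts in this thread).
ENTRY_START = re.compile(r"(?=\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} [A-Z] \[)")

# Full entry layout: timestamp, level, [file:line:function], message.
ENTRY = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (?P<level>[A-Z]) "
    r"\[(?P<file>[^:\]]+):(?P<line>\d+):(?P<func>[^\]]+)\] (?P<msg>.*)"
)


def split_entries(text):
    """Split log text into entries, even if several share one line."""
    return [part.strip() for part in ENTRY_START.split(text) if part.strip()]


def count_by_function(text):
    """Count entries per (level, function) pair; skip anything unparseable."""
    counts = Counter()
    for entry in split_entries(text):
        match = ENTRY.match(entry)
        if match:
            counts[(match.group("level"), match.group("func"))] += 1
    return counts


# Three entries copied from the log above, deliberately joined on one line.
SAMPLE = (
    "2008-04-26 00:27:55 E [client-protocol.c:4405:client_lookup_cbk] "
    "client2: no proper reply from server, returning ENOTCONN "
    "2008-04-26 00:27:55 E [tcp-client.c:190:tcp_connect] "
    "client2: non-blocking connect() returned: 111 (Connection refused) "
    "2008-04-26 00:27:55 W [client-protocol.c:331:client_protocol_xfer] "
    "client2: not connected at the moment to submit frame type(1) op(22)"
)

if __name__ == "__main__":
    for (level, func), n in count_by_function(SAMPLE).most_common():
        print(level, func, n)
```

To run it against a real log, read the file contents into a string and pass that to count_by_function in place of SAMPLE.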




Anyway, I wasn't able to reproduce the hang with transport-timeout set. I'm still trying to work out why there are two client mounts when mounting from fstab, though. That seems strange.
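
One way to keep an eye on the double mounts is to look for mount points that are listed more than once in the mount table. This is a rough sketch, assuming the Linux /proc/mounts layout (device, mount point, filesystem type, options, ...); the function name and the sample text are made up for illustration and are not output from these machines.

```python
from collections import Counter


def duplicate_mount_points(mounts_text):
    """Return {mount_point: count} for mount points listed more than once."""
    counts = Counter()
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            # The second field of a /proc/mounts line is the mount point.
            counts[fields[1]] += 1
    return {point: n for point, n in counts.items() if n > 1}


# Made-up sample in /proc/mounts layout, not captured from a real system.
SAMPLE_MOUNTS = """\
/dev/sda1 / ext3 rw 0 0
glusterfs /home/vmail/mailstore fuse rw,nosuid,nodev 0 0
glusterfs /home/vmail/mailstore fuse rw,nosuid,nodev 0 0
"""

if __name__ == "__main__":
    print(duplicate_mount_points(SAMPLE_MOUNTS))
```

On a live machine the same function could be fed the contents of /proc/mounts instead of the sample text.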

Regards,
Gerry


