Re: no such file of directory errors running find -exec on large trees

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Avati,

Patch 176 fixed the spurious disconnect problem and I was able to run the find -exec test to completion as well as a reindex on a large zimbra mailbox. I consider this bug closed. Good job.

I am still having a problem that shows up when doing a full backup from Zimbra. This causes many files, some large, to be copied from one folder tree on the cluster to another on the same cluster. I am getting disconnects after failures on large blocks. This ultimately caused the glusterfs client to stall in a state where it had to be killed. Any thoughts?

Harris

----- Original Message -----
From: "Anand Avati" <avati@xxxxxxxxxxxxx>
To: "Harris Landgarten" <harrisl@xxxxxxxxxxxxx>
Cc: "gluster-devel" <gluster-devel@xxxxxxxxxx>
Sent: Monday, June 4, 2007 8:44:54 AM (GMT-0500) America/New_York
Subject: Re: no such file of directory errors running find -exec on large trees

Harris,
  a bug fix in the timer code has been committed which used to cause
spurious transport timeout events causing false bailing. please try
with the latest tla checkout from glusterfs--mainline--2.4.

thanks,
avati

2007/5/26, Harris Landgarten <harrisl@xxxxxxxxxxxxx>:
> I have been running some tests on a gluster tree with over 100,000 files using find. When I run find /mnt/gluster -type f it completes without error. When I combine the find with -exec that causes the files to be read (ie find /mnt/gluster -type f -exec head -n1 {} \; ) The process fails to complete 100% of the time with "No such file or directory" errors. The errors do not occur until after at least 30,000 files have been processed. After the process fails the clients and servers still appear to be running properly. The errors seem to coincide with the following in the logs:
>
> Client running process:
>
> [May 25 10:11:14] [ERROR/client-protocol.c:204/client_protocol_xfer()] protocol/client:transport_submit failed
> [May 25 10:11:14] [ERROR/client-protocol.c:204/client_protocol_xfer()] protocol/client:transport_submit failed
> [May 25 10:11:14] [ERROR/client-protocol.c:204/client_protocol_xfer()] protocol/client:transport_submit failed
> [May 25 10:11:14] [ERROR/client-protocol.c:204/client_protocol_xfer()] protocol/client:transport_submit failed
> [May 25 10:11:14] [ERROR/client-protocol.c:204/client_protocol_xfer()] protocol/client:transport_submit failed
> [May 25 10:11:14] [ERROR/client-protocol.c:204/client_protocol_xfer()] protocol/client:transport_submit failed
> ...
> [May 25 10:12:34] [CRITICAL/tcp.c:81/tcp_disconnect()] transport/tcp:client2: connection to server disconnected
> [May 25 10:12:34] [ERROR/client-protocol.c:204/client_protocol_xfer()] protocol/client:transport_submit failed
> [May 25 10:12:34] [ERROR/client-protocol.c:204/client_protocol_xfer()] protocol/client:transport_submit failed
> [May 25 10:12:34] [CRITICAL/client-protocol.c:218/call_bail()] client/protocol:bailing transport
> [May 25 10:12:34] [CRITICAL/client-protocol.c:218/call_bail()] client/protocol:bailing transport
> [May 25 10:12:34] [CRITICAL/client-protocol.c:218/call_bail()] client/protocol:bailing transport
> [May 25 10:12:34] [CRITICAL/client-protocol.c:218/call_bail()] client/protocol:bailing transport
> [May 25 10:12:34] [CRITICAL/client-protocol.c:218/call_bail()] client/protocol:bailing transport
> [May 25 10:12:34] [CRITICAL/client-protocol.c:218/call_bail()] client/protocol:bailing transport
> [May 25 10:12:34] [CRITICAL/client-protocol.c:218/call_bail()] client/protocol:bailing transport
> [May 25 10:12:34] [ERROR/tcp.c:110/tcp_except()] transport/tcp:shutdown () - error: Transport endpoint is not connected
> [May 25 10:12:34] [ERROR/common-utils.c:55/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113 (errno=104)
> [May 25 10:12:34] [CRITICAL/tcp.c:81/tcp_disconnect()] transport/tcp:client2: connection to server disconnected
> [May 25 10:12:34] [ERROR/common-utils.c:55/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113 (errno=9)
> ...
>
> brick1:
>
> [May 25 10:11:26] [ERROR/common-utils.c:55/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113 (errno=2)
> [May 25 10:11:26] [CRITICAL/tcp.c:81/tcp_disconnect()] transport/tcp:server: connection to server disconnected
> [May 25 10:11:26] [ERROR/common-utils.c:55/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113 (errno=2)
> [May 25 10:11:26] [CRITICAL/tcp.c:81/tcp_disconnect()] transport/tcp:server: connection to server disconnected
> [May 25 10:11:26] [ERROR/common-utils.c:55/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113 (errno=2)
> [May 25 10:11:26] [CRITICAL/tcp.c:81/tcp_disconnect()] transport/tcp:server: connection to server disconnected
> [May 25 10:11:26] [ERROR/common-utils.c:110/full_rwv()] libglusterfs:full_rwv: 50680 bytes r/w instead of 124559 (Broken pipe)
> [May 25 10:11:26] [ERROR/proto-srv.c:117/generic_reply()] protocol/server:transport_writev failed
> [May 25 10:11:26] [ERROR/tcp.c:110/tcp_except()] transport/tcp:shutdown () - error: Transport endpoint is not connected
> [May 25 10:11:26] [ERROR/common-utils.c:55/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113 (errno=107)
> [May 25 10:11:26] [CRITICAL/tcp.c:81/tcp_disconnect()] transport/tcp:server: connection to server disconnected
> [May 25 10:11:43] [ERROR/common-utils.c:55/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113 (errno=9)
> ...
>
> brick2:
>
> [May 25 10:11:25] [ERROR/common-utils.c:55/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113 (errno=2)
> [May 25 10:11:25] [CRITICAL/tcp.c:81/tcp_disconnect()] transport/tcp:server: connection to server disconnected
> [May 25 10:11:25] [ERROR/common-utils.c:55/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113 (errno=9)
> [May 25 10:11:25] [CRITICAL/tcp.c:81/tcp_disconnect()] transport/tcp:server: connection to server disconnected
> [May 25 10:11:43] [ERROR/common-utils.c:55/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113 (errno=2)
> [May 25 10:11:43] [CRITICAL/tcp.c:81/tcp_disconnect()] transport/tcp:server: connection to server disconnected
> [May 25 10:12:46] [ERROR/common-utils.c:55/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113 (errno=2)
> [May 25 10:12:46] [CRITICAL/tcp.c:81/tcp_disconnect()] transport/tcp:server: connection to server disconnected
> [May 25 10:12:46] [ERROR/common-utils.c:55/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113 (errno=2)
> [May 25 10:12:46] [CRITICAL/tcp.c:81/tcp_disconnect()] transport/tcp:server: connection to server disconnected
> [May 25 10:12:46] [ERROR/common-utils.c:55/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113 (errno=2)
> [May 25 10:12:46] [CRITICAL/tcp.c:81/tcp_disconnect()] transport/tcp:server: connection to server disconnected
> [May 25 10:12:46] [ERROR/common-utils.c:55/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113 (errno=9)
> [May 25 10:12:46] [CRITICAL/tcp.c:81/tcp_disconnect()] transport/tcp:server: connection to server disconnected
> [May 25 10:12:46] [ERROR/common-utils.c:55/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113 (errno=9)
> [May 25 10:12:46] [CRITICAL/tcp.c:81/tcp_disconnect()] transport/tcp:server: connection to server disconnected
> [May 25 10:12:46] [ERROR/common-utils.c:55/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113 (errno=9)
> [May 25 10:12:46] [CRITICAL/tcp.c:81/tcp_disconnect()] transport/tcp:server: connection to server disconnected
> ...
>
> Client2 running (logging = debug)
>
> [May 25 10:12:06] [CRITICAL/client-protocol.c:218/call_bail()] client/protocol:bailing transport
> [May 25 10:12:06] [DEBUG/tcp.c:123/cont_hand()] tcp:forcing poll/read/write to break on blocked socket (if any)
> [May 25 10:12:06] [ERROR/common-utils.c:55/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113 (errno=115)
> [May 25 10:12:06] [DEBUG/protocol.c:244/gf_block_unserialize_transport()] libglusterfs/protocol:gf_block_unserialize_transport: full_read of header failed
> [May 25 10:12:06] [DEBUG/client-protocol.c:2605/client_protocol_cleanup()] protocol/client:cleaning up state in transport object 0x8077418
> [May 25 10:12:06] [CRITICAL/tcp.c:81/tcp_disconnect()] transport/tcp:client1: connection to server disconnected
> [May 25 10:12:06] [DEBUG/tcp-client.c:180/tcp_connect()] transport: tcp: :try_connect: socket fd = 4
> [May 25 10:12:06] [DEBUG/tcp-client.c:202/tcp_connect()] transport: tcp: :try_connect: finalized on port `1023'
> [May 25 10:12:06] [DEBUG/tcp-client.c:226/tcp_connect()] tcp/client:try_connect: defaulting remote-port to 6996
> [May 25 10:12:06] [DEBUG/tcp-client.c:262/tcp_connect()] tcp/client:connect on 4 in progress (non-blocking)
> [May 25 10:12:06] [DEBUG/tcp-client.c:301/tcp_connect()] tcp/client:connection on 4 still in progress - try later
> [May 25 10:12:06] [ERROR/client-protocol.c:204/client_protocol_xfer()] protocol/client:transport_submit failed
> [May 25 10:12:06] [DEBUG/client-protocol.c:2605/client_protocol_cleanup()] protocol/client:cleaning up state in transport object 0x8077418
> [May 25 10:12:26] [DEBUG/tcp-client.c:310/tcp_connect()] tcp/client:connection on 4 success, attempting to handshake
> [May 25 10:12:26] [DEBUG/tcp-client.c:54/do_handshake()] transport/tcp-client:dictionary length = 50
> [May 25 10:12:26] [CRITICAL/client-protocol.c:218/call_bail()] client/protocol:bailing transport
> [May 25 10:12:26] [DEBUG/tcp.c:123/cont_hand()] tcp:forcing poll/read/write to break on blocked socket (if any)
> [May 25 10:13:11] [CRITICAL/client-protocol.c:218/call_bail()] client/protocol:bailing transport
> [May 25 10:13:11] [DEBUG/tcp.c:123/cont_hand()] tcp:forcing poll/read/write to break on blocked socket (if any)
> [May 25 10:13:11] [ERROR/common-utils.c:55/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113 (errno=115)
> [May 25 10:13:11] [DEBUG/protocol.c:244/gf_block_unserialize_transport()] libglusterfs/protocol:gf_block_unserialize_transport: full_read of header failed
> [May 25 10:13:11] [DEBUG/client-protocol.c:2605/client_protocol_cleanup()] protocol/client:cleaning up state in transport object 0x8077418
> [May 25 10:13:11] [CRITICAL/tcp.c:81/tcp_disconnect()] transport/tcp:client1: connection to server disconnected
> [May 25 10:13:11] [DEBUG/tcp-client.c:180/tcp_connect()] transport: tcp: :try_connect: socket fd = 4
> [May 25 10:13:11] [DEBUG/tcp-client.c:202/tcp_connect()] transport: tcp: :try_connect: finalized on port `1022'
> [May 25 10:13:11] [DEBUG/tcp-client.c:226/tcp_connect()] tcp/client:try_connect: defaulting remote-port to 6996
> [May 25 10:13:11] [DEBUG/tcp-client.c:262/tcp_connect()] tcp/client:connect on 4 in progress (non-blocking)
> [May 25 10:13:11] [DEBUG/tcp-client.c:301/tcp_connect()] tcp/client:connection on 4 still in progress - try later
> ...
>
> Both clients are running with read-ahead and write_behind on and stat_prefetch off. All machines are on current mainline-2.4 as of patch 174. Since both bricks are seeing the same error does that mean the client is causing the error? The errors do not alway start at the same file but always happen after a lot of files are processed.
>
> Are there any specific tests you would like run?
>
> Best,
>
> Harris
>
>
>
>
>
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel@xxxxxxxxxx
> http://lists.nongnu.org/mailman/listinfo/gluster-devel
>


-- 
Anand V. Avati





[Index of Archives]     [Gluster Users]     [Ceph Users]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Security]     [Bugtraq]     [Linux]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux