On Thu, 10 May 2007, Brent A Nelson wrote:
[May 10 18:14:18] [ERROR/common-utils.c:55/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113 (errno=115)
[May 10 18:14:18] [CRITICAL/tcp.c:81/tcp_disconnect()] transport/tcp:share4-1: connection to server disconnected
[May 10 18:14:18] [CRITICAL/client-protocol.c:218/call_bail()] client/protocol:bailing transport
[May 10 18:14:18] [ERROR/common-utils.c:55/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113 (errno=9)
[May 10 18:14:18] [CRITICAL/tcp.c:81/tcp_disconnect()] transport/tcp:share4-0: connection to server disconnected
[May 10 18:14:18] [ERROR/client-protocol.c:204/client_protocol_xfer()] protocol/client:transport_submit failed
[May 10 18:14:18] [ERROR/client-protocol.c:204/client_protocol_xfer()] protocol/client:transport_submit failed
[May 10 18:14:19] [CRITICAL/client-protocol.c:218/call_bail()] client/protocol:bailing transport
[May 10 18:14:19] [ERROR/common-utils.c:55/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113 (errno=115)
[May 10 18:14:19] [CRITICAL/tcp.c:81/tcp_disconnect()] transport/tcp:share4-0: connection to server disconnected
[May 10 18:14:19] [ERROR/client-protocol.c:204/client_protocol_xfer()] protocol/client:transport_submit failed
I've seen the "0 bytes r/w instead of 113" message plenty of times in the
past (with older GlusterFS versions), although it was apparently harmless
before. It looks like the code now considers this to be a disconnection and
tries to reconnect. For some reason, even when it does manage to reconnect, the operation still ends in an I/O error. I wonder if this relates to a previous issue I mentioned with real disconnects (a node dies or glusterfsd is restarted), where the first access after the failure (at least for ls or df) returns an error but the next attempt succeeds. It seems like an issue with
the reconnection logic (and some sort of glitch masquerading as a disconnect
in the first place)... This is probably the real problem that is triggering
the read-ahead crash (i.e., the read-ahead crash would not be triggered in my
test case if it weren't for this issue).
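For reference, the errno values in the log above decode, on Linux at least, to EINPROGRESS (115) and EBADF (9); that's just my reading of the numbers, since the log only prints the raw values. A trivial C program shows the strings:

    /* errno-decode.c: print the messages for the errno values in the log above.
     * Build with: cc -o errno-decode errno-decode.c */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
            /* 115 is EINPROGRESS on Linux, 9 is EBADF */
            printf("errno 115: %s\n", strerror(115));
            printf("errno   9: %s\n", strerror(9));
            return 0;
    }

If that reading is right, the first failure is not a dead connection at all (which would fit the "glitch masquerading as a disconnect" theory), while errno=9 suggests the fd had already been closed by the time the next write went out.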
Well, it looks like I can reproduce this behavior (though, so far, not the memory leak) on a much simpler setup, with no NFS required. I was copying my test area (several 10GB files) to a really simple GlusterFS volume (one share, no afr, no unify, glusterfsd on the same machine) when I hit the disconnect issue, after a few files had copied successfully. This looked like an issue with protocol/client and/or protocol/server, so I thought it would be a good idea to narrow things down a bit...
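A minimal spec pair for that kind of single-share setup looks roughly like this; the paths and volume names are placeholders, and I'm writing the option names from memory, so double-check them against the version you're running:

    # server spec: export one local directory over tcp
    volume brick
      type storage/posix
      option directory /export/test
    end-volume

    volume server
      type protocol/server
      option transport-type tcp/server
      subvolumes brick
      # wide open; fine for a local-only test
      option auth.ip.brick.allow *
    end-volume

    # client spec: a single protocol/client volume, no afr, no unify
    volume client
      type protocol/client
      option transport-type tcp/client
      # glusterfsd running on the same machine
      option remote-host 127.0.0.1
      option remote-subvolume brick
    end-volume

Mount the client spec on an empty directory (glusterfs -f client.vol /mnt/test, if I remember the option correctly) and copy a few large files; that exercises nothing but protocol/client talking to protocol/server over tcp.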
Thanks,
Brent