I just came back from dinner to find the client (patch-284, same spec file as
before) pegging the machine:
13268 root 16 0 601m 527m 912 S 100 26.1 174:05.69 [glusterfs]
(the load on the machine was 35, so the 100% cpu doesn't really do it justice)
Here's the relevant portion of the server log:
2007-07-07 21:03:45 E [protocol.c:262:gf_block_unserialize_transport]
libglusterfs/protocol: full_read of header failed: peer (192.168.64.186)
2007-07-07 21:03:45 C [tcp.c:81:tcp_disconnect] server: connection disconnected
2007-07-07 21:05:45 E [protocol.c:262:gf_block_unserialize_transport]
libglusterfs/protocol: full_read of header failed: peer (192.168.64.186)
2007-07-07 21:05:45 C [tcp.c:81:tcp_disconnect] server: connection disconnected
2007-07-07 21:09:45 E [protocol.c:262:gf_block_unserialize_transport]
libglusterfs/protocol: full_read of header failed: peer (192.168.64.186)
2007-07-07 21:09:45 C [tcp.c:81:tcp_disconnect] server: connection disconnected
(it actually started showing those from 20:33, repeating the same block once
every 4 minutes or so).
The client log is probably more interesting:
2007-07-07 21:09:47 W [client-protocol.c:207:call_bail] client31: activating
bail-out. pending frames = 1. last sent = 2007-07-07 21:05:47. last received =
1970-01-01 00:00:00transport-timeout = 120
2007-07-07 21:09:47 C [client-protocol.c:215:call_bail] client31: bailing
transport
2007-07-07 21:09:47 D [tcp.c:131:cont_hand] tcp: forcing poll/read/write to
break on blocked socket (if any)
2007-07-07 21:09:47 W [client-protocol.c:4043:client_protocol_cleanup]
client31: cleaning up state in transport object 0x8690028
2007-07-07 21:09:47 W [client-protocol.c:4092:client_protocol_cleanup]
client31: forced unwinding frame type(1) op(2)
2007-07-07 21:09:47 C [tcp.c:81:tcp_disconnect] client31: connection disconnected
2007-07-07 21:09:47 E [client-protocol.c:328:client_protocol_xfer] client31:
transport_submit failed
2007-07-07 21:09:47 W [client-protocol.c:4012:client_protocol_reconnect]
client31: attempting reconnect
2007-07-07 21:09:47 D [tcp-client.c:71:tcp_connect] client31: socket fd = 4
2007-07-07 21:09:47 D [tcp-client.c:89:tcp_connect] client31: finalized on
port `1018'
2007-07-07 21:09:47 D [tcp-client.c:110:tcp_connect] client31: defaulting
remote-port to 6996
2007-07-07 21:09:47 D [tcp-client.c:142:tcp_connect] client31: connect on 4 in
progress (non-blocking)
2007-07-07 21:09:47 D [tcp-client.c:186:tcp_connect] client31: connection on 4
success
2007-07-07 21:09:47 D [client-protocol.c:4578:notify] client31: got
GF_EVENT_CHILD_UP
2007-07-07 21:09:47 D [client-protocol.c:4341:client_protocol_handshake_reply]
client31: reply frame has callid: 424242
2007-07-07 21:09:47 D [client-protocol.c:4375:client_protocol_handshake_reply]
client31: SETVOLUME on remote-host succeeded
2007-07-07 21:09:48 W [client-protocol.c:4020:client_protocol_reconnect]
client31: breaking reconnect chain
That block repeats 3 times (with 4 minute intervals), after which I see about
600 lines like these:
2007-07-07 21:10:02 D [inode.c:236:__destroy_inode] fuse/inode: destroy
inode(221481587)
2007-07-07 21:10:02 D [inode.c:236:__destroy_inode] fuse/inode: destroy
inode(114199670)
2007-07-07 21:10:02 D [inode.c:236:__destroy_inode] fuse/inode: destroy
inode(221481586)
2007-07-07 21:10:02 D [inode.c:236:__destroy_inode] fuse/inode: destroy
inode(221481578)
[snip 100 similar lines]
2007-07-07 21:10:02 D [client-protocol.c:1829:client_forget] client01: forget
inode (221284204)
2007-07-07 21:10:02 D [inode.c:236:__destroy_inode] client01/inode: destroy
inode(221284204)
2007-07-07 21:10:02 D [client-protocol.c:1829:client_forget] client03: forget
inode (355043120)
2007-07-07 21:10:02 D [inode.c:236:__destroy_inode] client03/inode: destroy
inode(355043120)
2007-07-07 21:10:02 D [client-protocol.c:1829:client_forget] client02: forget
inode (84248694)
2007-07-07 21:10:02 D [inode.c:236:__destroy_inode] client02/inode: destroy
inode(84248694)
2007-07-07 21:10:02 D [client-protocol.c:1829:client_forget] client31: forget
inode (92176667)
2007-07-07 21:10:02 D [inode.c:236:__destroy_inode] client31/inode: destroy
inode(92176667)
2007-07-07 21:10:02 D [client-protocol.c:1829:client_forget] ns: forget inode
(221284203)
2007-07-07 21:10:02 D [inode.c:236:__destroy_inode] ns/inode: destroy
inode(221284203)
2007-07-07 21:10:02 D [inode.c:236:__destroy_inode] export/inode: destroy
inode(221284203)
[snip 500 similar lines]
Note that those were all in the same second.
Then 3 minutes silence, after which the "activating bail-out" sequence starts
again, all the way through the forget/destroy inode block.
By now, I've killed the client, and will restart the "cp -ru" process I was
running. I'll report back if anything happens.
I doubt this is enough info for you to figure this out, so let me know if
there's anything I can do to help pinpoint this odd reconnection behavior.
Rhesa