(client) reconnect attempts cause load spike

Rhesa Rozendaal <gluster@xxxxxxxxx> · Sun, 08 Jul 2007 00:57:13 +0200

I just came back from dinner to find the client (patch-284, same spec file as 
before) pegging the machine:

13268 root      16   0  601m 527m  912 S  100 26.1 174:05.69 [glusterfs]

(the load on the machine was 35, so the 100% cpu doesn't really do it justice)

Here's the relevant portion of the server log:

2007-07-07 21:03:45 E [protocol.c:262:gf_block_unserialize_transport] 
libglusterfs/protocol: full_read of header failed: peer (192.168.64.186)
2007-07-07 21:03:45 C [tcp.c:81:tcp_disconnect] server: connection disconnected
2007-07-07 21:05:45 E [protocol.c:262:gf_block_unserialize_transport] 
libglusterfs/protocol: full_read of header failed: peer (192.168.64.186)
2007-07-07 21:05:45 C [tcp.c:81:tcp_disconnect] server: connection disconnected
2007-07-07 21:09:45 E [protocol.c:262:gf_block_unserialize_transport] 
libglusterfs/protocol: full_read of header failed: peer (192.168.64.186)
2007-07-07 21:09:45 C [tcp.c:81:tcp_disconnect] server: connection disconnected

(it actually started showing those from 20:33, repeating the same block once 
every 4 minutes or so).

The client log is probably more interesting:

2007-07-07 21:09:47 W [client-protocol.c:207:call_bail] client31: activating 
bail-out. pending frames = 1. last sent = 2007-07-07 21:05:47. last received = 
1970-01-01 00:00:00transport-timeout = 120
2007-07-07 21:09:47 C [client-protocol.c:215:call_bail] client31: bailing 
transport
2007-07-07 21:09:47 D [tcp.c:131:cont_hand] tcp: forcing poll/read/write to 
break on blocked socket (if any)
2007-07-07 21:09:47 W [client-protocol.c:4043:client_protocol_cleanup] 
client31: cleaning up state in transport object 0x8690028
2007-07-07 21:09:47 W [client-protocol.c:4092:client_protocol_cleanup] 
client31: forced unwinding frame type(1) op(2)
2007-07-07 21:09:47 C [tcp.c:81:tcp_disconnect] client31: connection disconnected
2007-07-07 21:09:47 E [client-protocol.c:328:client_protocol_xfer] client31: 
transport_submit failed
2007-07-07 21:09:47 W [client-protocol.c:4012:client_protocol_reconnect] 
client31: attempting reconnect
2007-07-07 21:09:47 D [tcp-client.c:71:tcp_connect] client31: socket fd = 4
2007-07-07 21:09:47 D [tcp-client.c:89:tcp_connect] client31: finalized on 
port `1018'
2007-07-07 21:09:47 D [tcp-client.c:110:tcp_connect] client31: defaulting 
remote-port to 6996
2007-07-07 21:09:47 D [tcp-client.c:142:tcp_connect] client31: connect on 4 in 
progress (non-blocking)
2007-07-07 21:09:47 D [tcp-client.c:186:tcp_connect] client31: connection on 4 
success
2007-07-07 21:09:47 D [client-protocol.c:4578:notify] client31: got 
GF_EVENT_CHILD_UP
2007-07-07 21:09:47 D [client-protocol.c:4341:client_protocol_handshake_reply] 
client31: reply frame has callid: 424242
2007-07-07 21:09:47 D [client-protocol.c:4375:client_protocol_handshake_reply] 
client31: SETVOLUME on remote-host succeeded
2007-07-07 21:09:48 W [client-protocol.c:4020:client_protocol_reconnect] 
client31: breaking reconnect chain

That block repeats 3 times (with 4 minute intervals), after which I see about 
600 lines like these:
2007-07-07 21:10:02 D [inode.c:236:__destroy_inode] fuse/inode: destroy 
inode(221481587)
2007-07-07 21:10:02 D [inode.c:236:__destroy_inode] fuse/inode: destroy 
inode(114199670)
2007-07-07 21:10:02 D [inode.c:236:__destroy_inode] fuse/inode: destroy 
inode(221481586)
2007-07-07 21:10:02 D [inode.c:236:__destroy_inode] fuse/inode: destroy 
inode(221481578)

[snip 100 similar lines]

2007-07-07 21:10:02 D [client-protocol.c:1829:client_forget] client01: forget 
inode (221284204)
2007-07-07 21:10:02 D [inode.c:236:__destroy_inode] client01/inode: destroy 
inode(221284204)
2007-07-07 21:10:02 D [client-protocol.c:1829:client_forget] client03: forget 
inode (355043120)
2007-07-07 21:10:02 D [inode.c:236:__destroy_inode] client03/inode: destroy 
inode(355043120)
2007-07-07 21:10:02 D [client-protocol.c:1829:client_forget] client02: forget 
inode (84248694)
2007-07-07 21:10:02 D [inode.c:236:__destroy_inode] client02/inode: destroy 
inode(84248694)
2007-07-07 21:10:02 D [client-protocol.c:1829:client_forget] client31: forget 
inode (92176667)
2007-07-07 21:10:02 D [inode.c:236:__destroy_inode] client31/inode: destroy 
inode(92176667)
2007-07-07 21:10:02 D [client-protocol.c:1829:client_forget] ns: forget inode 
(221284203)
2007-07-07 21:10:02 D [inode.c:236:__destroy_inode] ns/inode: destroy 
inode(221284203)
2007-07-07 21:10:02 D [inode.c:236:__destroy_inode] export/inode: destroy 
inode(221284203)

[snip 500 similar lines]

Note that those were all in the same second.

Then 3 minutes silence, after which the "activating bail-out" sequence starts 
again, all the way through the forget/destroy inode block.

By now, I've killed the client, and will restart the "cp -ru" process I was 
running. I'll report back if anything happens.

I doubt this is enough info for you to figure this out, so let me know if 
there's anything I can do to help pinpoint this odd reconnection behavior.

Rhesa