(client) reconnect attempts cause load spike

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



I just came back from dinner to find the client (patch-284, same spec file as before) pegging the machine:

13268 root      16   0  601m 527m  912 S  100 26.1 174:05.69 [glusterfs]

(the load on the machine was 35, so the 100% cpu doesn't really do it justice)

Here's the relevant portion of the server log:

2007-07-07 21:03:45 E [protocol.c:262:gf_block_unserialize_transport] libglusterfs/protocol: full_read of header failed: peer (192.168.64.186)
2007-07-07 21:03:45 C [tcp.c:81:tcp_disconnect] server: connection disconnected
2007-07-07 21:05:45 E [protocol.c:262:gf_block_unserialize_transport] libglusterfs/protocol: full_read of header failed: peer (192.168.64.186)
2007-07-07 21:05:45 C [tcp.c:81:tcp_disconnect] server: connection disconnected
2007-07-07 21:09:45 E [protocol.c:262:gf_block_unserialize_transport] libglusterfs/protocol: full_read of header failed: peer (192.168.64.186)
2007-07-07 21:09:45 C [tcp.c:81:tcp_disconnect] server: connection disconnected

(it actually started showing those from 20:33, repeating the same block once every 4 minutes or so).

The client log is probably more interesting:

2007-07-07 21:09:47 W [client-protocol.c:207:call_bail] client31: activating bail-out. pending frames = 1. last sent = 2007-07-07 21:05:47. last received = 1970-01-01 00:00:00transport-timeout = 120 2007-07-07 21:09:47 C [client-protocol.c:215:call_bail] client31: bailing transport 2007-07-07 21:09:47 D [tcp.c:131:cont_hand] tcp: forcing poll/read/write to break on blocked socket (if any) 2007-07-07 21:09:47 W [client-protocol.c:4043:client_protocol_cleanup] client31: cleaning up state in transport object 0x8690028 2007-07-07 21:09:47 W [client-protocol.c:4092:client_protocol_cleanup] client31: forced unwinding frame type(1) op(2)
2007-07-07 21:09:47 C [tcp.c:81:tcp_disconnect] client31: connection disconnected
2007-07-07 21:09:47 E [client-protocol.c:328:client_protocol_xfer] client31: transport_submit failed 2007-07-07 21:09:47 W [client-protocol.c:4012:client_protocol_reconnect] client31: attempting reconnect
2007-07-07 21:09:47 D [tcp-client.c:71:tcp_connect] client31: socket fd = 4
2007-07-07 21:09:47 D [tcp-client.c:89:tcp_connect] client31: finalized on port `1018' 2007-07-07 21:09:47 D [tcp-client.c:110:tcp_connect] client31: defaulting remote-port to 6996 2007-07-07 21:09:47 D [tcp-client.c:142:tcp_connect] client31: connect on 4 in progress (non-blocking) 2007-07-07 21:09:47 D [tcp-client.c:186:tcp_connect] client31: connection on 4 success 2007-07-07 21:09:47 D [client-protocol.c:4578:notify] client31: got GF_EVENT_CHILD_UP 2007-07-07 21:09:47 D [client-protocol.c:4341:client_protocol_handshake_reply] client31: reply frame has callid: 424242 2007-07-07 21:09:47 D [client-protocol.c:4375:client_protocol_handshake_reply] client31: SETVOLUME on remote-host succeeded 2007-07-07 21:09:48 W [client-protocol.c:4020:client_protocol_reconnect] client31: breaking reconnect chain

That block repeats 3 times (with 4 minute intervals), after which I see about 600 lines like these: 2007-07-07 21:10:02 D [inode.c:236:__destroy_inode] fuse/inode: destroy inode(221481587) 2007-07-07 21:10:02 D [inode.c:236:__destroy_inode] fuse/inode: destroy inode(114199670) 2007-07-07 21:10:02 D [inode.c:236:__destroy_inode] fuse/inode: destroy inode(221481586) 2007-07-07 21:10:02 D [inode.c:236:__destroy_inode] fuse/inode: destroy inode(221481578)

[snip 100 similar lines]

2007-07-07 21:10:02 D [client-protocol.c:1829:client_forget] client01: forget inode (221284204) 2007-07-07 21:10:02 D [inode.c:236:__destroy_inode] client01/inode: destroy inode(221284204) 2007-07-07 21:10:02 D [client-protocol.c:1829:client_forget] client03: forget inode (355043120) 2007-07-07 21:10:02 D [inode.c:236:__destroy_inode] client03/inode: destroy inode(355043120) 2007-07-07 21:10:02 D [client-protocol.c:1829:client_forget] client02: forget inode (84248694) 2007-07-07 21:10:02 D [inode.c:236:__destroy_inode] client02/inode: destroy inode(84248694) 2007-07-07 21:10:02 D [client-protocol.c:1829:client_forget] client31: forget inode (92176667) 2007-07-07 21:10:02 D [inode.c:236:__destroy_inode] client31/inode: destroy inode(92176667) 2007-07-07 21:10:02 D [client-protocol.c:1829:client_forget] ns: forget inode (221284203) 2007-07-07 21:10:02 D [inode.c:236:__destroy_inode] ns/inode: destroy inode(221284203) 2007-07-07 21:10:02 D [inode.c:236:__destroy_inode] export/inode: destroy inode(221284203)

[snip 500 similar lines]

Note that those were all in the same second.

Then 3 minutes silence, after which the "activating bail-out" sequence starts again, all the way through the forget/destroy inode block.

By now, I've killed the client, and will restart the "cp -ru" process I was running. I'll report back if anything happens.

I doubt this is enough info for you to figure this out, so let me know if there's anything I can do to help pinpoint this odd reconnection behavior.

Rhesa




[Index of Archives]     [Gluster Users]     [Ceph Users]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Security]     [Bugtraq]     [Linux]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux