Rhesa, the random bail+reconnection bug has been fixed. please let us know in case you come across any issues. thanks, avati 2007/7/8, Rhesa Rozendaal <gluster@xxxxxxxxx>: > > I just came back from dinner to find the client (patch-284, same spec file > as > before) pegging the machine: > > 13268 root 16 0 601m 527m 912 S 100 26.1 174:05.69 [glusterfs] > > (the load on the machine was 35, so the 100% cpu doesn't really do it > justice) > > Here's the relevant portion of the server log: > > 2007-07-07 21:03:45 E [protocol.c:262:gf_block_unserialize_transport] > libglusterfs/protocol: full_read of header failed: peer (192.168.64.186) > 2007-07-07 21:03:45 C [tcp.c:81:tcp_disconnect] server: connection > disconnected > 2007-07-07 21:05:45 E [protocol.c:262:gf_block_unserialize_transport] > libglusterfs/protocol: full_read of header failed: peer (192.168.64.186) > 2007-07-07 21:05:45 C [tcp.c:81:tcp_disconnect] server: connection > disconnected > 2007-07-07 21:09:45 E [protocol.c:262:gf_block_unserialize_transport] > libglusterfs/protocol: full_read of header failed: peer (192.168.64.186) > 2007-07-07 21:09:45 C [tcp.c:81:tcp_disconnect] server: connection > disconnected > > (it actually started showing those from 20:33, repeating the same block > once > every 4 minutes or so). > > The client log is probably more interesting: > > 2007-07-07 21:09:47 W [client-protocol.c:207:call_bail] client31: > activating > bail-out. pending frames = 1. last sent = 2007-07-07 21:05:47. last > received = > 1970-01-01 00:00:00transport-timeout = 120 > 2007-07-07 21:09:47 C [client-protocol.c:215:call_bail] client31: bailing > transport > 2007-07-07 21:09:47 D [tcp.c:131:cont_hand] tcp: forcing poll/read/write > to > break on blocked socket (if any) > 2007-07-07 21:09:47 W [client-protocol.c:4043:client_protocol_cleanup] > client31: cleaning up state in transport object 0x8690028 > 2007-07-07 21:09:47 W [client-protocol.c:4092:client_protocol_cleanup] > client31: forced unwinding frame type(1) op(2) > 2007-07-07 21:09:47 C [tcp.c:81:tcp_disconnect] client31: connection > disconnected > 2007-07-07 21:09:47 E [client-protocol.c:328:client_protocol_xfer] > client31: > transport_submit failed > 2007-07-07 21:09:47 W [client-protocol.c:4012:client_protocol_reconnect] > client31: attempting reconnect > 2007-07-07 21:09:47 D [tcp-client.c:71:tcp_connect] client31: socket fd = > 4 > 2007-07-07 21:09:47 D [tcp-client.c:89:tcp_connect] client31: finalized on > port `1018' > 2007-07-07 21:09:47 D [tcp-client.c:110:tcp_connect] client31: defaulting > remote-port to 6996 > 2007-07-07 21:09:47 D [tcp-client.c:142:tcp_connect] client31: connect on > 4 in > progress (non-blocking) > 2007-07-07 21:09:47 D [tcp-client.c:186:tcp_connect] client31: connection > on 4 > success > 2007-07-07 21:09:47 D [client-protocol.c:4578:notify] client31: got > GF_EVENT_CHILD_UP > 2007-07-07 21:09:47 D [client-protocol.c > :4341:client_protocol_handshake_reply] > client31: reply frame has callid: 424242 > 2007-07-07 21:09:47 D [client-protocol.c > :4375:client_protocol_handshake_reply] > client31: SETVOLUME on remote-host succeeded > 2007-07-07 21:09:48 W [client-protocol.c:4020:client_protocol_reconnect] > client31: breaking reconnect chain > > That block repeats 3 times (with 4 minute intervals), after which I see > about > 600 lines like these: > 2007-07-07 21:10:02 D [inode.c:236:__destroy_inode] fuse/inode: destroy > inode(221481587) > 2007-07-07 21:10:02 D [inode.c:236:__destroy_inode] fuse/inode: destroy > inode(114199670) > 2007-07-07 21:10:02 D [inode.c:236:__destroy_inode] fuse/inode: destroy > inode(221481586) > 2007-07-07 21:10:02 D [inode.c:236:__destroy_inode] fuse/inode: destroy > inode(221481578) > > [snip 100 similar lines] > > 2007-07-07 21:10:02 D [client-protocol.c:1829:client_forget] client01: > forget > inode (221284204) > 2007-07-07 21:10:02 D [inode.c:236:__destroy_inode] client01/inode: > destroy > inode(221284204) > 2007-07-07 21:10:02 D [client-protocol.c:1829:client_forget] client03: > forget > inode (355043120) > 2007-07-07 21:10:02 D [inode.c:236:__destroy_inode] client03/inode: > destroy > inode(355043120) > 2007-07-07 21:10:02 D [client-protocol.c:1829:client_forget] client02: > forget > inode (84248694) > 2007-07-07 21:10:02 D [inode.c:236:__destroy_inode] client02/inode: > destroy > inode(84248694) > 2007-07-07 21:10:02 D [client-protocol.c:1829:client_forget] client31: > forget > inode (92176667) > 2007-07-07 21:10:02 D [inode.c:236:__destroy_inode] client31/inode: > destroy > inode(92176667) > 2007-07-07 21:10:02 D [client-protocol.c:1829:client_forget] ns: forget > inode > (221284203) > 2007-07-07 21:10:02 D [inode.c:236:__destroy_inode] ns/inode: destroy > inode(221284203) > 2007-07-07 21:10:02 D [inode.c:236:__destroy_inode] export/inode: destroy > inode(221284203) > > [snip 500 similar lines] > > Note that those were all in the same second. > > Then 3 minutes silence, after which the "activating bail-out" sequence > starts > again, all the way through the forget/destroy inode block. > > By now, I've killed the client, and will restart the "cp -ru" process I > was > running. I'll report back if anything happens. > > I doubt this is enough info for you to figure this out, so let me know if > there's anything I can do to help pinpoint this odd reconnection behavior. > > Rhesa > > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel@xxxxxxxxxx > http://lists.nongnu.org/mailman/listinfo/gluster-devel > -- Anand V. Avati