Hi,

After a month of file operations, which included copying 20 million small files and running about 20 thousand cluster jobs, I am overall satisfied except for two stability glitches.

1. A small portion (about 1%?) of jobs got a "transport endpoint not connected" error, and their output files are incomplete. This error happens on random computing nodes, and it does not affect subsequent jobs on the same node.

An example of the error message from glusterfsd is:

2008-11-19 23:09:51 E [protocol.c:271:gf_block_unserialize_transport] server: EOF from peer (172.20.102.2:1022)

The error from glusterfs is either (which looks to be caused by a brick):

2008-11-19 23:09:52 C [client-protocol.c:212:call_bail] muskie-brick: bailing transport
2008-11-19 23:09:52 E [client-protocol.c:4834:client_protocol_cleanup] muskie-brick: forced unwinding frame type(1) op(14) reply=@0x67e2150
2008-11-19 23:09:52 E [client-protocol.c:3254:client_write_cbk] muskie-brick: no proper reply from server, returning ENOTCONN
2008-11-19 23:09:56 E [write-behind.c:602:wb_writev] wb: delayed error : 107

or (caused by the namespace):

2008-11-28 20:47:53 C [client-protocol.c:212:call_bail] muskie-ns: bailing transport
2008-11-28 20:47:53 E [client-protocol.c:4834:client_protocol_cleanup] muskie-ns: forced unwinding frame type(1) op(40) reply=@0x1b447cc0
2008-11-28 20:47:53 E [client-protocol.c:4613:client_checksum_cbk] muskie-ns: no proper reply from server, returning ENOTCONN
2008-11-28 20:47:53 E [client-protocol.c:325:client_protocol_xfer] muskie-ns: transport_submit failed

2. Right now the 'glusterfs' process takes 1785M of virtual memory and 1500M resident, according to top. I hope this is not a memory leak; at the very least, there should be a way to reduce memory usage without remounting. (A rough monitoring sketch is in the P.S. below.)

If somebody can shed some light on these issues, I would appreciate it. Just let me know if you need more detailed information.

Best,
Manhong
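
P.S. For item 2, I plan to check whether the client's resident memory keeps growing or eventually levels off by sampling it periodically, with something like the rough, untested Python sketch below. It assumes a Linux /proc filesystem and that "pidof glusterfs" returns the client's PID on the node; both are assumptions on my part, not anything specific to GlusterFS.

#!/usr/bin/env python
# Rough sketch: log the glusterfs client's resident memory every 5 minutes,
# so a steadily growing VmRSS (a likely leak) can be told apart from a plateau.
import subprocess
import time

def client_rss_kb():
    # Assumption: "pidof glusterfs" returns the client process ID(s);
    # take the first one if several are running.
    pid = subprocess.check_output(["pidof", "glusterfs"]).split()[0].decode()
    with open("/proc/%s/status" % pid) as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])  # /proc reports VmRSS in kB

while True:
    print(time.strftime("%Y-%m-%d %H:%M:%S"), client_rss_kb(), "kB")
    time.sleep(300)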