I have the same problem with replicate + autoscaling! Disabling
autoscaling makes things better. The main cause, I think, is files that
stay open for a long time without updates... the servers simply lost
the connection to *every* client.

2009/6/2 Shehjar Tikoo <shehjart@xxxxxxxxxxx>:
>
> Hi
>
>>
>>     Also, avoid using autoscaling in io-threads for now.
>>
>>     -Shehjar
>>
>
> -Shehjar
>
> Alpha Electronics wrote:
>>
>> Thanks for looking into this. We do use io-threads. Here is the server
>> config:
>>
>> volume brick1-posix
>>   type storage/posix
>>   option directory /mnt/brick1
>> end-volume
>>
>> volume brick2-posix
>>   type storage/posix
>>   option directory /mnt/brick2
>> end-volume
>>
>> volume brick1-locks
>>   type features/locks
>>   subvolumes brick1-posix
>> end-volume
>>
>> volume brick2-locks
>>   type features/locks
>>   subvolumes brick2-posix
>> end-volume
>>
>> volume brick1
>>   type performance/io-threads
>>   option min-threads 16
>>   option autoscaling on
>>   subvolumes brick1-locks
>> end-volume
>>
>> volume brick2
>>   type performance/io-threads
>>   option min-threads 16
>>   option autoscaling on
>>   subvolumes brick2-locks
>> end-volume
>>
>> volume server
>>   type protocol/server
>>   option transport-type tcp
>>   option auth.addr.brick1.allow *
>>   option auth.addr.brick2.allow *
>>   subvolumes brick1 brick2
>> end-volume
>>
>>
>> On Sun, May 31, 2009 at 11:44 PM, Shehjar Tikoo <shehjart@xxxxxxxxxxx
>> <mailto:shehjart@xxxxxxxxxxx>> wrote:
>>
>>     Alpha Electronics wrote:
>>
>>         We are testing GlusterFS before recommending it to
>>         enterprise clients. We found that the file system always hangs
>>         after running for about 2 days. After killing the server-side
>>         process and restarting it, everything goes back to normal.
>>
>>
>>     What is the server config?
>>     If you're not using io-threads on the server, I suggest you do,
>>     because it does basic load-balancing to avoid timeouts.
>>
>>     Also, avoid using autoscaling in io-threads for now.
>>
>>     -Shehjar
>>
>>
>>         Here is the spec and error logged:
>>         GlusterFS version: v2.0.1
>>
>>         Client volume:
>>         volume brick_1
>>           type protocol/client
>>           option transport-type tcp/client
>>           option remote-port 7777 # Non-default port
>>           option remote-host server1
>>           option remote-subvolume brick
>>         end-volume
>>
>>         volume brick_2
>>           type protocol/client
>>           option transport-type tcp/client
>>           option remote-port 7777 # Non-default port
>>           option remote-host server2
>>           option remote-subvolume brick
>>         end-volume
>>
>>         volume bricks
>>           type cluster/distribute
>>           subvolumes brick_1 brick_2
>>         end-volume
>>
>>         Error logged on client side through /var/log/glusterfs.log
>>         [2009-05-29 14:58:55] E [client-protocol.c:292:call_bail]
>>         brick_1: bailing out frame LK(28) frame sent = 2009-05-29
>>         14:28:54. frame-timeout = 1800
>>         [2009-05-29 14:58:55] W [fuse-bridge.c:2284:fuse_setlk_cbk]
>>         glusterfs-fuse: 106850788: ERR => -1 (Transport endpoint is not
>>         connected)
>>         error logged on server
>>         [2009-05-29 14:59:15] E [client-protocol.c:292:call_bail]
>>         brick_2: bailing out frame LK(28) frame sent = 2009-05-29
>>         14:29:05. frame-timeout = 1800
>>         [2009-05-29 14:59:15] W [fuse-bridge.c:2284:fuse_setlk_cbk]
>>         glusterfs-fuse: 106850860: ERR => -1 (Transport endpoint is not
>>         connected)
>>
>>         There is an error message logged on the server side after 1 hour
>>         in /var/log/messages:
>>         May 29 16:04:16 server2 winbindd[3649]: [2009/05/29 16:05:16, 0]
>>         lib/util_sock.c:write_data(564)
>>         May 29 16:04:16 server2 winbindd[3649]: write_data: write
>>         failure. Error = Connection reset by peer
>>         May 29 16:04:16 server2 winbindd[3649]: [2009/05/29 16:05:16, 0]
>>         libsmb/clientgen.c:write_socket(158)
>>         May 29 16:04:16 server2 winbindd[3649]: write_socket: Error
>>         writing 104 bytes to socket 18: ERRNO = Connection reset by peer
>>         May 29 16:04:16 server2 winbindd[3649]: [2009/05/29 16:05:16, 0]
>>         libsmb/clientgen.c:cli_send_smb(188)
>>         May 29 16:04:16 server2 winbindd[3649]: Error writing 104
>>         bytes to client. -1 (Connection reset by peer)
>>         May 29 16:04:16 server2 winbindd[3649]: [2009/05/29 16:05:16, 0]
>>         libsmb/cliconnect.c:cli_session_setup_spnego(859)
>>         May 29 16:04:16 server2 winbindd[3649]: Kinit failed: Cannot
>>         contact any KDC for requested realm
>>
>>
>> ------------------------------------------------------------------------
>>
>> _______________________________________________
>> Gluster-devel mailing list
>> Gluster-devel@xxxxxxxxxx <mailto:Gluster-devel@xxxxxxxxxx>
>> http://lists.nongnu.org/mailman/listinfo/gluster-devel
>>
>

-- 
Rodrigo Azevedo Moreira da Silva

Departamento de Física
Universidade Federal de Pernambuco
http://www.df.ufpe.br