Dear All,

A replicated pair of servers in my GlusterFS 3.3.0 cluster has been experiencing extremely high load for the past few days, after a replicated brick pair became 100% full. The GlusterFS-related load on one of the servers was fluctuating at around 60, and this high load would periodically swap to the other server.

When I noticed the full bricks I quickly extended the volume by creating new bricks on another server, and manually moved some data off the full bricks to create space for write operations. The fix-layout operation seemed to start normally, but the load then increased even further. The server with the high load (by then up to about 80) became very slow to respond, and I noticed a lot of errors like the following in the VOLNAME-rebalance.log files.

[2012-10-22 00:35:52.070364] W [socket.c:1512:__socket_proto_state_machine] 0-atmos-client-10: reading from socket failed. Error (Transport endpoint is not connected), peer (192.171.166.92:24052)

[2012-10-22 00:35:52.070446] E [rpc-clnt.c:373:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xe7) [0x2b3fd905c547] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xb2) [0x2b3fd905bf42] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x2b3fd905bbfe]))) 0-atmos-client-10: forced unwinding frame type(GlusterFS 3.1) op(INODELK(29)) called at 2012-10-22 00:35:45.454529 (xid=0x285951x)

There have also been occasional errors like the following, referring to the pair of bricks that became 100% full.

[2012-10-22 01:32:52.827044] W [client3_1-fops.c:5517:client3_1_readdir] 0-atmos-client-15: (00000000-0000-0000-0000-000000000000) remote_fd is -1. EBADFD

[2012-10-22 09:49:21.103066] W [client3_1-fops.c:5628:client3_1_readdirp] 0-atmos-client-14: (00000000-0000-0000-0000-000000000000) remote_fd is -1. EBADFD

The log files from the bricks that were 100% full contain a lot of the following errors, dating from the period after I freed up some space on them.

[2012-10-22 00:40:56.246075] E [server.c:176:server_submit_reply] (-->/usr/lib64/libglusterfs.so.0(default_inodelk_cbk+0xa4) [0x361da23e84] (-->/usr/lib64/glusterfs/3.3.0/xlator/debug/io-stats.so(io_stats_inodelk_cbk+0xd8) [0x2aaaabd74d48] (-->/usr/lib64/glusterfs/3.3.0/xlator/protocol/server.so(server_inodelk_cbk+0x10b) [0x2aaaabf9742b]))) 0-: Reply submission failed

[2012-10-22 00:40:56.246117] I [server-helpers.c:629:server_connection_destroy] 0-atmos-server: destroyed connection of bdan10.nerc-essc.ac.uk-13609-2012/10/21-23:04:53:323865-atmos-client-15-0

All these errors have occurred only on the replicated pair of servers that suffered from 100% full bricks. I don't know whether the errors are being caused by the high load (resulting in poor communication with other peers, for example) or whether the high load is the result of replication and/or distribution errors.

I have tried various things to bring the load down, including unmounting the volume and stopping the fix-layout operation, but the only thing that works is stopping the volume. Obviously I can't do that for long because people need to use the data, but with the load as high as it is, data access is very slow and users are experiencing a lot of temporary I/O errors. Bricks from several volumes are on those servers, so everybody in the department is being affected by this problem.

I thought at first that the load was being caused by self-heal operations fixing errors caused by write failures that occurred when the bricks were full, but it is glusterfs threads that are causing the high load, not glustershd.

Can anyone suggest a way to bring the load down so people can access the data properly again? Also, can I trust GlusterFS to eventually self-heal the errors causing the above error messages?

Regards,
-Dan.