On 11/9/2013 2:39 AM, Shawn Heisey wrote:
> They are from the same log file - the one that I put on my dropbox
> account and linked in the original message. They are consecutive log
> entries.

Further info from our developer that is looking deeper into these problems:

------------
Ouch. I know why the rebalance stopped. The host simply ran out of memory. From the messages file:

Nov 2 21:55:30 slc01dfs001a kernel: VFS: file-max limit 2438308 reached
Nov 2 21:55:31 slc01dfs001a kernel: automount invoked oom-killer: gfp_mask=0xd0, order=1, oom_adj=0, oom_score_adj=0
Nov 2 21:55:31 slc01dfs001a kernel: automount cpuset=/ mems_allowed=0
Nov 2 21:55:31 slc01dfs001a kernel: Pid: 2810, comm: automount Not tainted 2.6.32-358.2.1.el6.centos.plus.x86_64 #1

That "file max limit" line actually goes back to the beginning of Nov. 2, and happened on all four hosts. It is because of a file descriptor leak and was fixed in 3.3.2:

https://bugzilla.redhat.com/show_bug.cgi?id=928631

This is unconnected to the file corruption/loss which started much earlier. I'm still trying to understand this part. I noticed that three of the hosts reported successful rebalancing on the same day we started losing files. I am not sure how rebalancing was distributed among the hosts, and if the load on the other hosts was enough to keep things stable until they stopped.
------------

I gather that we should be at least on 3.3.2, but I suspect that a number of other bugs might be a problem unless we go to 3.4.1.

The rebalance status output is below. All hosts except "localhost" on this status were reading "completed" a very short time after I started the rebalance. The "localhost" line continued to increment until the rebalance died four days after starting.

[root@slc01dfs001a ~]# gluster volume rebalance mdfs status
        Node   Rebalanced-files          size       scanned      failures         status
   ---------        -----------   -----------   -----------   -----------   ------------
   localhost            1121514         1.5TB       9020514       1777661         failed
   slc01nas1                  0        0Bytes      13638699             0      completed
slc01dfs002a                  0        0Bytes      13638699             1      completed
slc01dfs001b                  0        0Bytes      13638699             0      completed
slc01dfs002b                  0        0Bytes      13638700             0      completed
   slc01nas2                  0        0Bytes      13638699             0      completed

Thanks,
Shawn
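
For anyone wanting to watch for the "file-max limit ... reached" condition quoted above before it hits the ceiling, here is a minimal sketch. It assumes a Linux host where /proc/sys/fs/file-nr holds three numbers (allocated file handles, allocated-but-unused handles, and the fs.file-max limit). The function names and the 0.8 warning threshold are made up for illustration and are not part of Gluster or any existing tool.

```python
from pathlib import Path

# Hypothetical threshold: warn once 80% of fs.file-max is in use.
WARN_FRACTION = 0.8


def parse_file_nr(text):
    """Parse the contents of /proc/sys/fs/file-nr.

    Returns (allocated, unused, maximum) as integers.
    """
    allocated, unused, maximum = (int(field) for field in text.split())
    return allocated, unused, maximum


def usage_fraction(text):
    """Fraction of the system-wide file handle limit currently in use."""
    allocated, unused, maximum = parse_file_nr(text)
    return (allocated - unused) / maximum


def check_host(path="/proc/sys/fs/file-nr"):
    """Read the live counters and report whether the host is near the limit."""
    fraction = usage_fraction(Path(path).read_text())
    return fraction >= WARN_FRACTION, fraction
```

Calling check_host() from cron on each brick server and alerting when the first returned value is true would be one way to use it. It only reports the system-wide count and says nothing about which process is leaking descriptors.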