Failed rebalance - lost files, inaccessible files, permission issues

On 11/9/2013 2:39 AM, Shawn Heisey wrote:
> They are from the same log file - the one that I put on my dropbox
> account and linked in the original message.  They are consecutive log
> entries.

Further info from our developer who is looking deeper into these problems:



------------
Ouch.  I know why the rebalance stopped.  The host simply ran out of 
memory.  From the messages file:

Nov  2 21:55:30 slc01dfs001a kernel: VFS: file-max limit 2438308 reached
Nov  2 21:55:31 slc01dfs001a kernel: automount invoked oom-killer: gfp_mask=0xd0, order=1, oom_adj=0, oom_score_adj=0
Nov  2 21:55:31 slc01dfs001a kernel: automount cpuset=/ mems_allowed=0
Nov  2 21:55:31 slc01dfs001a kernel: Pid: 2810, comm: automount Not tainted 2.6.32-358.2.1.el6.centos.plus.x86_64 #1

That "file max limit" line actually goes back to the beginning of Nov. 
2, and happened on all four hosts.  It is because of a file descriptor 
leak and was fixed in 3.3.2: 
https://bugzilla.redhat.com/show_bug.cgi?id=928631

This is unconnected to the file corruption/loss, which started much 
earlier.  I'm still trying to understand that part.  I noticed that 
three of the hosts reported successful rebalancing on the same day we 
started losing files.  I am not sure how rebalancing was distributed 
among the hosts, or whether the load on the other hosts was enough to 
keep things stable until they stopped.
------------
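
For anyone else chasing the same symptom, a quick way to check whether a 
host is creeping toward the file-max ceiling uses nothing but standard 
procfs interfaces.  The loop over glusterfsd below is just one way to 
slice it:

# How close the system is to the global descriptor limit;
# /proc/sys/fs/file-nr prints: allocated, free, maximum.
cat /proc/sys/fs/file-nr

# Count descriptors held by each glusterfsd brick process.
# A count that climbs steadily and never falls back is the
# signature of the leak fixed in bug 928631.
for pid in $(pidof glusterfsd); do
    echo "$pid: $(ls /proc/$pid/fd | wc -l) open fds"
done

Raising fs.file-max with sysctl would buy some headroom as a stopgap, 
but it obviously wouldn't fix the leak itself.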



I gather that we should be at least on 3.3.2, but I suspect that a 
number of other bugs will remain a problem unless we go to 3.4.1.  The 
rebalance status output is below.  Every host except "localhost" in 
this output was reporting "completed" within a very short time after I 
started the rebalance.  The "localhost" line continued to increment 
until the rebalance died four days after it started.

[root@slc01dfs001a ~]# gluster volume rebalance mdfs status
        Node   Rebalanced-files     size    scanned   failures      status
------------   ----------------   ------   --------   --------   ---------
   localhost            1121514    1.5TB    9020514    1777661      failed
   slc01nas1                  0   0Bytes   13638699          0   completed
slc01dfs002a                  0   0Bytes   13638699          1   completed
slc01dfs001b                  0   0Bytes   13638699          0   completed
slc01dfs002b                  0   0Bytes   13638700          0   completed
   slc01nas2                  0   0Bytes   13638699          0   completed
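
If we do try the rebalance again after upgrading, something like the 
following would let us kick it off and keep an eye on the per-node 
counters without babysitting the console (the 60-second interval is an 
arbitrary choice):

# Start the rebalance, then poll its status once a minute.
gluster volume rebalance mdfs start
watch -n 60 gluster volume rebalance mdfs status

# If it clearly starts going sideways again, abort it:
gluster volume rebalance mdfs stop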

Thanks,
Shawn


