Failed rebalance - lost files, inaccessible files, permission issues

gluster at elyograg.org (Shawn Heisey) · Tue, 26 Nov 2013 15:51:40 -0700

Here's what our own developer had to say about this:

On 11/8/2013 8:23 PM, Shawn Heisey wrote:
> When I looked at the individual cases of lost or corrupted files, one
> thing kept staring at me in the face until I recognized it:
>
> [2013-11-02 03:56:36.472170] I [dht-rebalance.c:647:dht_migrate_file]
> 0-mdfs-dht: /REDACTED/mdfs/AKG/akgphotos/docs/000/002: attempting to
> move from mdfs-replicate-2 to mdfs-replicate-12
> [2013-11-02 03:56:36.472186] I [dht-rebalance.c:647:dht_migrate_file]
> 0-mdfs-dht: /REDACTED/mdfs/AKG/akgphotos/docs/000/002: attempting to
> move from mdfs-replicate-2 to mdfs-replicate-12
> [2013-11-02 03:56:36.480567] I [dht-rebalance.c:647:dht_migrate_file]
> 0-mdfs-dht: /REDACTED/mdfs/AKG/akgphotos/docs/000/002: attempting to
> move from mdfs-replicate-2 to mdfs-replicate-12
>
> Three simultaneous processes on the same file!  Of course that would
> have undefined results, and be the cause of all our problems.  NFS may
> not be related after all.
>
> Tediously scrolling through the error log I found mostly errors where it
> refused to copy files from a more empty brick to a fuller brick, which
> makes perfect sense.  The wheels started falling off about 26 hours into
> the rebalance.
>
> [2013-10-29 23:13:17.193108] C
> [client-handshake.c:126:rpc_client_ping_timer_expired] 0-mdfs-client-1:
> server 10.116.0.22:24025 has not responded in the last 42 seconds,
> disconnecting.
> [2013-10-29 23:13:17.200616] E [rpc-clnt.c:373:saved_frames_unwind]
> (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x78) [0x36de60f808]
> (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xb0)
> [0x36de60f4c0] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)
> [0x36de60ef2e]))) 0-mdfs-client-1: forced unwinding frame type(GlusterFS
> 3.1) op(STAT(1)) called at 2013-10-29 23:12:20.641732 (xid=0x1763733x)
> [2013-10-29 23:13:17.200659] W [client3_1-fops.c:474:client3_1_stat_cbk]
> 0-mdfs-client-1: remote operation failed: Transport endpoint is not
> connected
> [2013-10-29 23:13:17.200756] E [rpc-clnt.c:373:saved_frames_unwind]
> (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x78) [0x36de60f808]
> (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xb0)
> [0x36de60f4c0] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)
> [0x36de60ef2e]))) 0-mdfs-client-1: forced unwinding frame type(GlusterFS
> Handshake) op(PING(3)) called at 2013-10-29 23:12:35.184481
> (xid=0x1763734x)
> [2013-10-29 23:13:17.200792] W [client-handshake.c:275:client_ping_cbk]
> 0-mdfs-client-1: timer must have expired
> [2013-10-29 23:13:17.217477] I [client.c:2090:client_rpc_notify]
> 0-mdfs-client-1: disconnected
> .
> .
> .
> [2013-10-29 23:13:28.198386] I
> [client-handshake.c:1636:select_server_supported_programs]
> 0-mdfs-client-1: Using Program GlusterFS 3.3.1, Num (1298437), Version
> (330)
> [2013-10-29 23:13:28.198803] I
> [client-handshake.c:1433:client_setvolume_cbk] 0-mdfs-client-1:
> Connected to 10.116.0.22:24025, attached to remote volume
> '/bricks/d00v00/mdfs'.
> [2013-10-29 23:13:28.198826] I
> [client-handshake.c:1445:client_setvolume_cbk] 0-mdfs-client-1: Server
> and Client lk-version numbers are not same, reopening the fds
> [2013-10-29 23:13:28.205629] I
> [client-handshake.c:1282:client_post_handshake] 0-mdfs-client-1: 42745
> fds open - Delaying child_up until they are re-opened
> [2013-10-29 23:13:28.205782] E [inode.c:1090:__inode_path]
> (-->/usr/lib64/glusterfs/3.3.1/xlator/protocol/client.so(client_post_handshake+0x110)
> [0x7fb74bdf1c90]
> (-->/usr/lib64/glusterfs/3.3.1/xlator/protocol/client.so(protocol_client_reopendir+0x107)
> [0x7fb74bdf1907] (-->/usr/lib64/libglusterfs.so.0(inode_path+0x4a)
> [0x36dde2ca1a]))) 0-: Assertion failed: 0
> [2013-10-29 23:13:28.205852] W [inode.c:1091:__inode_path]
> (-->/usr/lib64/glusterfs/3.3.1/xlator/protocol/client.so(client_post_handshake+0x110)
> [0x7fb74bdf1c90]
> (-->/usr/lib64/glusterfs/3.3.1/xlator/protocol/client.so(protocol_client_reopendir+0x107)
> [0x7fb74bdf1907] (-->/usr/lib64/libglusterfs.so.0(inode_path+0x4a)
> [0x36dde2ca1a]))) 0-mdfs-client-1: invalid inode
> [2013-10-29 23:13:28.205881] W
> [client-handshake.c:1108:protocol_client_reopendir] 0-mdfs-client-1:
> couldn't build path from inode 00000000-0000-0000-0000-000000000000
> [2013-10-29 23:13:28.205944] E [inode.c:1090:__inode_path]
> (-->/usr/lib64/glusterfs/3.3.1/xlator/protocol/client.so(client_post_handshake+0x110)
> [0x7fb74bdf1c90]
> (-->/usr/lib64/glusterfs/3.3.1/xlator/protocol/client.so(protocol_client_reopendir+0x107)
> [0x7fb74bdf1907] (-->/usr/lib64/libglusterfs.so.0(inode_path+0x4a)
> [0x36dde2ca1a]))) 0-: Assertion failed: 0
> [2013-10-29 23:13:28.205992] W [inode.c:1091:__inode_path]
> (-->/usr/lib64/glusterfs/3.3.1/xlator/protocol/client.so(client_post_handshake+0x110)
> [0x7fb74bdf1c90]
> (-->/usr/lib64/glusterfs/3.3.1/xlator/protocol/client.so(protocol_client_reopendir+0x107)
> [0x7fb74bdf1907] (-->/usr/lib64/libglusterfs.so.0(inode_path+0x4a)
> [0x36dde2ca1a]))) 0-mdfs-client-1: invalid inode
> [2013-10-29 23:13:28.206009] W
> [client-handshake.c:1108:protocol_client_reopendir] 0-mdfs-client-1:
> couldn't build path from inode 00000000-0000-0000-0000-000000000000
>
> The assertion failures actually go on for thousands of lines even though
> only 3 seconds have elapsed.  Eventually it overcame whatever that
> handshake error means.  Not long after:
>
> [2013-10-29 23:13:49.611069] I [dht-rebalance.c:647:dht_migrate_file]
> 0-mdfs-dht: /REDACTED/mdfs/KPA/kpacontentminepix/docs/008/058:
> attempting to move from mdfs-replicate-1 to mdfs-replicate-6
> [2013-10-29 23:13:49.611582] I [dht-rebalance.c:647:dht_migrate_file]
> 0-mdfs-dht: /REDACTED/mdfs/KPA/kpacontentminepix/docs/008/058:
> attempting to move from mdfs-replicate-1 to mdfs-replicate-6
>
> and there's our first lost file.  From that point on, we were trashing
> files with simultaneous copy processes.  I think a process hung up, and
> a new one was spawned, and when the first process woke back up it didn't
> know enough to stop.

I duplicated our production setup as closely as possible on a testbed. 
There are some differences, mostly due to the fact that the testbed is 
comprised of inexpensive desktop-class machines with nowhere near as 
much CPU, RAM, or disk resources as the server-class production hardware.

The rebalance ultimately failed because it was killed by the linux OOM 
killer, similar to what happened in production.  It was due to actual 
memory usage, rather than hitting the system-wide max open file 
descriptor limit like it did on production.

It does not appear that I have any lost files, permission errors, or 
inaccessible files on the testbed.  I did not see the critical error 
showing the ping-timeout, though.

I cannot account for the ping timeout that the production system saw, 
which we suspect caused bug 859387 to surface.  There are no kernel log 
messages on any host aside from the gluster ping-timeout itself, no 
errors on the NIC statistics, no messages in the switch logs, and no 
errors on the switch interfaces.  On 10.100.116.22, the host that the 
critical log entry says timed out, the logfile 
etc-glusterfs-glusterd.vol.log-20131103 shows activity every few seconds 
throughout the entire 42 second period.

https://bugzilla.redhat.com/show_bug.cgi?id=859387

I would like to recreate the circumstances of the bug on the testbed and 
see if I have similar problems to what I did in production, but the bug 
gives no details for reproducing, other than bringing a brick down and 
then bringing it back up.  It doesn't say HOW to bring a brick down. 
Would sending a SIGSTOP to the brick process do it, followed by a 
SIGCONT after an interval greater than 42 seconds?

Thanks,
Shawn