Re: Skipped files during rebalance

Dear Rafi,

Thanks for submitting a patch.

@DHT, I have two additional questions / problems.

1. During a rebalance (with data), RAM consumption on the nodes climbs dramatically: out of 196 GB available per node, RAM usage fills up to 195.6 GB. This seems excessive and strange to me.

2. As you can see, the rebalance (with data) failed because one endpoint was reported as not connected (even though it is still connected). Could this be due to the high RAM usage?

Thank you for your help,

Christophe

Dr Christophe Trefois, Dipl.-Ing.  
Technical Specialist / Post-Doc

UNIVERSITÉ DU LUXEMBOURG

LUXEMBOURG CENTRE FOR SYSTEMS BIOMEDICINE
Campus Belval | House of Biomedicine  
6, avenue du Swing 
L-4367 Belvaux  

T: +352 46 66 44 6124 
F: +352 46 66 44 6949  
http://www.uni.lu/lcsb


----
This message is confidential and may contain privileged information. 
It is intended for the named recipient only. 
If you receive it in error please notify me and permanently delete the original message and any copies. 
----

  

On 17 Aug 2015, at 11:27, Mohammed Rafi K C <rkavunga@xxxxxxxxxx> wrote:



On 08/17/2015 01:58 AM, Christophe TREFOIS wrote:
Dear all,
 
I have successfully added a new node to our setup, and finally managed to get a successful fix-layout run as well with no errors.
 
Now, as per the documentation, I started a gluster volume rebalance live start task and I see many skipped files. 
The error log then contains entries like the following for each skipped file.
 
[2015-08-16 20:23:30.591161] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/004010008.flex lookup failed
[2015-08-16 20:23:30.768391] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/007005003.flex lookup failed
[2015-08-16 20:23:30.804811] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/006005009.flex lookup failed
[2015-08-16 20:23:30.805201] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/005006011.flex lookup failed
[2015-08-16 20:23:30.880037] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/005009012.flex lookup failed
[2015-08-16 20:23:31.038236] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/003008007.flex lookup failed
[2015-08-16 20:23:31.259762] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/004008006.flex lookup failed
[2015-08-16 20:23:31.333764] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/007008001.flex lookup failed
[2015-08-16 20:23:31.340190] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/006007004.flex lookup failed
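
With this many entries, it can help to tally the skipped files per directory and failure reason rather than reading them one by one. The following is a minimal Python sketch, not part of GlusterFS: the function name summarize_failures is hypothetical, and it assumes each log entry sits on a single line (entries wrapped by a terminal or mail client need to be rejoined first).

```python
import posixpath
import re
from collections import Counter

# Matches the dht-rebalance "Migrate file failed" entries (MSGID 109023).
# Paths may contain spaces and parentheses, so the failure reason is
# anchored at the end of the line instead of splitting on whitespace.
PATTERN = re.compile(
    r"\[MSGID: 109023\].*Migrate file failed:\s*(?P<path>.+?):? "
    r"(?P<reason>lookup failed|failed to migrate data)\s*$"
)


def summarize_failures(lines):
    """Count failed migrations per (directory, reason) pair."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match is None:
            continue  # not a migrate-failure entry
        # Some entries log only the file name, without a directory.
        directory = posixpath.dirname(match.group("path")) or "(no directory logged)"
        counts[(directory, match.group("reason"))] += 1
    return counts
```

Feeding it the rebalance log, for example summarize_failures(open(path_to_log)), where path_to_log is whatever log file the entries above came from, returns a Counter whose most_common() method lists the worst-affected directories first.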
 
Update: one of the rebalance tasks now failed.
 
@Rafi, I got the same error as Friday except this time with data.

Packets carrying the ping request could be waiting in the queue for the whole time-out period because of heavy traffic on the network. I have sent a patch for this. You can track its status here: http://review.gluster.org/11935


 
[2015-08-16 20:24:34.533167] C [rpc-clnt-ping.c:161:rpc_clnt_ping_timer_expired] 0-live-client-0: server 192.168.123.104:49164 has not responded in the last 42 seconds, disconnecting.
[2015-08-16 20:24:34.533614] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(INODELK(29)) called at 2015-08-16 20:23:51.305640 (xid=0x5dd4da)
[2015-08-16 20:24:34.533672] E [MSGID: 114031] [client-rpc-fops.c:1621:client3_3_inodelk_cbk] 0-live-client-0: remote operation failed [Transport endpoint is not connected]
[2015-08-16 20:24:34.534201] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(READ(12)) called at 2015-08-16 20:23:51.303938 (xid=0x5dd4d7)
[2015-08-16 20:24:34.534347] E [MSGID: 109023] [dht-rebalance.c:1124:dht_migrate_file] 0-live-dht: Migrate file failed: /hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/007008007.flex: failed to migrate data
[2015-08-16 20:24:34.534413] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(READ(12)) called at 2015-08-16 20:23:51.303969 (xid=0x5dd4d8)
[2015-08-16 20:24:34.534579] E [MSGID: 109023] [dht-rebalance.c:1124:dht_migrate_file] 0-live-dht: Migrate file failed: /hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/007009012.flex: failed to migrate data
[2015-08-16 20:24:34.534676] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(READ(12)) called at 2015-08-16 20:23:51.313548 (xid=0x5dd4db)
[2015-08-16 20:24:34.534745] E [MSGID: 109023] [dht-rebalance.c:1124:dht_migrate_file] 0-live-dht: Migrate file failed: /hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/006008011.flex: failed to migrate data
[2015-08-16 20:24:34.535199] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(READ(12)) called at 2015-08-16 20:23:51.326369 (xid=0x5dd4dc)
[2015-08-16 20:24:34.535232] E [MSGID: 109023] [dht-rebalance.c:1124:dht_migrate_file] 0-live-dht: Migrate file failed: /hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/005003001.flex: failed to migrate data
[2015-08-16 20:24:34.535984] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(READ(12)) called at 2015-08-16 20:23:51.326437 (xid=0x5dd4dd)
[2015-08-16 20:24:34.536069] E [MSGID: 109023] [dht-rebalance.c:1124:dht_migrate_file] 0-live-dht: Migrate file failed: /hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/007010012.flex: failed to migrate data
[2015-08-16 20:24:34.536267] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2015-08-16 20:23:51.337240 (xid=0x5dd4de)
[2015-08-16 20:24:34.536339] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_08(2013-10-11_20-12-25)/002005012.flex lookup failed
[2015-08-16 20:24:34.536487] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2015-08-16 20:23:51.425254 (xid=0x5dd4df)
[2015-08-16 20:24:34.536685] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2015-08-16 20:23:51.738907 (xid=0x5dd4e0)
[2015-08-16 20:24:34.536891] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2015-08-16 20:23:51.805096 (xid=0x5dd4e1)
[2015-08-16 20:24:34.537316] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2015-08-16 20:23:51.805977 (xid=0x5dd4e2)
[2015-08-16 20:24:34.537735] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) 0-live-client-0: forced unwinding frame type(GF-DUMP) op(NULL(2)) called at 2015-08-16 20:23:52.530107 (xid=0x5dd4e3)
[2015-08-16 20:24:34.538475] E [MSGID: 114031] [client-rpc-fops.c:1621:client3_3_inodelk_cbk] 0-live-client-0: remote operation failed [Transport endpoint is not connected]
The message "E [MSGID: 114031] [client-rpc-fops.c:1621:client3_3_inodelk_cbk] 0-live-client-0: remote operation failed [Transport endpoint is not connected]" repeated 4 times between [2015-08-16 20:24:34.538475] and [2015-08-16 20:24:34.538535]
[2015-08-16 20:24:34.538584] E [MSGID: 109023] [dht-rebalance.c:1617:gf_defrag_migrate_single_file] 0-live-dht: Migrate file failed: 002004003.flex lookup failed
[2015-08-16 20:24:34.538904] E [MSGID: 109023] [dht-rebalance.c:1617:gf_defrag_migrate_single_file] 0-live-dht: Migrate file failed: 003009008.flex lookup failed
[2015-08-16 20:24:34.539724] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_08(2013-10-11_20-12-25)/005009006.flex lookup failed
[2015-08-16 20:24:34.539820] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_08(2013-10-11_20-12-25)
[2015-08-16 20:24:34.540031] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1
[2015-08-16 20:24:34.540691] E [MSGID: 114031] [client-rpc-fops.c:251:client3_3_mknod_cbk] 0-live-client-0: remote operation failed. Path: /hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/002005008.flex [Transport endpoint is not connected]
[2015-08-16 20:24:34.541152] E [MSGID: 114031] [client-rpc-fops.c:251:client3_3_mknod_cbk] 0-live-client-0: remote operation failed. Path: /hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/005004009.flex [Transport endpoint is not connected]
[2015-08-16 20:24:34.541331] E [MSGID: 114031] [client-rpc-fops.c:251:client3_3_mknod_cbk] 0-live-client-0: remote operation failed. Path: /hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/007005011.flex [Transport endpoint is not connected]
[2015-08-16 20:24:34.541486] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /hcs/hcs/OperaArchiveCol
[2015-08-16 20:24:34.541572] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /hcs/hcs
[2015-08-16 20:24:34.541639] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /hcs
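
The timestamps in the "forced unwinding frame" entries above allow a rough check of how long each request had been outstanding when the connection was dropped, which can be compared with the 42 seconds named in the disconnect message. The following is a minimal Python sketch under stated assumptions: the function name frame_waits is hypothetical, each log entry is assumed to sit on a single line, and both timestamps are assumed to come from the same clock.

```python
import re
from datetime import datetime

# Captures the time the entry was logged, the operation name, the time the
# request was originally issued ("called at"), and the transaction id.
FRAME = re.compile(
    r"^\[(?P<logged>[\d-]+ [\d:.]+)\].*saved_frames_unwind.*"
    r"op\((?P<op>\w+)\(\d+\)\) called at (?P<called>[\d-]+ [\d:.]+) "
    r"\(xid=(?P<xid>0x[0-9a-f]+)\)"
)
TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def frame_waits(lines):
    """Yield (op, xid, seconds outstanding) for each force-unwound frame."""
    for line in lines:
        match = FRAME.search(line)
        if match is None:
            continue  # not a saved_frames_unwind entry
        logged = datetime.strptime(match.group("logged"), TIME_FORMAT)
        called = datetime.strptime(match.group("called"), TIME_FORMAT)
        yield match.group("op"), match.group("xid"), (logged - called).total_seconds()
```

For the first unwound frame above (INODELK, xid=0x5dd4da), logged at 20:24:34.533614 and called at 20:23:51.305640, the difference is about 43.2 seconds, so the request had been pending for roughly the whole time-out period.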
 
Any help would be greatly appreciated.
CCing the DHT team to give you a better idea of why the rebalance failed and of the huge memory consumption by the rebalance process (200 GB RAM).

Regards
Rafi KC



 
Thanks,
 
--
Christophe


  
 



_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel
