Dear Susant, Apparently the glusterd process was stuck in a strange state, so we restarted the glusterd process on stor106. This allowed us to stop the volume and reboot. I will start a new rebalance now, and will gather the information you asked for during the rebalance operation. I think it makes more sense to post the logs of this new rebalance operation. Kind regards, — Christophe > On 19 Aug 2015, at 08:49, Susant Palai <spalai@xxxxxxxxxx> wrote: > > Hi Christophe, > Forgot to ask you to post the rebalance and glusterd logs. > > Regards, > Susant > > > ----- Original Message ----- >> From: "Susant Palai" <spalai@xxxxxxxxxx> >> To: "Christophe TREFOIS" <christophe.trefois@xxxxxx> >> Cc: "Gluster Devel" <gluster-devel@xxxxxxxxxxx> >> Sent: Wednesday, August 19, 2015 11:44:35 AM >> Subject: Re: Skipped files during rebalance >> >> Comments inline. >> >> ----- Original Message ----- >>> From: "Christophe TREFOIS" <christophe.trefois@xxxxxx> >>> To: "Susant Palai" <spalai@xxxxxxxxxx> >>> Cc: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>, "Nithya Balachandran" >>> <nbalacha@xxxxxxxxxx>, "Shyamsundar Ranganathan" <srangana@xxxxxxxxxx>, "Mohammed Rafi K C" >>> <rkavunga@xxxxxxxxxx>, "Gluster Devel" >>> <gluster-devel@xxxxxxxxxxx> >>> Sent: Tuesday, August 18, 2015 8:08:41 PM >>> Subject: Re: Skipped files during rebalance >>> >>> Hi Susant, >>> >>> Thank you for the response. >>> >>>> On 18 Aug 2015, at 10:45, Susant Palai <spalai@xxxxxxxxxx> wrote: >>>> >>>> Hi Christophe, >>>> >>>> Need some info regarding the high mem-usage. >>>> >>>> 1. Top output: To see whether any other process is eating up memory. >> >> I would be interested to know the memory usage of all the gluster processes relating to the high mem-usage. These processes include glusterfsd, glusterd, gluster, any mount process (glusterfs), and rebalance (glusterfs). >> >> >>>> 2. Gluster volume info >>> >>> [root@highlander ~]# gluster volume info >>> >>> Volume Name: live >>> Type: Distribute >>> Volume ID: 1328637d-7730-4627-8945-bbe43626d527 >>> Status: Started >>> Number of Bricks: 9 >>> Transport-type: tcp >>> Bricks: >>> Brick1: stor104:/zfs/brick0/brick >>> Brick2: stor104:/zfs/brick1/brick >>> Brick3: stor104:/zfs/brick2/brick >>> Brick4: stor106:/zfs/brick0/brick >>> Brick5: stor106:/zfs/brick1/brick >>> Brick6: stor106:/zfs/brick2/brick >>> Brick7: stor105:/zfs/brick0/brick >>> Brick8: stor105:/zfs/brick1/brick >>> Brick9: stor105:/zfs/brick2/brick >>> Options Reconfigured: >>> diagnostics.count-fop-hits: on >>> diagnostics.latency-measurement: on >>> server.allow-insecure: on >>> cluster.min-free-disk: 1% >>> diagnostics.brick-log-level: ERROR >>> diagnostics.client-log-level: ERROR >>> cluster.data-self-heal-algorithm: full >>> performance.cache-max-file-size: 4MB >>> performance.cache-refresh-timeout: 60 >>> performance.cache-size: 1GB >>> performance.client-io-threads: on >>> performance.io-thread-count: 32 >>> performance.write-behind-window-size: 4MB >>> >>>> 3. Is the rebalance process still running? If yes, can you point to the specific mem usage by the rebalance process? Was the high mem-usage seen during the rebalance, or even post rebalance? >>> >>> I would like to restart the rebalance process since it failed… But I can’t, as the volume cannot be stopped (I wanted to reboot the servers to have a clean testing ground). >>> >>> Here are the logs from the three nodes: >>> http://paste.fedoraproject.org/256183/43989079 >>> >>> Maybe you could help me figure out how to stop the volume? 
>>> >>> This is what happens >>> >>> [root@highlander ~]# gluster volume rebalance live stop >>> volume rebalance: live: failed: Rebalance not started. >> >> Requesting the glusterd team to give input. >>> >>> [root@highlander ~]# ssh stor105 "gluster volume rebalance live stop" >>> volume rebalance: live: failed: Rebalance not started. >>> >>> [root@highlander ~]# ssh stor104 "gluster volume rebalance live stop" >>> volume rebalance: live: failed: Rebalance not started. >>> >>> [root@highlander ~]# ssh stor106 "gluster volume rebalance live stop" >>> volume rebalance: live: failed: Rebalance not started. >>> >>> [root@highlander ~]# gluster volume rebalance live stop >>> volume rebalance: live: failed: Rebalance not started. >>> >>> [root@highlander ~]# gluster volume stop live >>> Stopping volume will make its data inaccessible. Do you want to continue? >>> (y/n) y >>> volume stop: live: failed: Staging failed on stor106. Error: rebalance >>> session is in progress for the volume 'live' >>> Staging failed on stor104. Error: rebalance session is in progress for the >>> volume 'live' >> Can you run [ps aux | grep "rebalance"] on all the servers and post the output here? >> Just want to check whether rebalance is really running or not. Again, requesting the glusterd team to give input. >> >>> >>> >>>> 4. Gluster version >>> >>> [root@highlander ~]# pdsh -g live 'rpm -qa | grep gluster' >>> stor104: glusterfs-api-3.7.3-1.el7.x86_64 >>> stor104: glusterfs-server-3.7.3-1.el7.x86_64 >>> stor104: glusterfs-libs-3.7.3-1.el7.x86_64 >>> stor104: glusterfs-3.7.3-1.el7.x86_64 >>> stor104: glusterfs-fuse-3.7.3-1.el7.x86_64 >>> stor104: glusterfs-client-xlators-3.7.3-1.el7.x86_64 >>> stor104: glusterfs-cli-3.7.3-1.el7.x86_64 >>> >>> stor105: glusterfs-3.7.3-1.el7.x86_64 >>> stor105: glusterfs-client-xlators-3.7.3-1.el7.x86_64 >>> stor105: glusterfs-api-3.7.3-1.el7.x86_64 >>> stor105: glusterfs-cli-3.7.3-1.el7.x86_64 >>> stor105: glusterfs-server-3.7.3-1.el7.x86_64 >>> stor105: glusterfs-libs-3.7.3-1.el7.x86_64 >>> stor105: glusterfs-fuse-3.7.3-1.el7.x86_64 >>> >>> stor106: glusterfs-libs-3.7.3-1.el7.x86_64 >>> stor106: glusterfs-fuse-3.7.3-1.el7.x86_64 >>> stor106: glusterfs-client-xlators-3.7.3-1.el7.x86_64 >>> stor106: glusterfs-api-3.7.3-1.el7.x86_64 >>> stor106: glusterfs-cli-3.7.3-1.el7.x86_64 >>> stor106: glusterfs-server-3.7.3-1.el7.x86_64 >>> stor106: glusterfs-3.7.3-1.el7.x86_64 >>>> >>>> Will ask for more information if needed. >>>> >>>> Regards, >>>> Susant >>>> >>>> >>>> ----- Original Message ----- >>>>> From: "Christophe TREFOIS" <christophe.trefois@xxxxxx> >>>>> To: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>, "Nithya Balachandran" >>>>> <nbalacha@xxxxxxxxxx>, "Susant Palai" >>>>> <spalai@xxxxxxxxxx>, "Shyamsundar Ranganathan" <srangana@xxxxxxxxxx> >>>>> Cc: "Mohammed Rafi K C" <rkavunga@xxxxxxxxxx> >>>>> Sent: Monday, 17 August, 2015 7:03:20 PM >>>>> Subject: Fwd: Skipped files during rebalance >>>>> >>>>> Hi DHT team, >>>>> >>>>> This email somehow didn’t get forwarded to you. 
>>>>> >>>>> In addition to my problem described below, here is one example of free memory after everything failed >>>>> >>>>> [root@highlander ~]# pdsh -g live 'free -m' >>>>> stor106: total used free shared buff/cache available >>>>> stor106: Mem: 193249 124784 1347 9 67118 12769 >>>>> stor106: Swap: 0 0 0 >>>>> stor104: total used free shared buff/cache available >>>>> stor104: Mem: 193249 107617 31323 9 54308 42752 >>>>> stor104: Swap: 0 0 0 >>>>> stor105: total used free shared buff/cache available >>>>> stor105: Mem: 193248 141804 6736 9 44707 9713 >>>>> stor105: Swap: 0 0 0 >>>>> >>>>> So after the failed operation there’s almost no free memory, and it is also not being freed up. >>>>> >>>>> Thank you for pointing me in any direction, >>>>> >>>>> Kind regards, >>>>> >>>>> — >>>>> Christophe >>>>> >>>>> >>>>> Begin forwarded message: >>>>> >>>>> From: Christophe TREFOIS >>>>> <christophe.trefois@xxxxxx> >>>>> Subject: Re: Skipped files during rebalance >>>>> Date: 17 Aug 2015 11:54:32 CEST >>>>> To: Mohammed Rafi K C <rkavunga@xxxxxxxxxx> >>>>> Cc: "gluster-devel@xxxxxxxxxxx" >>>>> <gluster-devel@xxxxxxxxxxx> >>>>> >>>>> Dear Rafi, >>>>> >>>>> Thanks for submitting a patch. >>>>> >>>>> @DHT, I have two additional questions / problems. >>>>> >>>>> 1. When doing a rebalance (with data), RAM consumption on the nodes goes dramatically high, e.g. out of 196 GB available per node, RAM usage fills up to 195.6 GB. This seems quite excessive and strange to me. >>>>> >>>>> 2. As you can see, the rebalance (with data) failed because one endpoint is reported as not connected (even though it actually still is connected). I’m thinking this could be due to the high RAM usage? >>>>> >>>>> Thank you for your help, >>>>> >>>>> — >>>>> Christophe >>>>> >>>>> Dr Christophe Trefois, Dipl.-Ing. >>>>> Technical Specialist / Post-Doc >>>>> >>>>> UNIVERSITÉ DU LUXEMBOURG >>>>> >>>>> LUXEMBOURG CENTRE FOR SYSTEMS BIOMEDICINE >>>>> Campus Belval | House of Biomedicine >>>>> 6, avenue du Swing >>>>> L-4367 Belvaux >>>>> T: +352 46 66 44 6124 >>>>> F: +352 46 66 44 6949 >>>>> http://www.uni.lu/lcsb >>>>> >>>>> ---- >>>>> This message is confidential and may contain privileged information. >>>>> It is intended for the named recipient only. >>>>> If you receive it in error please notify me and permanently delete the original message and any copies. >>>>> ---- >>>>> >>>>> >>>>> >>>>> On 17 Aug 2015, at 11:27, Mohammed Rafi K C >>>>> <rkavunga@xxxxxxxxxx> wrote: >>>>> >>>>> >>>>> >>>>> On 08/17/2015 01:58 AM, Christophe TREFOIS wrote: >>>>> Dear all, >>>>> >>>>> I have successfully added a new node to our setup, and finally managed to get a successful fix-layout run as well, with no errors. >>>>> >>>>> Now, as per the documentation, I started a gluster volume rebalance live start task and I see many skipped files. >>>>> The error log then contains entries like the following for each skipped file. 
>>>>> >>>>> [2015-08-16 20:23:30.591161] E [MSGID: 109023] >>>>> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file >>>>> failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/004010008.flex lookup failed >>>>> [2015-08-16 20:23:30.768391] E [MSGID: 109023] >>>>> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file >>>>> failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/007005003.flex lookup failed >>>>> [2015-08-16 20:23:30.804811] E [MSGID: 109023] >>>>> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file >>>>> failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/006005009.flex lookup failed >>>>> [2015-08-16 20:23:30.805201] E [MSGID: 109023] >>>>> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file >>>>> failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/005006011.flex lookup failed >>>>> [2015-08-16 20:23:30.880037] E [MSGID: 109023] >>>>> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file >>>>> failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/005009012.flex lookup failed >>>>> [2015-08-16 20:23:31.038236] E [MSGID: 109023] >>>>> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file >>>>> failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/003008007.flex lookup failed >>>>> [2015-08-16 20:23:31.259762] E [MSGID: 109023] >>>>> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file >>>>> failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/004008006.flex lookup failed >>>>> [2015-08-16 20:23:31.333764] E [MSGID: 109023] >>>>> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file >>>>> failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/007008001.flex lookup failed >>>>> [2015-08-16 20:23:31.340190] E [MSGID: 109023] >>>>> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file >>>>> failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/006007004.flex lookup failed >>>>> >>>>> Update: one of the rebalance tasks has now failed. >>>>> >>>>> @Rafi, I got the same error as on Friday, except this time with data. >>>>> >>>>> Packets carrying the ping request could be waiting in the queue for the whole time-out period, because of the heavy traffic in the network. I have sent a patch for this. You can track its status here: >>>>> http://review.gluster.org/11935 >>>>> >>>>> >>>>> >>>>> [2015-08-16 20:24:34.533167] C >>>>> [rpc-clnt-ping.c:161:rpc_clnt_ping_timer_expired] 0-live-client-0: server >>>>> 192.168.123.104:49164 has not responded in the last 42 seconds, >>>>> disconnecting. 
>>>>> [2015-08-16 20:24:34.533614] E [rpc-clnt.c:362:saved_frames_unwind] (--> >>>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> >>>>> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> >>>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> >>>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] (--> >>>>> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) 0-live-client-0: >>>>> forced unwinding frame type(GlusterFS 3.3) op(INODELK(29)) called at >>>>> 2015-08-16 20:23:51.305640 (xid=0x5dd4da) >>>>> [2015-08-16 20:24:34.533672] E [MSGID: 114031] >>>>> [client-rpc-fops.c:1621:client3_3_inodelk_cbk] 0-live-client-0: remote >>>>> operation failed [Transport endpoint is not connected] >>>>> [2015-08-16 20:24:34.534201] E [rpc-clnt.c:362:saved_frames_unwind] (--> >>>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> >>>>> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> >>>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> >>>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] (--> >>>>> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) 0-live-client-0: >>>>> forced unwinding frame type(GlusterFS 3.3) op(READ(12)) called at 2015-08-16 >>>>> 20:23:51.303938 (xid=0x5dd4d7) >>>>> [2015-08-16 20:24:34.534347] E [MSGID: 109023] >>>>> [dht-rebalance.c:1124:dht_migrate_file] 0-live-dht: Migrate file failed: >>>>> /hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/007008007.flex: failed to migrate data >>>>> [2015-08-16 20:24:34.534413] E [rpc-clnt.c:362:saved_frames_unwind] (--> >>>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> >>>>> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> >>>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> >>>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] >>>>> (--> >>>>> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) >>>>> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(READ(12)) >>>>> called at 2015-08-16 20:23:51.303969 (xid=0x5dd4d8) >>>>> [2015-08-16 20:24:34.534579] E [MSGID: 109023] >>>>> [dht-rebalance.c:1124:dht_migrate_file] 0-live-dht: Migrate file failed: >>>>> /hcs/hcs/OperaArchiveCol/SK >>>>> 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/007009012.flex: >>>>> failed to migrate data >>>>> [2015-08-16 20:24:34.534676] E [rpc-clnt.c:362:saved_frames_unwind] (--> >>>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> >>>>> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> >>>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> >>>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] >>>>> (--> >>>>> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) >>>>> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(READ(12)) >>>>> called at 2015-08-16 20:23:51.313548 (xid=0x5dd4db) >>>>> [2015-08-16 20:24:34.534745] E [MSGID: 109023] >>>>> [dht-rebalance.c:1124:dht_migrate_file] 0-live-dht: Migrate file failed: >>>>> /hcs/hcs/OperaArchiveCol/SK >>>>> 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/006008011.flex: >>>>> failed to migrate data >>>>> [2015-08-16 20:24:34.535199] E [rpc-clnt.c:362:saved_frames_unwind] (--> >>>>> 
/lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> >>>>> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> >>>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> >>>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] >>>>> (--> >>>>> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) >>>>> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(READ(12)) >>>>> called at 2015-08-16 20:23:51.326369 (xid=0x5dd4dc) >>>>> [2015-08-16 20:24:34.535232] E [MSGID: 109023] >>>>> [dht-rebalance.c:1124:dht_migrate_file] 0-live-dht: Migrate file failed: >>>>> /hcs/hcs/OperaArchiveCol/SK >>>>> 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/005003001.flex: >>>>> failed to migrate data >>>>> [2015-08-16 20:24:34.535984] E [rpc-clnt.c:362:saved_frames_unwind] (--> >>>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> >>>>> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> >>>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> >>>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] >>>>> (--> >>>>> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) >>>>> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(READ(12)) >>>>> called at 2015-08-16 20:23:51.326437 (xid=0x5dd4dd) >>>>> [2015-08-16 20:24:34.536069] E [MSGID: 109023] >>>>> [dht-rebalance.c:1124:dht_migrate_file] 0-live-dht: Migrate file failed: >>>>> /hcs/hcs/OperaArchiveCol/SK >>>>> 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/007010012.flex: >>>>> failed to migrate data >>>>> [2015-08-16 20:24:34.536267] E [rpc-clnt.c:362:saved_frames_unwind] (--> >>>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> >>>>> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> >>>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> >>>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] >>>>> (--> >>>>> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) >>>>> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) >>>>> op(LOOKUP(27)) >>>>> called at 2015-08-16 20:23:51.337240 (xid=0x5dd4de) >>>>> [2015-08-16 20:24:34.536339] E [MSGID: 109023] >>>>> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file >>>>> failed:/hcs/hcs/OperaArchiveCol/SK >>>>> 20131011_Oligo_Rot_lowConc_P1/Meas_08(2013-10-11_20-12-25)/002005012.flex >>>>> lookup failed >>>>> [2015-08-16 20:24:34.536487] E [rpc-clnt.c:362:saved_frames_unwind] (--> >>>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> >>>>> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> >>>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> >>>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] >>>>> (--> >>>>> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) >>>>> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) >>>>> op(LOOKUP(27)) >>>>> called at 2015-08-16 20:23:51.425254 (xid=0x5dd4df) >>>>> [2015-08-16 20:24:34.536685] E [rpc-clnt.c:362:saved_frames_unwind] (--> >>>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> >>>>> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> >>>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> >>>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] >>>>> (--> >>>>> 
/lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) >>>>> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) >>>>> op(LOOKUP(27)) >>>>> called at 2015-08-16 20:23:51.738907 (xid=0x5dd4e0) >>>>> [2015-08-16 20:24:34.536891] E [rpc-clnt.c:362:saved_frames_unwind] (--> >>>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> >>>>> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> >>>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> >>>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] >>>>> (--> >>>>> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) >>>>> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) >>>>> op(LOOKUP(27)) >>>>> called at 2015-08-16 20:23:51.805096 (xid=0x5dd4e1) >>>>> [2015-08-16 20:24:34.537316] E [rpc-clnt.c:362:saved_frames_unwind] (--> >>>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> >>>>> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> >>>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> >>>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] >>>>> (--> >>>>> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) >>>>> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) >>>>> op(LOOKUP(27)) >>>>> called at 2015-08-16 20:23:51.805977 (xid=0x5dd4e2) >>>>> [2015-08-16 20:24:34.537735] E [rpc-clnt.c:362:saved_frames_unwind] (--> >>>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> >>>>> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> >>>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> >>>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] >>>>> (--> >>>>> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) >>>>> 0-live-client-0: forced unwinding frame type(GF-DUMP) op(NULL(2)) called >>>>> at >>>>> 2015-08-16 20:23:52.530107 (xid=0x5dd4e3) >>>>> [2015-08-16 20:24:34.538475] E [MSGID: 114031] >>>>> [client-rpc-fops.c:1621:client3_3_inodelk_cbk] 0-live-client-0: remote >>>>> operation failed [Transport endpoint is not connected] >>>>> The message "E [MSGID: 114031] >>>>> [client-rpc-fops.c:1621:client3_3_inodelk_cbk] >>>>> 0-live-client-0: remote operation failed [Transport endpoint is not >>>>> connected]" repeated 4 times between [2015-08-16 20:24:34.538475] and >>>>> [2015-08-16 20:24:34.538535] >>>>> [2015-08-16 20:24:34.538584] E [MSGID: 109023] >>>>> [dht-rebalance.c:1617:gf_defrag_migrate_single_file] 0-live-dht: Migrate >>>>> file failed: 002004003.flex lookup failed >>>>> [2015-08-16 20:24:34.538904] E [MSGID: 109023] >>>>> [dht-rebalance.c:1617:gf_defrag_migrate_single_file] 0-live-dht: Migrate >>>>> file failed: 003009008.flex lookup failed >>>>> [2015-08-16 20:24:34.539724] E [MSGID: 109023] >>>>> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file >>>>> failed:/hcs/hcs/OperaArchiveCol/SK >>>>> 20131011_Oligo_Rot_lowConc_P1/Meas_08(2013-10-11_20-12-25)/005009006.flex >>>>> lookup failed >>>>> [2015-08-16 20:24:34.539820] E [MSGID: 109016] >>>>> [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout >>>>> failed >>>>> for /hcs/hcs/OperaArchiveCol/SK >>>>> 20131011_Oligo_Rot_lowConc_P1/Meas_08(2013-10-11_20-12-25) >>>>> [2015-08-16 20:24:34.540031] E [MSGID: 109016] >>>>> [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout >>>>> failed >>>>> for /hcs/hcs/OperaArchiveCol/SK 
20131011_Oligo_Rot_lowConc_P1 >>>>> [2015-08-16 20:24:34.540691] E [MSGID: 114031] >>>>> [client-rpc-fops.c:251:client3_3_mknod_cbk] 0-live-client-0: remote >>>>> operation failed. Path: /hcs/hcs/OperaArchiveCol/SK >>>>> 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/002005008.flex >>>>> [Transport endpoint is not connected] >>>>> [2015-08-16 20:24:34.541152] E [MSGID: 114031] >>>>> [client-rpc-fops.c:251:client3_3_mknod_cbk] 0-live-client-0: remote >>>>> operation failed. Path: /hcs/hcs/OperaArchiveCol/SK >>>>> 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/005004009.flex >>>>> [Transport endpoint is not connected] >>>>> [2015-08-16 20:24:34.541331] E [MSGID: 114031] >>>>> [client-rpc-fops.c:251:client3_3_mknod_cbk] 0-live-client-0: remote >>>>> operation failed. Path: /hcs/hcs/OperaArchiveCol/SK >>>>> 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/007005011.flex >>>>> [Transport endpoint is not connected] >>>>> [2015-08-16 20:24:34.541486] E [MSGID: 109016] >>>>> [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed >>>>> for /hcs/hcs/OperaArchiveCol >>>>> [2015-08-16 20:24:34.541572] E [MSGID: 109016] >>>>> [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed >>>>> for /hcs/hcs >>>>> [2015-08-16 20:24:34.541639] E [MSGID: 109016] >>>>> [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed >>>>> for /hcs >>>>> >>>>> Any help would be greatly appreciated. >>>>> CCing the DHT team to give you a better idea about why the rebalance failed, and about the huge memory consumption by the rebalance process (200 GB RAM). >>>>> >>>>> Regards, >>>>> Rafi KC >>>>> >>>>> >>>>> >>>>> >>>>> Thanks, >>>>> >>>>> -- >>>>> Christophe >>>>> >>>>> Dr Christophe Trefois, Dipl.-Ing. >>>>> Technical Specialist / Post-Doc >>>>> >>>>> UNIVERSITÉ DU LUXEMBOURG >>>>> >>>>> LUXEMBOURG CENTRE FOR SYSTEMS BIOMEDICINE >>>>> Campus Belval | House of Biomedicine >>>>> 6, avenue du Swing >>>>> L-4367 Belvaux >>>>> T: +352 46 66 44 6124 >>>>> F: +352 46 66 44 6949 >>>>> http://www.uni.lu/lcsb >>>>> >>>>> ---- >>>>> This message is confidential and may contain privileged information. >>>>> It is intended for the named recipient only. >>>>> If you receive it in error please notify me and permanently delete the >>>>> original message and any copies. >>>>> ---- >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> Gluster-devel mailing list >>>>> Gluster-devel@xxxxxxxxxxx >>>>> http://www.gluster.org/mailman/listinfo/gluster-devel >>> >>> >> _______________________________________________ >> Gluster-devel mailing list >> Gluster-devel@xxxxxxxxxxx >> http://www.gluster.org/mailman/listinfo/gluster-devel >> _______________________________________________ Gluster-devel mailing list Gluster-devel@xxxxxxxxxxx http://www.gluster.org/mailman/listinfo/gluster-devel
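
For readers hitting the same situation, here is a minimal sketch of how the diagnostics Susant asks for in this thread (per-process gluster memory usage, whether a rebalance process is actually running, and the glusterd/rebalance logs) can be collected. It assumes the pdsh group "live" and volume name "live" used above, and the default GlusterFS log directory /var/log/glusterfs; adjust for your installation:

# Memory usage (RSS/VSZ) of every gluster process on each node
pdsh -g live "ps -C glusterd,glusterfsd,glusterfs -o pid,rss,vsz,args"

# Check whether a rebalance process is really running
# (the [r] trick keeps grep from matching itself)
pdsh -g live "ps aux | grep '[r]ebalance'"

# Rebalance status as glusterd sees it
gluster volume rebalance live status

# Logs to post, at their usual default locations:
#   glusterd:   /var/log/glusterfs/etc-glusterfs-glusterd.vol.log
#   rebalance:  /var/log/glusterfs/live-rebalance.log

Comparing the ps output against the "Rebalance not started" CLI error above is exactly the mismatch the glusterd team wanted to confirm: a rebalance process still alive on a node while glusterd believes none was started.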
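On the "has not responded in the last 42 seconds" disconnects that killed the rebalance: 42 seconds is the default of the network.ping-timeout volume option, which is the timer expiring in the rpc_clnt_ping_timer_expired message above. As a possible stop-gap while a fix like http://review.gluster.org/11935 lands, the timeout can be raised so that ping replies queued behind heavy rebalance traffic get more headroom. This is a tuning sketch, not a fix for the underlying queueing issue, and the value 120 is only an example:

# Inspect the current value ("gluster volume get" is available from 3.7)
gluster volume get live network.ping-timeout

# Raise it, e.g. to 120 seconds
gluster volume set live network.ping-timeout 120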