Dear Susant,

The rebalance failed again and also showed (in my opinion) excessive RAM usage. Please find a detailed list below.

All logs: http://wikisend.com/download/651948/allstores.tar.gz

Any advice on how I could complete the rebalance successfully would be much appreciated. The fedora pastes are the output of top on each node at that time (more or less).

Please let me know if you need more information,

Best,

—— Start of mem info

# After reboot, before starting glusterd

```bash
[root@highlander ~]# pdsh -g live 'free -m'
stor106:              total        used        free      shared  buff/cache   available
stor106: Mem:        193249        2208      190825           9         215      190772
stor106: Swap:            0           0           0
stor105:              total        used        free      shared  buff/cache   available
stor105: Mem:        193248        2275      190738           9         234      190681
stor105: Swap:            0           0           0
stor104:              total        used        free      shared  buff/cache   available
stor104: Mem:        193249        2221      190811           9         216      190757
stor104: Swap:            0           0           0
[root@highlander ~]#
```

# Gluster Info

```bash
[root@stor106 glusterfs]# gluster volume info

Volume Name: live
Type: Distribute
Volume ID: 1328637d-7730-4627-8945-bbe43626d527
Status: Started
Number of Bricks: 9
Transport-type: tcp
Bricks:
Brick1: stor104:/zfs/brick0/brick
Brick2: stor104:/zfs/brick1/brick
Brick3: stor104:/zfs/brick2/brick
Brick4: stor106:/zfs/brick0/brick
Brick5: stor106:/zfs/brick1/brick
Brick6: stor106:/zfs/brick2/brick
Brick7: stor105:/zfs/brick0/brick
Brick8: stor105:/zfs/brick1/brick
Brick9: stor105:/zfs/brick2/brick
Options Reconfigured:
nfs.disable: true
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
performance.write-behind-window-size: 4MB
performance.io-thread-count: 32
performance.client-io-threads: on
performance.cache-size: 1GB
performance.cache-refresh-timeout: 60
performance.cache-max-file-size: 4MB
cluster.data-self-heal-algorithm: full
diagnostics.client-log-level: ERROR
diagnostics.brick-log-level: ERROR
cluster.min-free-disk: 1%
server.allow-insecure: on
```

# Starting glusterd

```bash
[root@highlander ~]# pdsh -g live 'systemctl start glusterd'
[root@highlander ~]# pdsh -g live 'free -m'
stor106:              total        used        free      shared  buff/cache   available
stor106: Mem:        193249        2290      190569           9         389      190587
stor106: Swap:            0           0           0
stor104:              total        used        free      shared  buff/cache   available
stor104: Mem:        193249        2297      190557           9         394      190571
stor104: Swap:            0           0           0
stor105:              total        used        free      shared  buff/cache   available
stor105: Mem:        193248        2286      190554           9         407      190595
stor105: Swap:            0           0           0
[root@highlander ~]# systemctl start glusterd
[root@highlander ~]# gluster volume start live
volume start: live: success
[root@highlander ~]# gluster volume status
Status of volume: live
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick stor104:/zfs/brick0/brick             49164     0          Y       5945
Brick stor104:/zfs/brick1/brick             49165     0          Y       5963
Brick stor104:/zfs/brick2/brick             49166     0          Y       5981
Brick stor106:/zfs/brick0/brick             49158     0          Y       5256
Brick stor106:/zfs/brick1/brick             49159     0          Y       5274
Brick stor106:/zfs/brick2/brick             49160     0          Y       5292
Brick stor105:/zfs/brick0/brick             49155     0          Y       5284
Brick stor105:/zfs/brick1/brick             49156     0          Y       5302
Brick stor105:/zfs/brick2/brick             49157     0          Y       5320
NFS Server on localhost                     N/A       N/A        N       N/A
NFS Server on 192.168.123.106               N/A       N/A        N       N/A
NFS Server on stor105                       N/A       N/A        N       N/A
NFS Server on 192.168.123.104               N/A       N/A        N       N/A

Task Status of Volume live
------------------------------------------------------------------------------
There are no active volume tasks

[root@highlander ~]#
```
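In case it helps to pin the growth below on specific daemons (you asked for the per-process usage of glusterd, glusterfsd and the glusterfs mount/rebalance processes), a per-process snapshot could be captured next to each `free -m`. A minimal sketch, assuming GNU ps on the storage nodes and the same pdsh "live" group used above:

```bash
# Sketch: resident/virtual memory of the gluster daemons on every node, largest first.
# Assumes the pdsh group "live" defined above and GNU ps on the storage nodes.
# -C matches glusterd, the brick processes (glusterfsd) and any glusterfs
# client or rebalance processes by command name.
pdsh -g live 'ps -C glusterd,glusterfsd,glusterfs -o pid,rss,vsz,etime,comm --sort=-rss'
```

Running that at the same intervals as the snapshots below would show whether the usage sits in the rebalance glusterfs process or in the brick glusterfsd processes.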
# Memory usage of each node after 5 minutes

Output of top: pdsh -g live 'top -n 1 -b' | fpaste
http://paste.fedoraproject.org/256710/14399886/

```bash
[root@highlander ~]# pdsh -g live 'free -m'
stor106:              total        used        free      shared  buff/cache   available
stor106: Mem:        193249        6877      184154           9        2218      184250
stor106: Swap:            0           0           0
stor105:              total        used        free      shared  buff/cache   available
stor105: Mem:        193248       22126      169351           9        1771      169403
stor105: Swap:            0           0           0
stor104:              total        used        free      shared  buff/cache   available
stor104: Mem:        193249        2708      188638           9        1902      188687
stor104: Swap:            0           0           0
```

# Memory usage of each node after 45 minutes

```bash
[root@highlander ~]# pdsh -g live 'free -m'
stor104:              total        used        free      shared  buff/cache   available
stor104: Mem:        193249        3131      184168           9        5949      184524
stor104: Swap:            0           0           0
stor106:              total        used        free      shared  buff/cache   available
stor106: Mem:        193249       27919      158176           9        7153      158894
stor106: Swap:            0           0           0
stor105:              total        used        free      shared  buff/cache   available
stor105: Mem:        193248      117096       70621           9        5530       70891
stor105: Swap:            0           0           0
```

http://paste.fedoraproject.org/256726/43999119

# Memory usage of each node after 90 minutes

```bash
[root@highlander ~]# pdsh -g live 'free -m'
stor104:              total        used        free      shared  buff/cache   available
stor104: Mem:        193249        3390      181034           9        8825      181661
stor104: Swap:            0           0           0
stor106:              total        used        free      shared  buff/cache   available
stor106: Mem:        193249       45780      136424           9       11044      137759
stor106: Swap:            0           0           0
stor105:              total        used        free      shared  buff/cache   available
stor105: Mem:        193248      151483       33492           9        8272       33972
stor105: Swap:            0           0           0
```

http://paste.fedoraproject.org/256745/14399937

# Memory usage after 5 hours

```bash
[root@highlander ~]# pdsh -g live 'free -m'
stor104:              total        used        free      shared  buff/cache   available
stor104: Mem:        193249        4645      163186           9       25417      165473
stor104: Swap:            0           0           0
stor105:              total        used        free      shared  buff/cache   available
stor105: Mem:        193248      155094       14784           9       23369       16640
stor105: Swap:            0           0           0
stor106:              total        used        free      shared  buff/cache   available
stor106: Mem:        193249      141379       16515           9       35355       23714
stor106: Swap:            0           0           0
```

http://paste.fedoraproject.org/256879/44001235

# Memory usage after 6 hours

```bash
[root@highlander ~]# pdsh -g live 'free -m'
stor106:              total        used        free      shared  buff/cache   available
stor106: Mem:        193249      140526       12207           9       40516       21612
stor106: Swap:            0           0           0
stor104:              total        used        free      shared  buff/cache   available
stor104: Mem:        193249      102923       58748           9       31578       63632
stor104: Swap:            0           0           0
stor105:              total        used        free      shared  buff/cache   available
stor105: Mem:        193248      155394       10876           9       26977       13154
stor105: Swap:            0           0           0
```

http://paste.fedoraproject.org/256905/00168781

# Memory after 24 hours + Failed

```bash
[root@highlander ~]# pdsh -g live 'free -m'
stor105:              total        used        free      shared  buff/cache   available
stor105: Mem:        193248      136123        6323           9       50801       10281
stor105: Swap:            0           0           0
stor104:              total        used        free      shared  buff/cache   available
stor104: Mem:        193249      125320        2812           9       65116       17337
stor104: Swap:            0           0           0
stor106:              total        used        free      shared  buff/cache   available
stor106: Mem:        193249      111997       13969           9       67282       19429
stor106: Swap:            0           0           0
[root@highlander ~]#
```

http://paste.fedoraproject.org/257254/14400880

# Failed logs

```bash
[root@highlander ~]# gluster volume rebalance live status
           Node  Rebalanced-files       size     scanned   failures     skipped     status  run time in secs
---------------  ----------------  ---------  ----------  ---------  ----------  ---------  ----------------
192.168.123.104            748812      4.4TB     4160456       1311      156772     failed          63114.00
192.168.123.106           1187917      3.3TB     6021931      21625     1209503     failed          75243.00
        stor105                 0     0Bytes     2440431         16         196     failed          69658.00
volume rebalance: live: success:
```

Dr Christophe Trefois, Dipl.-Ing.
Technical Specialist / Post-Doc UNIVERSITÉ DU LUXEMBOURG LUXEMBOURG CENTRE FOR SYSTEMS BIOMEDICINE Campus Belval | House of Biomedicine 6, avenue du Swing L-4367 Belvaux T: +352 46 66 44 6124 F: +352 46 66 44 6949 http://www.uni.lu/lcsb ---- This message is confidential and may contain privileged information. It is intended for the named recipient only. If you receive it in error please notify me and permanently delete the original message and any copies. ---- > On 19 Aug 2015, at 08:14, Susant Palai <spalai@xxxxxxxxxx> wrote: > > Comments inline. > > ----- Original Message ----- >> From: "Christophe TREFOIS" <christophe.trefois@xxxxxx> >> To: "Susant Palai" <spalai@xxxxxxxxxx> >> Cc: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>, "Nithya Balachandran" <nbalacha@xxxxxxxxxx>, "Shyamsundar >> Ranganathan" <srangana@xxxxxxxxxx>, "Mohammed Rafi K C" <rkavunga@xxxxxxxxxx>, "Gluster Devel" >> <gluster-devel@xxxxxxxxxxx> >> Sent: Tuesday, August 18, 2015 8:08:41 PM >> Subject: Re: Skipped files during rebalance >> >> Hi Susan, >> >> Thank you for the response. >> >>> On 18 Aug 2015, at 10:45, Susant Palai <spalai@xxxxxxxxxx> wrote: >>> >>> Hi Christophe, >>> >>> Need some info regarding the high mem-usage. >>> >>> 1. Top output: To see whether any other process eating up memory. > > I will be interested to know the memory usage of all the gluster process referring to the high mem-usage. These process includes glusterfsd, glusterd, gluster, any mount process (glusterfs), and rebalance(glusterfs). > > >>> 2. Gluster volume info >> >> root@highlander ~]# gluster volume info >> >> Volume Name: live >> Type: Distribute >> Volume ID: 1328637d-7730-4627-8945-bbe43626d527 >> Status: Started >> Number of Bricks: 9 >> Transport-type: tcp >> Bricks: >> Brick1: stor104:/zfs/brick0/brick >> Brick2: stor104:/zfs/brick1/brick >> Brick3: stor104:/zfs/brick2/brick >> Brick4: stor106:/zfs/brick0/brick >> Brick5: stor106:/zfs/brick1/brick >> Brick6: stor106:/zfs/brick2/brick >> Brick7: stor105:/zfs/brick0/brick >> Brick8: stor105:/zfs/brick1/brick >> Brick9: stor105:/zfs/brick2/brick >> Options Reconfigured: >> diagnostics.count-fop-hits: on >> diagnostics.latency-measurement: on >> server.allow-insecure: on >> cluster.min-free-disk: 1% >> diagnostics.brick-log-level: ERROR >> diagnostics.client-log-level: ERROR >> cluster.data-self-heal-algorithm: full >> performance.cache-max-file-size: 4MB >> performance.cache-refresh-timeout: 60 >> performance.cache-size: 1GB >> performance.client-io-threads: on >> performance.io-thread-count: 32 >> performance.write-behind-window-size: 4MB >> >>> 3. Is rebalance process still running? If yes can you point to specific mem >>> usage by rebalance process? The high mem-usage was seen during rebalance >>> or even post rebalance? >> >> I would like to restart the rebalance process since it failed… But I can’t as >> the volume cannot be stopped (I wanted to reboot the servers to have a clean >> testing grounds). >> >> Here are the logs from the three nodes: >> http://paste.fedoraproject.org/256183/43989079 >> >> Maybe you could help me figure out how to stop the volume? >> >> This is what happens >> >> [root@highlander ~]# gluster volume rebalance live stop >> volume rebalance: live: failed: Rebalance not started. > > Requesting glusterd team to give input. >> >> [root@highlander ~]# ssh stor105 "gluster volume rebalance live stop" >> volume rebalance: live: failed: Rebalance not started. 
>> >> [root@highlander ~]# ssh stor104 "gluster volume rebalance live stop" >> volume rebalance: live: failed: Rebalance not started. >> >> [root@highlander ~]# ssh stor106 "gluster volume rebalance live stop" >> volume rebalance: live: failed: Rebalance not started. >> >> [root@highlander ~]# gluster volume rebalance live stop >> volume rebalance: live: failed: Rebalance not started. >> >> [root@highlander ~]# gluster volume stop live >> Stopping volume will make its data inaccessible. Do you want to continue? >> (y/n) y >> volume stop: live: failed: Staging failed on stor106. Error: rebalance >> session is in progress for the volume 'live' >> Staging failed on stor104. Error: rebalance session is in progress for the >> volume ‘live' > Can you run [ps aux | grep "rebalance"] on all the servers and post here? Just want to check whether rebalance is really running or not. Again requesting glusterd team to give inputs. > >> >> >>> 4. Gluster version >> >> [root@highlander ~]# pdsh -g live 'rpm -qa | grep gluster' >> stor104: glusterfs-api-3.7.3-1.el7.x86_64 >> stor104: glusterfs-server-3.7.3-1.el7.x86_64 >> stor104: glusterfs-libs-3.7.3-1.el7.x86_64 >> stor104: glusterfs-3.7.3-1.el7.x86_64 >> stor104: glusterfs-fuse-3.7.3-1.el7.x86_64 >> stor104: glusterfs-client-xlators-3.7.3-1.el7.x86_64 >> stor104: glusterfs-cli-3.7.3-1.el7.x86_64 >> >> stor105: glusterfs-3.7.3-1.el7.x86_64 >> stor105: glusterfs-client-xlators-3.7.3-1.el7.x86_64 >> stor105: glusterfs-api-3.7.3-1.el7.x86_64 >> stor105: glusterfs-cli-3.7.3-1.el7.x86_64 >> stor105: glusterfs-server-3.7.3-1.el7.x86_64 >> stor105: glusterfs-libs-3.7.3-1.el7.x86_64 >> stor105: glusterfs-fuse-3.7.3-1.el7.x86_64 >> >> stor106: glusterfs-libs-3.7.3-1.el7.x86_64 >> stor106: glusterfs-fuse-3.7.3-1.el7.x86_64 >> stor106: glusterfs-client-xlators-3.7.3-1.el7.x86_64 >> stor106: glusterfs-api-3.7.3-1.el7.x86_64 >> stor106: glusterfs-cli-3.7.3-1.el7.x86_64 >> stor106: glusterfs-server-3.7.3-1.el7.x86_64 >> stor106: glusterfs-3.7.3-1.el7.x86_64 >> >>> >>> Will ask for more information in case needed. >>> >>> Regards, >>> Susant >>> >>> >>> ----- Original Message ----- >>>> From: "Christophe TREFOIS" <christophe.trefois@xxxxxx> >>>> To: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>, "Nithya Balachandran" >>>> <nbalacha@xxxxxxxxxx>, "Susant Palai" >>>> <spalai@xxxxxxxxxx>, "Shyamsundar Ranganathan" <srangana@xxxxxxxxxx> >>>> Cc: "Mohammed Rafi K C" <rkavunga@xxxxxxxxxx> >>>> Sent: Monday, 17 August, 2015 7:03:20 PM >>>> Subject: Fwd: Skipped files during rebalance >>>> >>>> Hi DHT team, >>>> >>>> This email somehow didn’t get forwarded to you. >>>> >>>> In addition to my problem described below, here is one example of free >>>> memory >>>> after everything failed >>>> >>>> [root@highlander ~]# pdsh -g live 'free -m' >>>> stor106: total used free shared >>>> buff/cache >>>> available >>>> stor106: Mem: 193249 124784 1347 9 >>>> 67118 >>>> 12769 >>>> stor106: Swap: 0 0 0 >>>> stor104: total used free shared >>>> buff/cache >>>> available >>>> stor104: Mem: 193249 107617 31323 9 >>>> 54308 >>>> 42752 >>>> stor104: Swap: 0 0 0 >>>> stor105: total used free shared >>>> buff/cache >>>> available >>>> stor105: Mem: 193248 141804 6736 9 >>>> 44707 >>>> 9713 >>>> stor105: Swap: 0 0 0 >>>> >>>> So after the failed operation, there’s almost no memory free, and it is >>>> also >>>> not freed up. 
>>>> >>>> Thank you for pointing me to any directions, >>>> >>>> Kind regards, >>>> >>>> — >>>> Christophe >>>> >>>> >>>> Begin forwarded message: >>>> >>>> From: Christophe TREFOIS >>>> <christophe.trefois@xxxxxx<mailto:christophe.trefois@xxxxxx>> >>>> Subject: Re: Skipped files during rebalance >>>> Date: 17 Aug 2015 11:54:32 CEST >>>> To: Mohammed Rafi K C <rkavunga@xxxxxxxxxx<mailto:rkavunga@xxxxxxxxxx>> >>>> Cc: "gluster-devel@xxxxxxxxxxx<mailto:gluster-devel@xxxxxxxxxxx>" >>>> <gluster-devel@xxxxxxxxxxx<mailto:gluster-devel@xxxxxxxxxxx>> >>>> >>>> Dear Rafi, >>>> >>>> Thanks for submitting a patch. >>>> >>>> @DHT, I have two additional questions / problems. >>>> >>>> 1. When doing a rebalance (with data) RAM consumption on the nodes goes >>>> dramatically high, eg out of 196 GB available per node, RAM usage would >>>> fill >>>> up to 195.6 GB. This seems quite excessive and strange to me. >>>> >>>> 2. As you can see, the rebalance (with data) failed as one endpoint >>>> becomes >>>> unconnected (even though it still is connected). I’m thinking this could >>>> be >>>> due to the high RAM usage? >>>> >>>> Thank you for your help, >>>> >>>> — >>>> Christophe >>>> >>>> Dr Christophe Trefois, Dipl.-Ing. >>>> Technical Specialist / Post-Doc >>>> >>>> UNIVERSITÉ DU LUXEMBOURG >>>> >>>> LUXEMBOURG CENTRE FOR SYSTEMS BIOMEDICINE >>>> Campus Belval | House of Biomedicine >>>> 6, avenue du Swing >>>> L-4367 Belvaux >>>> T: +352 46 66 44 6124 >>>> F: +352 46 66 44 6949 >>>> http://www.uni.lu/lcsb >>>> >>>> [Facebook]<https://www.facebook.com/trefex> [Twitter] >>>> <https://twitter.com/Trefex> [Google Plus] >>>> <https://plus.google.com/+ChristopheTrefois/> [Linkedin] >>>> <https://www.linkedin.com/in/trefoischristophe> [skype] >>>> <http://skype:Trefex?call> >>>> >>>> >>>> ---- >>>> This message is confidential and may contain privileged information. >>>> It is intended for the named recipient only. >>>> If you receive it in error please notify me and permanently delete the >>>> original message and any copies. >>>> ---- >>>> >>>> >>>> >>>> On 17 Aug 2015, at 11:27, Mohammed Rafi K C >>>> <rkavunga@xxxxxxxxxx<mailto:rkavunga@xxxxxxxxxx>> wrote: >>>> >>>> >>>> >>>> On 08/17/2015 01:58 AM, Christophe TREFOIS wrote: >>>> Dear all, >>>> >>>> I have successfully added a new node to our setup, and finally managed to >>>> get >>>> a successful fix-layout run as well with no errors. >>>> >>>> Now, as per the documentation, I started a gluster volume rebalance live >>>> start task and I see many skipped files. >>>> The error log contains then entires as follows for each skipped file. 
>>>> >>>> [2015-08-16 20:23:30.591161] E [MSGID: 109023] >>>> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file >>>> failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Mea >>>> s_05(2013-10-11_17-12-02)/004010008.flex lookup failed >>>> [2015-08-16 20:23:30.768391] E [MSGID: 109023] >>>> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file >>>> failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Mea >>>> s_05(2013-10-11_17-12-02)/007005003.flex lookup failed >>>> [2015-08-16 20:23:30.804811] E [MSGID: 109023] >>>> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file >>>> failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Mea >>>> s_05(2013-10-11_17-12-02)/006005009.flex lookup failed >>>> [2015-08-16 20:23:30.805201] E [MSGID: 109023] >>>> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file >>>> failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Mea >>>> s_05(2013-10-11_17-12-02)/005006011.flex lookup failed >>>> [2015-08-16 20:23:30.880037] E [MSGID: 109023] >>>> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file >>>> failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Mea >>>> s_05(2013-10-11_17-12-02)/005009012.flex lookup failed >>>> [2015-08-16 20:23:31.038236] E [MSGID: 109023] >>>> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file >>>> failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Mea >>>> s_05(2013-10-11_17-12-02)/003008007.flex lookup failed >>>> [2015-08-16 20:23:31.259762] E [MSGID: 109023] >>>> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file >>>> failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Mea >>>> s_05(2013-10-11_17-12-02)/004008006.flex lookup failed >>>> [2015-08-16 20:23:31.333764] E [MSGID: 109023] >>>> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file >>>> failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Mea >>>> s_05(2013-10-11_17-12-02)/007008001.flex lookup failed >>>> [2015-08-16 20:23:31.340190] E [MSGID: 109023] >>>> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file >>>> failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Mea >>>> s_05(2013-10-11_17-12-02)/006007004.flex lookup failed >>>> >>>> Update: one of the rebalance tasks now failed. >>>> >>>> @Rafi, I got the same error as Friday except this time with data. >>>> >>>> Packets that carrying the ping request could be waiting in the queue >>>> during >>>> the whole time-out period, because of the heavy traffic in the network. I >>>> have sent a patch for this. You can track the status here : >>>> http://review.gluster.org/11935 >>>> >>>> >>>> >>>> [2015-08-16 20:24:34.533167] C >>>> [rpc-clnt-ping.c:161:rpc_clnt_ping_timer_expired] 0-live-client-0: server >>>> 192.168.123.104:49164 has not responded in the last 42 seconds, >>>> disconnecting. 
>>>> [2015-08-16 20:24:34.533614] E [rpc-clnt.c:362:saved_frames_unwind] (--> >>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> >>>> /lib64/libgfrpc.so.0(saved_frames_unwin >>>> d+0x1de)[0x7fa454bb09be] (--> >>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> >>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] >>>> (--> >>>> /lib64/li >>>> bgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) 0-live-client-0: >>>> forced unwinding frame type(GlusterFS 3.3) op(INODELK(29)) called at >>>> 2015-08-16 20:23:51.305640 (xid=0x5dd4da) >>>> [2015-08-16 20:24:34.533672] E [MSGID: 114031] >>>> [client-rpc-fops.c:1621:client3_3_inodelk_cbk] 0-live-client-0: remote >>>> operation failed [Transport endpoint is not connected] >>>> [2015-08-16 20:24:34.534201] E [rpc-clnt.c:362:saved_frames_unwind] (--> >>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> >>>> /lib64/libgfrpc.so.0(saved_frames_unwin >>>> d+0x1de)[0x7fa454bb09be] (--> >>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> >>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] >>>> (--> >>>> /lib64/li >>>> bgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) 0-live-client-0: >>>> forced unwinding frame type(GlusterFS 3.3) op(READ(12)) called at >>>> 2015-08-16 >>>> 20:23:51.303938 (xid=0x5dd4d7) >>>> [2015-08-16 20:24:34.534347] E [MSGID: 109023] >>>> [dht-rebalance.c:1124:dht_migrate_file] 0-live-dht: Migrate file failed: >>>> /hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_ >>>> 12(2013-10-12_00-12-55)/007008007.flex: failed to migrate data >>>> [2015-08-16 20:24:34.534413] E [rpc-clnt.c:362:saved_frames_unwind] (--> >>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> >>>> /lib64/libgfrpc.so.0(saved_frames_unwin >>>> d+0x1de)[0x7fa454bb09be] (--> >>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> >>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] >>>> (--> >>>> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) >>>> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(READ(12)) >>>> called at 2015-08-16 20:23:51.303969 (xid=0x5dd4d8) >>>> [2015-08-16 20:24:34.534579] E [MSGID: 109023] >>>> [dht-rebalance.c:1124:dht_migrate_file] 0-live-dht: Migrate file failed: >>>> /hcs/hcs/OperaArchiveCol/SK >>>> 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/007009012.flex: >>>> failed to migrate data >>>> [2015-08-16 20:24:34.534676] E [rpc-clnt.c:362:saved_frames_unwind] (--> >>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> >>>> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> >>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> >>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] >>>> (--> >>>> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) >>>> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(READ(12)) >>>> called at 2015-08-16 20:23:51.313548 (xid=0x5dd4db) >>>> [2015-08-16 20:24:34.534745] E [MSGID: 109023] >>>> [dht-rebalance.c:1124:dht_migrate_file] 0-live-dht: Migrate file failed: >>>> /hcs/hcs/OperaArchiveCol/SK >>>> 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/006008011.flex: >>>> failed to migrate data >>>> [2015-08-16 20:24:34.535199] E [rpc-clnt.c:362:saved_frames_unwind] (--> >>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] 
(--> >>>> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> >>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> >>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] >>>> (--> >>>> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) >>>> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(READ(12)) >>>> called at 2015-08-16 20:23:51.326369 (xid=0x5dd4dc) >>>> [2015-08-16 20:24:34.535232] E [MSGID: 109023] >>>> [dht-rebalance.c:1124:dht_migrate_file] 0-live-dht: Migrate file failed: >>>> /hcs/hcs/OperaArchiveCol/SK >>>> 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/005003001.flex: >>>> failed to migrate data >>>> [2015-08-16 20:24:34.535984] E [rpc-clnt.c:362:saved_frames_unwind] (--> >>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> >>>> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> >>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> >>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] >>>> (--> >>>> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) >>>> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(READ(12)) >>>> called at 2015-08-16 20:23:51.326437 (xid=0x5dd4dd) >>>> [2015-08-16 20:24:34.536069] E [MSGID: 109023] >>>> [dht-rebalance.c:1124:dht_migrate_file] 0-live-dht: Migrate file failed: >>>> /hcs/hcs/OperaArchiveCol/SK >>>> 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/007010012.flex: >>>> failed to migrate data >>>> [2015-08-16 20:24:34.536267] E [rpc-clnt.c:362:saved_frames_unwind] (--> >>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> >>>> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> >>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> >>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] >>>> (--> >>>> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) >>>> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) >>>> called at 2015-08-16 20:23:51.337240 (xid=0x5dd4de) >>>> [2015-08-16 20:24:34.536339] E [MSGID: 109023] >>>> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file >>>> failed:/hcs/hcs/OperaArchiveCol/SK >>>> 20131011_Oligo_Rot_lowConc_P1/Meas_08(2013-10-11_20-12-25)/002005012.flex >>>> lookup failed >>>> [2015-08-16 20:24:34.536487] E [rpc-clnt.c:362:saved_frames_unwind] (--> >>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> >>>> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> >>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> >>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] >>>> (--> >>>> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) >>>> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) >>>> called at 2015-08-16 20:23:51.425254 (xid=0x5dd4df) >>>> [2015-08-16 20:24:34.536685] E [rpc-clnt.c:362:saved_frames_unwind] (--> >>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> >>>> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> >>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> >>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] >>>> (--> >>>> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) >>>> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) 
op(LOOKUP(27)) >>>> called at 2015-08-16 20:23:51.738907 (xid=0x5dd4e0) >>>> [2015-08-16 20:24:34.536891] E [rpc-clnt.c:362:saved_frames_unwind] (--> >>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> >>>> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> >>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> >>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] >>>> (--> >>>> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) >>>> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) >>>> called at 2015-08-16 20:23:51.805096 (xid=0x5dd4e1) >>>> [2015-08-16 20:24:34.537316] E [rpc-clnt.c:362:saved_frames_unwind] (--> >>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> >>>> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> >>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> >>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] >>>> (--> >>>> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) >>>> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) >>>> called at 2015-08-16 20:23:51.805977 (xid=0x5dd4e2) >>>> [2015-08-16 20:24:34.537735] E [rpc-clnt.c:362:saved_frames_unwind] (--> >>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> >>>> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> >>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> >>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] >>>> (--> >>>> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) >>>> 0-live-client-0: forced unwinding frame type(GF-DUMP) op(NULL(2)) called >>>> at >>>> 2015-08-16 20:23:52.530107 (xid=0x5dd4e3) >>>> [2015-08-16 20:24:34.538475] E [MSGID: 114031] >>>> [client-rpc-fops.c:1621:client3_3_inodelk_cbk] 0-live-client-0: remote >>>> operation failed [Transport endpoint is not connected] >>>> The message "E [MSGID: 114031] >>>> [client-rpc-fops.c:1621:client3_3_inodelk_cbk] >>>> 0-live-client-0: remote operation failed [Transport endpoint is not >>>> connected]" repeated 4 times between [2015-08-16 20:24:34.538475] and >>>> [2015-08-16 20:24:34.538535] >>>> [2015-08-16 20:24:34.538584] E [MSGID: 109023] >>>> [dht-rebalance.c:1617:gf_defrag_migrate_single_file] 0-live-dht: Migrate >>>> file failed: 002004003.flex lookup failed >>>> [2015-08-16 20:24:34.538904] E [MSGID: 109023] >>>> [dht-rebalance.c:1617:gf_defrag_migrate_single_file] 0-live-dht: Migrate >>>> file failed: 003009008.flex lookup failed >>>> [2015-08-16 20:24:34.539724] E [MSGID: 109023] >>>> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file >>>> failed:/hcs/hcs/OperaArchiveCol/SK >>>> 20131011_Oligo_Rot_lowConc_P1/Meas_08(2013-10-11_20-12-25)/005009006.flex >>>> lookup failed >>>> [2015-08-16 20:24:34.539820] E [MSGID: 109016] >>>> [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed >>>> for /hcs/hcs/OperaArchiveCol/SK >>>> 20131011_Oligo_Rot_lowConc_P1/Meas_08(2013-10-11_20-12-25) >>>> [2015-08-16 20:24:34.540031] E [MSGID: 109016] >>>> [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed >>>> for /hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1 >>>> [2015-08-16 20:24:34.540691] E [MSGID: 114031] >>>> [client-rpc-fops.c:251:client3_3_mknod_cbk] 0-live-client-0: remote >>>> operation failed. 
Path: /hcs/hcs/OperaArchiveCol/SK >>>> 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/002005008.flex >>>> [Transport endpoint is not connected] >>>> [2015-08-16 20:24:34.541152] E [MSGID: 114031] >>>> [client-rpc-fops.c:251:client3_3_mknod_cbk] 0-live-client-0: remote >>>> operation failed. Path: /hcs/hcs/OperaArchiveCol/SK >>>> 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/005004009.flex >>>> [Transport endpoint is not connected] >>>> [2015-08-16 20:24:34.541331] E [MSGID: 114031] >>>> [client-rpc-fops.c:251:client3_3_mknod_cbk] 0-live-client-0: remote >>>> operation failed. Path: /hcs/hcs/OperaArchiveCol/SK >>>> 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/007005011.flex >>>> [Transport endpoint is not connected] >>>> [2015-08-16 20:24:34.541486] E [MSGID: 109016] >>>> [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed >>>> for /hcs/hcs/OperaArchiveCol >>>> [2015-08-16 20:24:34.541572] E [MSGID: 109016] >>>> [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed >>>> for /hcs/hcs >>>> [2015-08-16 20:24:34.541639] E [MSGID: 109016] >>>> [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed >>>> for /hcs >>>> >>>> Any help would be greatly appreciated. >>>> CCing dht teams to give you better idea about why rebalance failed/ and >>>> about >>>> huge memory consumption by rebalance process (200GB RAM) . >>>> >>>> Regards >>>> Rafi KC >>>> >>>> >>>> >>>> >>>> Thanks, >>>> >>>> -- >>>> Christophe >>>> >>>> Dr Christophe Trefois, Dipl.-Ing. >>>> Technical Specialist / Post-Doc >>>> >>>> UNIVERSITÉ DU LUXEMBOURG >>>> >>>> LUXEMBOURG CENTRE FOR SYSTEMS BIOMEDICINE >>>> Campus Belval | House of Biomedicine >>>> 6, avenue du Swing >>>> L-4367 Belvaux >>>> T: +352 46 66 44 6124 >>>> F: +352 46 66 44 6949 >>>> http://www.uni.lu/lcsb >>>> >>>> ---- >>>> This message is confidential and may contain privileged information. >>>> It is intended for the named recipient only. >>>> If you receive it in error please notify me and permanently delete the >>>> original message and any copies. >>>> ---- >>>> >>>> >>>> >>>> >>>> >>>> >>>> _______________________________________________ >>>> Gluster-devel mailing list >>>> Gluster-devel@xxxxxxxxxxx<mailto:Gluster-devel@xxxxxxxxxxx> >>>> http://www.gluster.org/mailman/listinfo/gluster-devel >> >> _______________________________________________ Gluster-devel mailing list Gluster-devel@xxxxxxxxxxx http://www.gluster.org/mailman/listinfo/gluster-devel