Hi Mohammed,

I requested a state dump taken while the rebalance is running in the mem-leak state. The state dump will help debug the mem-leak. In the meantime, we are trying to reproduce the issue.

Regards,
Susant

----- Original Message -----
From: "Mohammed Rafi K C" <rkavunga@xxxxxxxxxx>
To: "Christophe TREFOIS" <christophe.trefois@xxxxxx>
Cc: "Susant Palai" <spalai@xxxxxxxxxx>, "Gluster Devel" <gluster-devel@xxxxxxxxxxx>, "valentin plugaru" <valentin.plugaru@xxxxxx>
Sent: Monday, 24 August, 2015 12:37:50 PM
Subject: Re: Fwd: Skipped files during rebalance

>
>
> Dear Susant,
>
> Do you think the patch submitted by Rafi could help with this?

Yes, unless there is a network failure/problem in your setup.

>
> The nodes are on the same network in the same rack and as such should have no connectivity issues.

I could see DNS resolution failures in the log messages; can you cross-check the connection setup?

>
> Is it possible that the processes on nodes 104 and 106 were too “busy” and unable to accept new connections?
>
> Any pointers would be appreciated,
>
> --
> Christophe
>
> Dr Christophe Trefois, Dipl.-Ing.
> Technical Specialist / Post-Doc
>
> UNIVERSITÉ DU LUXEMBOURG
>
> LUXEMBOURG CENTRE FOR SYSTEMS BIOMEDICINE
> Campus Belval | House of Biomedicine
> 6, avenue du Swing
> L-4367 Belvaux
> T: +352 46 66 44 6124
> F: +352 46 66 44 6949
> http://www.uni.lu/lcsb
>
>
>
> ----
> This message is confidential and may contain privileged information.
> It is intended for the named recipient only.
> If you receive it in error please notify me and permanently delete the original message and any copies.
> ----
>
>
>
>> On 21 Aug 2015, at 14:57, Susant Palai <spalai@xxxxxxxxxx> wrote:
>>
>> Hi,
>> The rebalance failures are mostly due to network problems.
>>
>> Here is the log:
>>
>> [2015-08-16 20:31:36.301467] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/PA 27112012_ATCC_Fibroblasts_Chem/Meas_10(2012-11-27_20-15-48)/003002002.flex lookup failed
>> [2015-08-16 20:31:36.921405] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/PA 27112012_ATCC_Fibroblasts_Chem/Meas_10(2012-11-27_20-15-48)/003004005.flex lookup failed
>> [2015-08-16 20:31:36.921591] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/PA 27112012_ATCC_Fibroblasts_Chem/Meas_10(2012-11-27_20-15-48)/006004004.flex lookup failed
>> [2015-08-16 20:31:36.921770] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/PA 27112012_ATCC_Fibroblasts_Chem/Meas_10(2012-11-27_20-15-48)/005004007.flex lookup failed
>> [2015-08-16 20:31:37.577758] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/PA 27112012_ATCC_Fibroblasts_Chem/Meas_10(2012-11-27_20-15-48)/007004005.flex lookup failed
>> [2015-08-16 20:34:12.387425] E [socket.c:2332:socket_connect_finish] 0-live-client-4: connection to 192.168.123.106:24007 failed (Connection refused)
>> [2015-08-16 20:34:12.392820] E [socket.c:2332:socket_connect_finish] 0-live-client-5: connection to 192.168.123.106:24007 failed (Connection refused)
>> [2015-08-16 20:34:12.398023] E [socket.c:2332:socket_connect_finish] 0-live-client-0: connection to 192.168.123.104:24007 failed (Connection refused)
>> [2015-08-16 20:34:12.402904] E [socket.c:2332:socket_connect_finish] 0-live-client-2: connection to 192.168.123.104:24007 failed (Connection refused)
>> [2015-08-16 20:34:12.407464] E [socket.c:2332:socket_connect_finish] 0-live-client-3: connection to 192.168.123.106:24007 failed (Connection refused)
>> [2015-08-16 20:34:12.412249] E [socket.c:2332:socket_connect_finish] 0-live-client-1: connection to 192.168.123.104:24007 failed (Connection refused)
>> [2015-08-16 20:34:12.416621] E [socket.c:2332:socket_connect_finish] 0-live-client-6: connection to 192.168.123.105:24007 failed (Connection refused)
>> [2015-08-16 20:34:12.420906] E [socket.c:2332:socket_connect_finish] 0-live-client-8: connection to 192.168.123.105:24007 failed (Connection refused)
>> [2015-08-16 20:34:12.425066] E [socket.c:2332:socket_connect_finish] 0-live-client-7: connection to 192.168.123.105:24007 failed (Connection refused)
>> [2015-08-16 20:34:17.479925] E [socket.c:2332:socket_connect_finish] 0-glusterfs: connection to 127.0.0.1:24007 failed (Connection refused)
>> [2015-08-16 20:36:23.788206] E [MSGID: 101075] [common-utils.c:314:gf_resolve_ip6] 0-resolver: getaddrinfo failed (Name or service not known)
>> [2015-08-16 20:36:23.788286] E [name.c:247:af_inet_client_get_remote_sockaddr] 0-live-client-4: DNS resolution failed on host stor106
>> [2015-08-16 20:36:23.788387] E [name.c:247:af_inet_client_get_remote_sockaddr] 0-live-client-5: DNS resolution failed on host stor106
>> [2015-08-16 20:36:23.788918] E [name.c:247:af_inet_client_get_remote_sockaddr] 0-live-client-0: DNS resolution failed on host stor104
>> [2015-08-16 20:36:23.789233] E [name.c:247:af_inet_client_get_remote_sockaddr] 0-live-client-2: DNS resolution failed on host stor104
>> [2015-08-16 20:36:23.789295] E [name.c:247:af_inet_client_get_remote_sockaddr] 0-live-client-3: DNS resolution failed on host stor106
>>
>>
>> For the high mem usage part I will try to run rebalance and analyze. In the meantime, it will be helpful if you can take a state dump of the rebalance process when it is using high RAM.
>>
>> Here are the steps to take the state dump.
>>
>> 1. Find your state-dump destination: run "gluster --print-statedumpdir". The state dump will be stored in this location.
>>
>> 2. When you see any rebalance process on any of the servers using high memory, issue the following command:
>> "kill -USR1 <pid-of-rebalance-process>" ("ps aux | grep rebalance" should give the rebalance process pid).
>>
>> The state dump should give some hint about the high mem-usage.
>>
>> Thanks,
>> Susant
>>
>> ----- Original Message -----
>> From: "Susant Palai" <spalai@xxxxxxxxxx>
>> To: "Christophe TREFOIS" <christophe.trefois@xxxxxx>
>> Cc: "Gluster Devel" <gluster-devel@xxxxxxxxxxx>
>> Sent: Friday, 21 August, 2015 3:52:07 PM
>> Subject: Re: Skipped files during rebalance
>>
>> Thanks Christophe for the details. Will get back to you with the analysis.
>>
>> Regards,
>> Susant
>>
>> ----- Original Message -----
>> From: "Christophe TREFOIS" <christophe.trefois@xxxxxx>
>> To: "Susant Palai" <spalai@xxxxxxxxxx>
>> Cc: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>, "Nithya Balachandran" <nbalacha@xxxxxxxxxx>, "Shyamsundar Ranganathan" <srangana@xxxxxxxxxx>, "Mohammed Rafi K C" <rkavunga@xxxxxxxxxx>, "Gluster Devel" <gluster-devel@xxxxxxxxxxx>
>> Sent: Friday, 21 August, 2015 12:39:05 AM
>> Subject: Re: Skipped files during rebalance
>>
>> Dear Susant,
>>
>> The rebalance failed again and also had (in my opinion) excessive RAM usage.
>>
>> Please find a very detailed list below.
>>
>> All logs:
>>
>> http://wikisend.com/download/651948/allstores.tar.gz
>>
>> Thank you for letting me know how I could successfully complete the rebalance process.
>> The fedora pastes are the output of top of each node at that time (more or less).
>>
>> Please let me know if you need more information,
>>
>> Best,
>>
>> ---- Start of mem info
>>
>> # After reboot, before starting glusterd
>>
>> [root@highlander ~]# pdsh -g live 'free -m'
>> stor106:         total    used    free  shared  buff/cache  available
>> stor106: Mem:   193249    2208  190825       9         215     190772
>> stor106: Swap:       0       0       0
>> stor105:         total    used    free  shared  buff/cache  available
>> stor105: Mem:   193248    2275  190738       9         234     190681
>> stor105: Swap:       0       0       0
>> stor104:         total    used    free  shared  buff/cache  available
>> stor104: Mem:   193249    2221  190811       9         216     190757
>> stor104: Swap:       0       0       0
>> [root@highlander ~]#
>>
>> # Gluster Info
>>
>> [root@stor106 glusterfs]# gluster volume info
>>
>> Volume Name: live
>> Type: Distribute
>> Volume ID: 1328637d-7730-4627-8945-bbe43626d527
>> Status: Started
>> Number of Bricks: 9
>> Transport-type: tcp
>> Bricks:
>> Brick1: stor104:/zfs/brick0/brick
>> Brick2: stor104:/zfs/brick1/brick
>> Brick3: stor104:/zfs/brick2/brick
>> Brick4: stor106:/zfs/brick0/brick
>> Brick5: stor106:/zfs/brick1/brick
>> Brick6: stor106:/zfs/brick2/brick
>> Brick7: stor105:/zfs/brick0/brick
>> Brick8: stor105:/zfs/brick1/brick
>> Brick9: stor105:/zfs/brick2/brick
>> Options Reconfigured:
>> nfs.disable: true
>> diagnostics.count-fop-hits: on
>> diagnostics.latency-measurement: on
>> performance.write-behind-window-size: 4MB
>> performance.io-thread-count: 32
>> performance.client-io-threads: on
>> performance.cache-size: 1GB
>> performance.cache-refresh-timeout: 60
>> performance.cache-max-file-size: 4MB
>> cluster.data-self-heal-algorithm: full
>> diagnostics.client-log-level: ERROR
>> diagnostics.brick-log-level: ERROR
>> cluster.min-free-disk: 1%
>> server.allow-insecure: on
>>
>> # Starting glusterd
>>
>> [root@highlander ~]# pdsh -g live 'systemctl start glusterd'
>> [root@highlander ~]# pdsh -g live 'free -m'
>> stor106:         total    used    free  shared  buff/cache  available
>> stor106: Mem:   193249    2290  190569       9         389     190587
>> stor106: Swap:       0       0       0
>> stor104:         total    used    free  shared  buff/cache  available
>> stor104: Mem:   193249    2297  190557       9         394     190571
>> stor104: Swap:       0       0       0
>> stor105:         total    used    free  shared  buff/cache  available
>> stor105: Mem:   193248    2286  190554       9         407     190595
>> stor105: Swap:       0       0       0
>>
>> [root@highlander ~]# systemctl start glusterd
>> [root@highlander ~]# gluster volume start live
>> volume start: live: success
>> [root@highlander ~]# gluster volume status
>> Status of volume: live
>> Gluster process                             TCP Port  RDMA Port  Online  Pid
>> ------------------------------------------------------------------------------
>> Brick stor104:/zfs/brick0/brick             49164     0          Y       5945
>> Brick stor104:/zfs/brick1/brick             49165     0          Y       5963
>> Brick stor104:/zfs/brick2/brick             49166     0          Y       5981
>> Brick stor106:/zfs/brick0/brick             49158     0          Y       5256
>> Brick stor106:/zfs/brick1/brick             49159     0          Y       5274
>> Brick stor106:/zfs/brick2/brick             49160     0          Y       5292
>> Brick stor105:/zfs/brick0/brick             49155     0          Y       5284
>> Brick stor105:/zfs/brick1/brick             49156     0          Y       5302
>> Brick stor105:/zfs/brick2/brick             49157     0          Y       5320
>> NFS Server on localhost                     N/A       N/A        N       N/A
>> NFS Server on 192.168.123.106               N/A       N/A        N       N/A
>> NFS Server on stor105                       N/A       N/A        N       N/A
>> NFS Server on 192.168.123.104               N/A       N/A        N       N/A
>>
>> Task Status of Volume live
>> ------------------------------------------------------------------------------
>> There are no active volume tasks
>>
>> [root@highlander ~]#
>>
>> # Memory usage of each node after 5 minutes
>>
>> Output of top:
>>
>> pdsh -g live 'top -n 1 -b' | fpaste
>>
>> http://paste.fedoraproject.org/256710/14399886/
>>
>> [root@highlander ~]# pdsh -g live 'free -m'
>> stor106:         total    used    free  shared  buff/cache  available
>> stor106: Mem:   193249    6877  184154       9        2218     184250
>> stor106: Swap:       0       0       0
>> stor105:         total    used    free  shared  buff/cache  available
>> stor105: Mem:   193248   22126  169351       9        1771     169403
>> stor105: Swap:       0       0       0
>> stor104:         total    used    free  shared  buff/cache  available
>> stor104: Mem:   193249    2708  188638       9        1902     188687
>> stor104: Swap:       0       0       0
>>
>>
>> # Memory usage of each node after 45 minutes
>>
>> [root@highlander ~]# pdsh -g live 'free -m'
>> stor104:         total    used    free  shared  buff/cache  available
>> stor104: Mem:   193249    3131  184168       9        5949     184524
>> stor104: Swap:       0       0       0
>> stor106:         total    used    free  shared  buff/cache  available
>> stor106: Mem:   193249   27919  158176       9        7153     158894
>> stor106: Swap:       0       0       0
>> stor105:         total    used    free  shared  buff/cache  available
>> stor105: Mem:   193248  117096   70621       9        5530      70891
>> stor105: Swap:       0       0       0
>>
>> http://paste.fedoraproject.org/256726/43999119
>>
>> # Memory usage of each node after 90 minutes
>>
>> [root@highlander ~]# pdsh -g live 'free -m'
>> stor104:         total    used    free  shared  buff/cache  available
>> stor104: Mem:   193249    3390  181034       9        8825     181661
>> stor104: Swap:       0       0       0
>> stor106:         total    used    free  shared  buff/cache  available
>> stor106: Mem:   193249   45780  136424       9       11044     137759
>> stor106: Swap:       0       0       0
>> stor105:         total    used    free  shared  buff/cache  available
>> stor105: Mem:   193248  151483   33492       9        8272      33972
>> stor105: Swap:       0       0       0
>>
>> http://paste.fedoraproject.org/256745/14399937
>>
>> # Memory usage after 5 hours
>>
>> ```bash
>> [root@highlander ~]# pdsh -g live 'free -m'
>> stor104:         total    used    free  shared  buff/cache  available
>> stor104: Mem:   193249    4645  163186       9       25417     165473
>> stor104: Swap:       0       0       0
>> stor105:         total    used    free  shared  buff/cache  available
>> stor105: Mem:   193248  155094   14784       9       23369      16640
>> stor105: Swap:       0       0       0
>> stor106:         total    used    free  shared  buff/cache  available
>> stor106: Mem:   193249  141379   16515       9       35355      23714
>> stor106: Swap:       0       0       0
>> ```
>>
>> http://paste.fedoraproject.org/256879/44001235
>>
>> # Memory usage after 6 hours
>>
>> ```bash
>> [root@highlander ~]# pdsh -g live 'free -m'
>> stor106:         total    used    free  shared  buff/cache  available
>> stor106: Mem:   193249  140526   12207       9       40516      21612
>> stor106: Swap:       0       0       0
>> stor104:         total    used    free  shared  buff/cache  available
>> stor104: Mem:   193249  102923   58748       9       31578      63632
>> stor104: Swap:       0       0       0
>> stor105:         total    used    free  shared  buff/cache  available
>> stor105: Mem:   193248  155394   10876       9       26977      13154
>> stor105: Swap:       0       0       0
>> ```
>>
>> http://paste.fedoraproject.org/256905/00168781
>>
>> # Memory after 24 hours + Failed
>>
>> ```bash
>> [root@highlander ~]# pdsh -g live 'free -m'
>> stor105:         total    used    free  shared  buff/cache  available
>> stor105: Mem:   193248  136123    6323       9       50801      10281
>> stor105: Swap:       0       0       0
>> stor104:         total    used    free  shared  buff/cache  available
>> stor104: Mem:   193249  125320    2812       9       65116      17337
>> stor104: Swap:       0       0       0
>> stor106:         total    used    free  shared  buff/cache  available
>> stor106: Mem:   193249  111997   13969       9       67282      19429
>> stor106: Swap:       0       0       0
>> [root@highlander ~]#
>> ```
>>
>> http://paste.fedoraproject.org/257254/14400880
>>
>> # Failed logs
>>
>> ```bash
>> [root@highlander ~]# gluster volume rebalance live status
>> Node             Rebalanced-files  size         scanned      failures     skipped      status        run time in secs
>> ---------        -----------       -----------  -----------  -----------  -----------  ------------  --------------
>> 192.168.123.104  748812            4.4TB        4160456      1311         156772       failed        63114.00
>> 192.168.123.106  1187917           3.3TB        6021931      21625        1209503      failed        75243.00
>> stor105          0                 0Bytes       2440431      16           196          failed        69658.00
>> volume rebalance: live: success:
>> ```
>>
>>
>>
>> Dr Christophe Trefois, Dipl.-Ing.
>> Technical Specialist / Post-Doc
>>
>> UNIVERSITÉ DU LUXEMBOURG
>>
>> LUXEMBOURG CENTRE FOR SYSTEMS BIOMEDICINE
>> Campus Belval | House of Biomedicine
>> 6, avenue du Swing
>> L-4367 Belvaux
>> T: +352 46 66 44 6124
>> F: +352 46 66 44 6949
>> http://www.uni.lu/lcsb
>>
>>
>>
>> ----
>> This message is confidential and may contain privileged information.
>> It is intended for the named recipient only.
>> If you receive it in error please notify me and permanently delete the original message and any copies.
>> ----
>>
>>
>>
>>> On 19 Aug 2015, at 08:14, Susant Palai <spalai@xxxxxxxxxx> wrote:
>>>
>>> Comments inline.
>>>
>>> ----- Original Message -----
>>>> From: "Christophe TREFOIS" <christophe.trefois@xxxxxx>
>>>> To: "Susant Palai" <spalai@xxxxxxxxxx>
>>>> Cc: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>, "Nithya Balachandran" <nbalacha@xxxxxxxxxx>, "Shyamsundar Ranganathan" <srangana@xxxxxxxxxx>, "Mohammed Rafi K C" <rkavunga@xxxxxxxxxx>, "Gluster Devel" <gluster-devel@xxxxxxxxxxx>
>>>> Sent: Tuesday, August 18, 2015 8:08:41 PM
>>>> Subject: Re: Skipped files during rebalance
>>>>
>>>> Hi Susant,
>>>>
>>>> Thank you for the response.
>>>>
>>>>> On 18 Aug 2015, at 10:45, Susant Palai <spalai@xxxxxxxxxx> wrote:
>>>>>
>>>>> Hi Christophe,
>>>>>
>>>>> Need some info regarding the high mem-usage.
>>>>>
>>>>> 1. Top output: To see whether any other process is eating up memory.
>>> I would be interested to know the memory usage of all the gluster processes involved in the high mem-usage. These processes include glusterfsd, glusterd, gluster, any mount process (glusterfs), and rebalance (glusterfs).
>>>
>>>
>>>>> 2. Gluster volume info
>>>> [root@highlander ~]# gluster volume info
>>>>
>>>> Volume Name: live
>>>> Type: Distribute
>>>> Volume ID: 1328637d-7730-4627-8945-bbe43626d527
>>>> Status: Started
>>>> Number of Bricks: 9
>>>> Transport-type: tcp
>>>> Bricks:
>>>> Brick1: stor104:/zfs/brick0/brick
>>>> Brick2: stor104:/zfs/brick1/brick
>>>> Brick3: stor104:/zfs/brick2/brick
>>>> Brick4: stor106:/zfs/brick0/brick
>>>> Brick5: stor106:/zfs/brick1/brick
>>>> Brick6: stor106:/zfs/brick2/brick
>>>> Brick7: stor105:/zfs/brick0/brick
>>>> Brick8: stor105:/zfs/brick1/brick
>>>> Brick9: stor105:/zfs/brick2/brick
>>>> Options Reconfigured:
>>>> diagnostics.count-fop-hits: on
>>>> diagnostics.latency-measurement: on
>>>> server.allow-insecure: on
>>>> cluster.min-free-disk: 1%
>>>> diagnostics.brick-log-level: ERROR
>>>> diagnostics.client-log-level: ERROR
>>>> cluster.data-self-heal-algorithm: full
>>>> performance.cache-max-file-size: 4MB
>>>> performance.cache-refresh-timeout: 60
>>>> performance.cache-size: 1GB
>>>> performance.client-io-threads: on
>>>> performance.io-thread-count: 32
>>>> performance.write-behind-window-size: 4MB
>>>>
>>>>> 3. Is the rebalance process still running? If yes, can you point to the specific mem usage by the rebalance process? Was the high mem-usage seen during rebalance or even post rebalance?
>>>> I would like to restart the rebalance process since it failed… But I can’t, as the volume cannot be stopped (I wanted to reboot the servers to have a clean testing ground).
>>>>
>>>> Here are the logs from the three nodes:
>>>> http://paste.fedoraproject.org/256183/43989079
>>>>
>>>> Maybe you could help me figure out how to stop the volume?
>>>>
>>>> This is what happens
>>>>
>>>> [root@highlander ~]# gluster volume rebalance live stop
>>>> volume rebalance: live: failed: Rebalance not started.
>>> Requesting glusterd team to give input.
>>>> [root@highlander ~]# ssh stor105 "gluster volume rebalance live stop"
>>>> volume rebalance: live: failed: Rebalance not started.
>>>>
>>>> [root@highlander ~]# ssh stor104 "gluster volume rebalance live stop"
>>>> volume rebalance: live: failed: Rebalance not started.
>>>>
>>>> [root@highlander ~]# ssh stor106 "gluster volume rebalance live stop"
>>>> volume rebalance: live: failed: Rebalance not started.
>>>>
>>>> [root@highlander ~]# gluster volume rebalance live stop
>>>> volume rebalance: live: failed: Rebalance not started.
>>>>
>>>> [root@highlander ~]# gluster volume stop live
>>>> Stopping volume will make its data inaccessible. Do you want to continue? (y/n) y
>>>> volume stop: live: failed: Staging failed on stor106. Error: rebalance session is in progress for the volume 'live'
>>>> Staging failed on stor104. Error: rebalance session is in progress for the volume 'live'
>>> Can you run [ps aux | grep "rebalance"] on all the servers and post the output here? I just want to check whether rebalance is really running or not. Again requesting the glusterd team to give input.
>>>
>>>>
>>>>> 4. Gluster version
>>>> [root@highlander ~]# pdsh -g live 'rpm -qa | grep gluster'
>>>> stor104: glusterfs-api-3.7.3-1.el7.x86_64
>>>> stor104: glusterfs-server-3.7.3-1.el7.x86_64
>>>> stor104: glusterfs-libs-3.7.3-1.el7.x86_64
>>>> stor104: glusterfs-3.7.3-1.el7.x86_64
>>>> stor104: glusterfs-fuse-3.7.3-1.el7.x86_64
>>>> stor104: glusterfs-client-xlators-3.7.3-1.el7.x86_64
>>>> stor104: glusterfs-cli-3.7.3-1.el7.x86_64
>>>>
>>>> stor105: glusterfs-3.7.3-1.el7.x86_64
>>>> stor105: glusterfs-client-xlators-3.7.3-1.el7.x86_64
>>>> stor105: glusterfs-api-3.7.3-1.el7.x86_64
>>>> stor105: glusterfs-cli-3.7.3-1.el7.x86_64
>>>> stor105: glusterfs-server-3.7.3-1.el7.x86_64
>>>> stor105: glusterfs-libs-3.7.3-1.el7.x86_64
>>>> stor105: glusterfs-fuse-3.7.3-1.el7.x86_64
>>>>
>>>> stor106: glusterfs-libs-3.7.3-1.el7.x86_64
>>>> stor106: glusterfs-fuse-3.7.3-1.el7.x86_64
>>>> stor106: glusterfs-client-xlators-3.7.3-1.el7.x86_64
>>>> stor106: glusterfs-api-3.7.3-1.el7.x86_64
>>>> stor106: glusterfs-cli-3.7.3-1.el7.x86_64
>>>> stor106: glusterfs-server-3.7.3-1.el7.x86_64
>>>> stor106: glusterfs-3.7.3-1.el7.x86_64
>>>>
>>>>> Will ask for more information in case needed.
>>>>>
>>>>> Regards,
>>>>> Susant
>>>>>
>>>>>
>>>>> ----- Original Message -----
>>>>>> From: "Christophe TREFOIS" <christophe.trefois@xxxxxx>
>>>>>> To: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>, "Nithya Balachandran" <nbalacha@xxxxxxxxxx>, "Susant Palai" <spalai@xxxxxxxxxx>, "Shyamsundar Ranganathan" <srangana@xxxxxxxxxx>
>>>>>> Cc: "Mohammed Rafi K C" <rkavunga@xxxxxxxxxx>
>>>>>> Sent: Monday, 17 August, 2015 7:03:20 PM
>>>>>> Subject: Fwd: Skipped files during rebalance
>>>>>>
>>>>>> Hi DHT team,
>>>>>>
>>>>>> This email somehow didn’t get forwarded to you.
>>>>>>
>>>>>> In addition to my problem described below, here is one example of free memory after everything failed
>>>>>>
>>>>>> [root@highlander ~]# pdsh -g live 'free -m'
>>>>>> stor106:         total    used    free  shared  buff/cache  available
>>>>>> stor106: Mem:   193249  124784    1347       9       67118      12769
>>>>>> stor106: Swap:       0       0       0
>>>>>> stor104:         total    used    free  shared  buff/cache  available
>>>>>> stor104: Mem:   193249  107617   31323       9       54308      42752
>>>>>> stor104: Swap:       0       0       0
>>>>>> stor105:         total    used    free  shared  buff/cache  available
>>>>>> stor105: Mem:   193248  141804    6736       9       44707       9713
>>>>>> stor105: Swap:       0       0       0
>>>>>>
>>>>>> So after the failed operation, there’s almost no memory free, and it is also not freed up.
>>>>>>
>>>>>> Thank you for any pointers,
>>>>>>
>>>>>> Kind regards,
>>>>>>
>>>>>> --
>>>>>> Christophe
>>>>>>
>>>>>>
>>>>>> Begin forwarded message:
>>>>>>
>>>>>> From: Christophe TREFOIS <christophe.trefois@xxxxxx>
>>>>>> Subject: Re: Skipped files during rebalance
>>>>>> Date: 17 Aug 2015 11:54:32 CEST
>>>>>> To: Mohammed Rafi K C <rkavunga@xxxxxxxxxx>
>>>>>> Cc: "gluster-devel@xxxxxxxxxxx" <gluster-devel@xxxxxxxxxxx>
>>>>>>
>>>>>> Dear Rafi,
>>>>>>
>>>>>> Thanks for submitting a patch.
>>>>>>
>>>>>> @DHT, I have two additional questions / problems.
>>>>>>
>>>>>> 1. When doing a rebalance (with data), RAM consumption on the nodes goes dramatically high, e.g. out of 196 GB available per node, RAM usage would fill up to 195.6 GB. This seems quite excessive and strange to me.
>>>>>>
>>>>>> 2. As you can see, the rebalance (with data) failed as one endpoint becomes unconnected (even though it still is connected). I’m thinking this could be due to the high RAM usage?
>>>>>>
>>>>>> Thank you for your help,
>>>>>>
>>>>>> --
>>>>>> Christophe
>>>>>>
>>>>>> Dr Christophe Trefois, Dipl.-Ing.
>>>>>> Technical Specialist / Post-Doc
>>>>>>
>>>>>> UNIVERSITÉ DU LUXEMBOURG
>>>>>>
>>>>>> LUXEMBOURG CENTRE FOR SYSTEMS BIOMEDICINE
>>>>>> Campus Belval | House of Biomedicine
>>>>>> 6, avenue du Swing
>>>>>> L-4367 Belvaux
>>>>>> T: +352 46 66 44 6124
>>>>>> F: +352 46 66 44 6949
>>>>>> http://www.uni.lu/lcsb
>>>>>>
>>>>>>
>>>>>> ----
>>>>>> This message is confidential and may contain privileged information.
>>>>>> It is intended for the named recipient only.
>>>>>> If you receive it in error please notify me and permanently delete the original message and any copies.
>>>>>> ----
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 17 Aug 2015, at 11:27, Mohammed Rafi K C <rkavunga@xxxxxxxxxx> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 08/17/2015 01:58 AM, Christophe TREFOIS wrote:
>>>>>> Dear all,
>>>>>>
>>>>>> I have successfully added a new node to our setup, and finally managed to get a successful fix-layout run as well with no errors.
>>>>>>
>>>>>> Now, as per the documentation, I started a gluster volume rebalance live start task and I see many skipped files.
>>>>>> The error log then contains entries like the following for each skipped file.
>>>>>>
>>>>>> [2015-08-16 20:23:30.591161] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/004010008.flex lookup failed
>>>>>> [2015-08-16 20:23:30.768391] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/007005003.flex lookup failed
>>>>>> [2015-08-16 20:23:30.804811] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/006005009.flex lookup failed
>>>>>> [2015-08-16 20:23:30.805201] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/005006011.flex lookup failed
>>>>>> [2015-08-16 20:23:30.880037] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/005009012.flex lookup failed
>>>>>> [2015-08-16 20:23:31.038236] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/003008007.flex lookup failed
>>>>>> [2015-08-16 20:23:31.259762] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/004008006.flex lookup failed
>>>>>> [2015-08-16 20:23:31.333764] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/007008001.flex lookup failed
>>>>>> [2015-08-16 20:23:31.340190] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/006007004.flex lookup failed
>>>>>>
>>>>>> Update: one of the rebalance tasks now failed.
>>>>>>
>>>>>> @Rafi, I got the same error as Friday except this time with data.
>>>>>>
>>>>>> Packets carrying the ping request could be waiting in the queue for the whole time-out period because of heavy traffic in the network. I have sent a patch for this. You can track its status here: http://review.gluster.org/11935
>>>>>>
>>>>>>
>>>>>>
>>>>>> [2015-08-16 20:24:34.533167] C [rpc-clnt-ping.c:161:rpc_clnt_ping_timer_expired] 0-live-client-0: server 192.168.123.104:49164 has not responded in the last 42 seconds, disconnecting.
>>>>>> [2015-08-16 20:24:34.533614] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(INODELK(29)) called at 2015-08-16 20:23:51.305640 (xid=0x5dd4da)
>>>>>> [2015-08-16 20:24:34.533672] E [MSGID: 114031] [client-rpc-fops.c:1621:client3_3_inodelk_cbk] 0-live-client-0: remote operation failed [Transport endpoint is not connected]
>>>>>> [2015-08-16 20:24:34.534201] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(READ(12)) called at 2015-08-16 20:23:51.303938 (xid=0x5dd4d7)
>>>>>> [2015-08-16 20:24:34.534347] E [MSGID: 109023] [dht-rebalance.c:1124:dht_migrate_file] 0-live-dht: Migrate file failed: /hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/007008007.flex: failed to migrate data
>>>>>> [2015-08-16 20:24:34.534413] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(READ(12)) called at 2015-08-16 20:23:51.303969 (xid=0x5dd4d8)
>>>>>> [2015-08-16 20:24:34.534579] E [MSGID: 109023] [dht-rebalance.c:1124:dht_migrate_file] 0-live-dht: Migrate file failed: /hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/007009012.flex: failed to migrate data
>>>>>> [2015-08-16 20:24:34.534676] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(READ(12)) called at 2015-08-16 20:23:51.313548 (xid=0x5dd4db)
>>>>>> [2015-08-16 20:24:34.534745] E [MSGID: 109023] [dht-rebalance.c:1124:dht_migrate_file] 0-live-dht: Migrate file failed: /hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/006008011.flex: failed to migrate data
>>>>>> [2015-08-16 20:24:34.535199] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(READ(12)) called at 2015-08-16 20:23:51.326369 (xid=0x5dd4dc)
>>>>>> [2015-08-16 20:24:34.535232] E [MSGID: 109023] [dht-rebalance.c:1124:dht_migrate_file] 0-live-dht: Migrate file failed: /hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/005003001.flex: failed to migrate data
>>>>>> [2015-08-16 20:24:34.535984] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(READ(12)) called at 2015-08-16 20:23:51.326437 (xid=0x5dd4dd)
>>>>>> [2015-08-16 20:24:34.536069] E [MSGID: 109023] [dht-rebalance.c:1124:dht_migrate_file] 0-live-dht: Migrate file failed: /hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/007010012.flex: failed to migrate data
>>>>>> [2015-08-16 20:24:34.536267] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2015-08-16 20:23:51.337240 (xid=0x5dd4de)
>>>>>> [2015-08-16 20:24:34.536339] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_08(2013-10-11_20-12-25)/002005012.flex lookup failed
>>>>>> [2015-08-16 20:24:34.536487] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2015-08-16 20:23:51.425254 (xid=0x5dd4df)
>>>>>> [2015-08-16 20:24:34.536685] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2015-08-16 20:23:51.738907 (xid=0x5dd4e0)
>>>>>> [2015-08-16 20:24:34.536891] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2015-08-16 20:23:51.805096 (xid=0x5dd4e1)
>>>>>> [2015-08-16 20:24:34.537316] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2015-08-16 20:23:51.805977 (xid=0x5dd4e2)
>>>>>> [2015-08-16 20:24:34.537735] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) 0-live-client-0: forced unwinding frame type(GF-DUMP) op(NULL(2)) called at 2015-08-16 20:23:52.530107 (xid=0x5dd4e3)
>>>>>> [2015-08-16 20:24:34.538475] E [MSGID: 114031] [client-rpc-fops.c:1621:client3_3_inodelk_cbk] 0-live-client-0: remote operation failed [Transport endpoint is not connected]
>>>>>> The message "E [MSGID: 114031] [client-rpc-fops.c:1621:client3_3_inodelk_cbk] 0-live-client-0: remote operation failed [Transport endpoint is not connected]" repeated 4 times between [2015-08-16 20:24:34.538475] and [2015-08-16 20:24:34.538535]
>>>>>> [2015-08-16 20:24:34.538584] E [MSGID: 109023] [dht-rebalance.c:1617:gf_defrag_migrate_single_file] 0-live-dht: Migrate file failed: 002004003.flex lookup failed
>>>>>>
[2015-08-16 20:24:34.538904] E [MSGID: 109023] >>>>>> [dht-rebalance.c:1617:gf_defrag_migrate_single_file] 0-live-dht: Migrate >>>>>> file failed: 003009008.flex lookup failed >>>>>> [2015-08-16 20:24:34.539724] E [MSGID: 109023] >>>>>> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file >>>>>> failed:/hcs/hcs/OperaArchiveCol/SK >>>>>> 20131011_Oligo_Rot_lowConc_P1/Meas_08(2013-10-11_20-12-25)/005009006.flex >>>>>> lookup failed >>>>>> [2015-08-16 20:24:34.539820] E [MSGID: 109016] >>>>>> [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed >>>>>> for /hcs/hcs/OperaArchiveCol/SK >>>>>> 20131011_Oligo_Rot_lowConc_P1/Meas_08(2013-10-11_20-12-25) >>>>>> [2015-08-16 20:24:34.540031] E [MSGID: 109016] >>>>>> [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed >>>>>> for /hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1 >>>>>> [2015-08-16 20:24:34.540691] E [MSGID: 114031] >>>>>> [client-rpc-fops.c:251:client3_3_mknod_cbk] 0-live-client-0: remote >>>>>> operation failed. Path: /hcs/hcs/OperaArchiveCol/SK >>>>>> 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/002005008.flex >>>>>> [Transport endpoint is not connected] >>>>>> [2015-08-16 20:24:34.541152] E [MSGID: 114031] >>>>>> [client-rpc-fops.c:251:client3_3_mknod_cbk] 0-live-client-0: remote >>>>>> operation failed. Path: /hcs/hcs/OperaArchiveCol/SK >>>>>> 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/005004009.flex >>>>>> [Transport endpoint is not connected] >>>>>> [2015-08-16 20:24:34.541331] E [MSGID: 114031] >>>>>> [client-rpc-fops.c:251:client3_3_mknod_cbk] 0-live-client-0: remote >>>>>> operation failed. 
Path: /hcs/hcs/OperaArchiveCol/SK >>>>>> 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/007005011.flex >>>>>> [Transport endpoint is not connected] >>>>>> [2015-08-16 20:24:34.541486] E [MSGID: 109016] >>>>>> [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed >>>>>> for /hcs/hcs/OperaArchiveCol >>>>>> [2015-08-16 20:24:34.541572] E [MSGID: 109016] >>>>>> [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed >>>>>> for /hcs/hcs >>>>>> [2015-08-16 20:24:34.541639] E [MSGID: 109016] >>>>>> [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed >>>>>> for /hcs >>>>>> >>>>>> Any help would be greatly appreciated. >>>>>> CCing the DHT team to give you a better idea of why the rebalance failed and >>>>>> of the huge memory consumption by the rebalance process (200 GB RAM). >>>>>> >>>>>> Regards >>>>>> Rafi KC >>>>>> >>>>>> Thanks, >>>>>> >>>>>> -- >>>>>> Christophe >>>>>> >>>>>> Dr Christophe Trefois, Dipl.-Ing. >>>>>> Technical Specialist / Post-Doc >>>>>> >>>>>> UNIVERSITÉ DU LUXEMBOURG >>>>>> >>>>>> LUXEMBOURG CENTRE FOR SYSTEMS BIOMEDICINE >>>>>> Campus Belval | House of Biomedicine >>>>>> 6, avenue du Swing >>>>>> L-4367 Belvaux >>>>>> T: +352 46 66 44 6124 >>>>>> F: +352 46 66 44 6949 >>>>>> http://www.uni.lu/lcsb >>>>>> >>>>>> ---- >>>>>> This message is confidential and may contain privileged information. >>>>>> It is intended for the named recipient only. >>>>>> If you receive it in error please notify me and permanently delete the >>>>>> original message and any copies.
>>>>>> ---- >>>>>> >>>>>> _______________________________________________ >>>>>> Gluster-devel mailing list >>>>>> Gluster-devel@xxxxxxxxxxx >>>>>> http://www.gluster.org/mailman/listinfo/gluster-devel