Re: Fwd: Skipped files during rebalance

Mohammed Rafi K C <rkavunga@xxxxxxxxxx> · Mon, 24 Aug 2015 12:37:50 +0530

>
>
> Dear Susant,
>
> Do you think the patch submitted by Rafi could help with this?
Yes, unless there is a network failure/problem in your setup.

>
> The nodes are on the same network in the same rack and as such should have no connectivity issues.

I can see DNS resolution failures in the log messages; can you cross-check
the connection setup?

>
> Is it possible that the processes on nodes 104 and 106 were too “busy” and unable to accept new connections?
>
> Any help would be appreciated,
>
> —
> Christophe
>
> Dr Christophe Trefois, Dipl.-Ing.  
> Technical Specialist / Post-Doc
>
> UNIVERSITÉ DU LUXEMBOURG
>
> LUXEMBOURG CENTRE FOR SYSTEMS BIOMEDICINE
> Campus Belval | House of Biomedicine  
> 6, avenue du Swing 
> L-4367 Belvaux  
> T: +352 46 66 44 6124 
> F: +352 46 66 44 6949  
> http://www.uni.lu/lcsb
>
>         
>
> ----
> This message is confidential and may contain privileged information. 
> It is intended for the named recipient only. 
> If you receive it in error please notify me and permanently delete the original message and any copies. 
> ----
>
>   
>
>> On 21 Aug 2015, at 14:57, Susant Palai <spalai@xxxxxxxxxx> wrote:
>>
>> Hi,
>> The rebalance failures are mostly due to network problems.
>>
>> Here is the log:
>>
>> [2015-08-16 20:31:36.301467] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/PA 27112012_ATCC_Fibroblasts_Chem/Meas_10(2012-11-27_20-15-48)/003002002.flex lookup failed
>> [2015-08-16 20:31:36.921405] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/PA 27112012_ATCC_Fibroblasts_Chem/Meas_10(2012-11-27_20-15-48)/003004005.flex lookup failed
>> [2015-08-16 20:31:36.921591] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/PA 27112012_ATCC_Fibroblasts_Chem/Meas_10(2012-11-27_20-15-48)/006004004.flex lookup failed
>> [2015-08-16 20:31:36.921770] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/PA 27112012_ATCC_Fibroblasts_Chem/Meas_10(2012-11-27_20-15-48)/005004007.flex lookup failed
>> [2015-08-16 20:31:37.577758] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/PA 27112012_ATCC_Fibroblasts_Chem/Meas_10(2012-11-27_20-15-48)/007004005.flex lookup failed
>> [2015-08-16 20:34:12.387425] E [socket.c:2332:socket_connect_finish] 0-live-client-4: connection to 192.168.123.106:24007 failed (Connection refused)
>> [2015-08-16 20:34:12.392820] E [socket.c:2332:socket_connect_finish] 0-live-client-5: connection to 192.168.123.106:24007 failed (Connection refused)
>> [2015-08-16 20:34:12.398023] E [socket.c:2332:socket_connect_finish] 0-live-client-0: connection to 192.168.123.104:24007 failed (Connection refused)
>> [2015-08-16 20:34:12.402904] E [socket.c:2332:socket_connect_finish] 0-live-client-2: connection to 192.168.123.104:24007 failed (Connection refused)
>> [2015-08-16 20:34:12.407464] E [socket.c:2332:socket_connect_finish] 0-live-client-3: connection to 192.168.123.106:24007 failed (Connection refused)
>> [2015-08-16 20:34:12.412249] E [socket.c:2332:socket_connect_finish] 0-live-client-1: connection to 192.168.123.104:24007 failed (Connection refused)
>> [2015-08-16 20:34:12.416621] E [socket.c:2332:socket_connect_finish] 0-live-client-6: connection to 192.168.123.105:24007 failed (Connection refused)
>> [2015-08-16 20:34:12.420906] E [socket.c:2332:socket_connect_finish] 0-live-client-8: connection to 192.168.123.105:24007 failed (Connection refused)
>> [2015-08-16 20:34:12.425066] E [socket.c:2332:socket_connect_finish] 0-live-client-7: connection to 192.168.123.105:24007 failed (Connection refused)
>> [2015-08-16 20:34:17.479925] E [socket.c:2332:socket_connect_finish] 0-glusterfs: connection to 127.0.0.1:24007 failed (Connection refused)
>> [2015-08-16 20:36:23.788206] E [MSGID: 101075] [common-utils.c:314:gf_resolve_ip6] 0-resolver: getaddrinfo failed (Name or service not known)
>> [2015-08-16 20:36:23.788286] E [name.c:247:af_inet_client_get_remote_sockaddr] 0-live-client-4: DNS resolution failed on host stor106
>> [2015-08-16 20:36:23.788387] E [name.c:247:af_inet_client_get_remote_sockaddr] 0-live-client-5: DNS resolution failed on host stor106
>> [2015-08-16 20:36:23.788918] E [name.c:247:af_inet_client_get_remote_sockaddr] 0-live-client-0: DNS resolution failed on host stor104
>> [2015-08-16 20:36:23.789233] E [name.c:247:af_inet_client_get_remote_sockaddr] 0-live-client-2: DNS resolution failed on host stor104
>> [2015-08-16 20:36:23.789295] E [name.c:247:af_inet_client_get_remote_sockaddr] 0-live-client-3: DNS resolution failed on host stor106
>>
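To see how the errors in a rebalance log split between lookup failures, refused connections and name-resolution failures, the lines can be tallied with a short script. This is only a sketch: the line layout is inferred from the excerpt above, and the category names and `SAMPLE` list are made up for illustration.

```python
import re
from collections import Counter

# Layout assumed from the excerpt above:
# [timestamp] LEVEL [optional MSGID] [file:line:function] domain: message
LOG_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+(?P<level>[A-Z])\s+"
    r"(?:\[MSGID: \d+\]\s+)?"
    r"\[(?P<src>[^\]]+)\]\s+(?P<domain>\S+):\s+(?P<msg>.*)$"
)

# Hypothetical category names, keyed by a substring of the message text.
CATEGORIES = {
    "lookup failed": "lookup_failed",
    "Connection refused": "connection_refused",
    "DNS resolution failed": "dns_failure",
    "getaddrinfo failed": "dns_failure",
}


def classify(line):
    """Return a category for an error/critical line, or None for anything else."""
    match = LOG_RE.match(line.strip())
    if match is None or match.group("level") not in ("E", "C"):
        return None
    message = match.group("msg")
    for needle, name in CATEGORIES.items():
        if needle in message:
            return name
    return "other"


def tally(lines):
    """Count classified lines per category."""
    return Counter(c for c in map(classify, lines) if c is not None)


# Four lines copied from the log excerpt above.
SAMPLE = [
    "[2015-08-16 20:31:36.301467] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/PA 27112012_ATCC_Fibroblasts_Chem/Meas_10(2012-11-27_20-15-48)/003002002.flex lookup failed",
    "[2015-08-16 20:34:12.387425] E [socket.c:2332:socket_connect_finish] 0-live-client-4: connection to 192.168.123.106:24007 failed (Connection refused)",
    "[2015-08-16 20:36:23.788206] E [MSGID: 101075] [common-utils.c:314:gf_resolve_ip6] 0-resolver: getaddrinfo failed (Name or service not known)",
    "[2015-08-16 20:36:23.788286] E [name.c:247:af_inet_client_get_remote_sockaddr] 0-live-client-4: DNS resolution failed on host stor106",
]

if __name__ == "__main__":
    for category, count in tally(SAMPLE).items():
        print(category, count)
```

Run over a full rebalance log (one line per entry), a tally like this would show whether the connection and DNS errors cluster around the time the migration failures start.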
>>
>> For the high memory usage, I will try to run a rebalance and analyze it. In the meantime, it would be helpful if you could take a state dump of the rebalance process while it is using a lot of RAM.
>>
>> Here are the steps to take the state dump.
>>
>> 1. Find your state-dump destination by running "gluster --print-statedumpdir". The state dump will be stored in this location.
>>
>> 2. When you see any of the rebalance processes on any of the servers using high memory, issue the following command:
>>   "kill -USR1 <pid-of-rebalance-process>"  ("ps aux | grep rebalance" should give the rebalance process pid.)
>>
>> The state dump should give some hint about the high mem-usage.
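The pid lookup and signal in step 2 can also be scripted. The following is a minimal sketch, assuming `ps aux` output in the usual 11-column layout; the function names and the sample process list (including its command lines) are made up for illustration, and sending the signal needs sufficient privileges on the server.

```python
import os
import signal


def rebalance_pids(ps_output):
    """Return PIDs of processes whose `ps aux` command line mentions rebalance."""
    pids = []
    for line in ps_output.splitlines():
        # USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # header or malformed line
        command = fields[10]
        # Skip the grep process itself, as `ps aux | grep rebalance` would list it.
        if "rebalance" in command and not command.startswith("grep"):
            pids.append(int(fields[1]))
    return pids


def request_statedump(pid):
    """Same effect as `kill -USR1 <pid>`."""
    os.kill(pid, signal.SIGUSR1)


# Illustrative `ps aux` output; the command lines are invented for this example.
SAMPLE_PS = """\
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      5945  1.0  0.1 123456  7890 ?        Ssl  10:00   0:05 /usr/sbin/glusterfsd -s stor104 --volfile-id live.stor104.zfs-brick0-brick
root      7001 55.0 40.2 999999 88888 ?        Ssl  10:05  90:00 /usr/sbin/glusterfs -s localhost --volfile-id rebalance/live
root      7100  0.0  0.0  11200   900 pts/0    S+   12:00   0:00 grep --color=auto rebalance
"""

if __name__ == "__main__":
    print(rebalance_pids(SAMPLE_PS))
```

On a real server the input would come from running `ps aux` there, and `request_statedump` would be called once per pid returned.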
>>
>> Thanks,
>> Susant
>>
>> ----- Original Message -----
>> From: "Susant Palai" <spalai@xxxxxxxxxx>
>> To: "Christophe TREFOIS" <christophe.trefois@xxxxxx>
>> Cc: "Gluster Devel" <gluster-devel@xxxxxxxxxxx>
>> Sent: Friday, 21 August, 2015 3:52:07 PM
>> Subject: Re:  Skipped files during rebalance
>>
>> Thanks Christophe for the details. Will get back to you with the analysis.
>>
>> Regards,
>> Susant
>>
>> ----- Original Message -----
>> From: "Christophe TREFOIS" <christophe.trefois@xxxxxx>
>> To: "Susant Palai" <spalai@xxxxxxxxxx>
>> Cc: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>, "Nithya Balachandran" <nbalacha@xxxxxxxxxx>, "Shyamsundar Ranganathan" <srangana@xxxxxxxxxx>, "Mohammed Rafi K C" <rkavunga@xxxxxxxxxx>, "Gluster Devel" <gluster-devel@xxxxxxxxxxx>
>> Sent: Friday, 21 August, 2015 12:39:05 AM
>> Subject: Re:  Skipped files during rebalance
>>
>> Dear Susant,
>>
>> The rebalance failed again and also had (in my opinion) excessive RAM usage.
>>
>> Please find a very detailed list below.
>>
>> All logs:
>>
>> http://wikisend.com/download/651948/allstores.tar.gz
>>
>> Thank you for letting me know how I could successfully complete the rebalance process.
>> The fedora pastes are the output of top of each node at that time (more or less).
>>
>> Please let me know if you need more information,
>>
>> Best,
>>
>> —— Start of mem info
>>
>> # After reboot, before starting glusterd
>>
>> [root@highlander ~]# pdsh -g live 'free -m'
>> stor106:               total        used        free      shared  buff/cache   available
>> stor106: Mem:         193249        2208      190825           9         215      190772
>> stor106: Swap:             0           0           0
>> stor105:               total        used        free      shared  buff/cache   available
>> stor105: Mem:         193248        2275      190738           9         234      190681
>> stor105: Swap:             0           0           0
>> stor104:               total        used        free      shared  buff/cache   available
>> stor104: Mem:         193249        2221      190811           9         216      190757
>> stor104: Swap:             0           0           0
>> [root@highlander ~]#
>>
>> # Gluster Info
>>
>> [root@stor106 glusterfs]# gluster volume info
>>
>> Volume Name: live
>> Type: Distribute
>> Volume ID: 1328637d-7730-4627-8945-bbe43626d527
>> Status: Started
>> Number of Bricks: 9
>> Transport-type: tcp
>> Bricks:
>> Brick1: stor104:/zfs/brick0/brick
>> Brick2: stor104:/zfs/brick1/brick
>> Brick3: stor104:/zfs/brick2/brick
>> Brick4: stor106:/zfs/brick0/brick
>> Brick5: stor106:/zfs/brick1/brick
>> Brick6: stor106:/zfs/brick2/brick
>> Brick7: stor105:/zfs/brick0/brick
>> Brick8: stor105:/zfs/brick1/brick
>> Brick9: stor105:/zfs/brick2/brick
>> Options Reconfigured:
>> nfs.disable: true
>> diagnostics.count-fop-hits: on
>> diagnostics.latency-measurement: on
>> performance.write-behind-window-size: 4MB
>> performance.io-thread-count: 32
>> performance.client-io-threads: on
>> performance.cache-size: 1GB
>> performance.cache-refresh-timeout: 60
>> performance.cache-max-file-size: 4MB
>> cluster.data-self-heal-algorithm: full
>> diagnostics.client-log-level: ERROR
>> diagnostics.brick-log-level: ERROR
>> cluster.min-free-disk: 1%
>> server.allow-insecure: on
>>
>> # Starting glusterd
>>
>> [root@highlander ~]# pdsh -g live 'systemctl start glusterd'
>> [root@highlander ~]# pdsh -g live 'free -m'
>> stor106:               total        used        free      shared  buff/cache   available
>> stor106: Mem:         193249        2290      190569           9         389      190587
>> stor106: Swap:             0           0           0
>> stor104:               total        used        free      shared  buff/cache   available
>> stor104: Mem:         193249        2297      190557           9         394      190571
>> stor104: Swap:             0           0           0
>> stor105:               total        used        free      shared  buff/cache   available
>> stor105: Mem:         193248        2286      190554           9         407      190595
>> stor105: Swap:             0           0           0
>>
>> [root@highlander ~]# systemctl start glusterd
>> [root@highlander ~]# gluster volume start live
>> volume start: live: success
>> [root@highlander ~]# gluster volume status
>> Status of volume: live
>> Gluster process                             TCP Port  RDMA Port  Online  Pid
>> ------------------------------------------------------------------------------
>> Brick stor104:/zfs/brick0/brick             49164     0          Y       5945
>> Brick stor104:/zfs/brick1/brick             49165     0          Y       5963
>> Brick stor104:/zfs/brick2/brick             49166     0          Y       5981
>> Brick stor106:/zfs/brick0/brick             49158     0          Y       5256
>> Brick stor106:/zfs/brick1/brick             49159     0          Y       5274
>> Brick stor106:/zfs/brick2/brick             49160     0          Y       5292
>> Brick stor105:/zfs/brick0/brick             49155     0          Y       5284
>> Brick stor105:/zfs/brick1/brick             49156     0          Y       5302
>> Brick stor105:/zfs/brick2/brick             49157     0          Y       5320
>> NFS Server on localhost                     N/A       N/A        N       N/A
>> NFS Server on 192.168.123.106               N/A       N/A        N       N/A
>> NFS Server on stor105                       N/A       N/A        N       N/A
>> NFS Server on 192.168.123.104               N/A       N/A        N       N/A
>>
>> Task Status of Volume live
>> ------------------------------------------------------------------------------
>> There are no active volume tasks
>>
>> [root@highlander ~]#
>>
>> # Memory usage of each node after 5 minutes
>>
>> Output of top:
>>
>> pdsh -g live 'top -n 1 -b' | fpaste
>>
>> http://paste.fedoraproject.org/256710/14399886/
>>
>> [root@highlander ~]# pdsh -g live 'free -m'
>> stor106:               total        used        free      shared  buff/cache   available
>> stor106: Mem:         193249        6877      184154           9        2218      184250
>> stor106: Swap:             0           0           0
>> stor105:               total        used        free      shared  buff/cache   available
>> stor105: Mem:         193248       22126      169351           9        1771      169403
>> stor105: Swap:             0           0           0
>> stor104:               total        used        free      shared  buff/cache   available
>> stor104: Mem:         193249        2708      188638           9        1902      188687
>> stor104: Swap:             0           0           0
>>
>>
>> # Memory usage of each node after 45 minutes
>>
>> [root@highlander ~]# pdsh -g live 'free -m'
>> stor104:               total        used        free      shared  buff/cache   available
>> stor104: Mem:         193249        3131      184168           9        5949      184524
>> stor104: Swap:             0           0           0
>> stor106:               total        used        free      shared  buff/cache   available
>> stor106: Mem:         193249       27919      158176           9        7153      158894
>> stor106: Swap:             0           0           0
>> stor105:               total        used        free      shared  buff/cache   available
>> stor105: Mem:         193248      117096       70621           9        5530       70891
>> stor105: Swap:             0           0           0
>>
>> http://paste.fedoraproject.org/256726/43999119
>>
>> # Memory usage of each node after 90 minutes
>>
>> [root@highlander ~]# pdsh -g live 'free -m'
>> stor104:               total        used        free      shared  buff/cache   available
>> stor104: Mem:         193249        3390      181034           9        8825      181661
>> stor104: Swap:             0           0           0
>> stor106:               total        used        free      shared  buff/cache   available
>> stor106: Mem:         193249       45780      136424           9       11044      137759
>> stor106: Swap:             0           0           0
>> stor105:               total        used        free      shared  buff/cache   available
>> stor105: Mem:         193248      151483       33492           9        8272       33972
>> stor105: Swap:             0           0           0
>>
>> http://paste.fedoraproject.org/256745/14399937
>>
>> # Memory usage after 5 hours
>>
>> ```bash
>> [root@highlander ~]# pdsh -g live 'free -m'
>> stor104:               total        used        free      shared  buff/cache   available
>> stor104: Mem:         193249        4645      163186           9       25417      165473
>> stor104: Swap:             0           0           0
>> stor105:               total        used        free      shared  buff/cache   available
>> stor105: Mem:         193248      155094       14784           9       23369       16640
>> stor105: Swap:             0           0           0
>> stor106:               total        used        free      shared  buff/cache   available
>> stor106: Mem:         193249      141379       16515           9       35355       23714
>> stor106: Swap:             0           0           0
>> ```
>>
>> http://paste.fedoraproject.org/256879/44001235
>>
>> # Memory usage after 6 hours
>>
>> ```bash
>> [root@highlander ~]# pdsh -g live 'free -m'
>> stor106:               total        used        free      shared  buff/cache   available
>> stor106: Mem:         193249      140526       12207           9       40516       21612
>> stor106: Swap:             0           0           0
>> stor104:               total        used        free      shared  buff/cache   available
>> stor104: Mem:         193249      102923       58748           9       31578       63632
>> stor104: Swap:             0           0           0
>> stor105:               total        used        free      shared  buff/cache   available
>> stor105: Mem:         193248      155394       10876           9       26977       13154
>> stor105: Swap:             0           0           0
>> ```
>>
>> http://paste.fedoraproject.org/256905/00168781
>>
>> # Memory after 24 hours + Failed
>>
>> ```bash
>> [root@highlander ~]# pdsh -g live 'free -m'
>> stor105:               total        used        free      shared  buff/cache   available
>> stor105: Mem:         193248      136123        6323           9       50801       10281
>> stor105: Swap:             0           0           0
>> stor104:               total        used        free      shared  buff/cache   available
>> stor104: Mem:         193249      125320        2812           9       65116       17337
>> stor104: Swap:             0           0           0
>> stor106:               total        used        free      shared  buff/cache   available
>> stor106: Mem:         193249      111997       13969           9       67282       19429
>> stor106: Swap:             0           0           0
>> [root@highlander ~]#
>> ```
>>
>> http://paste.fedoraproject.org/257254/14400880
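When comparing these snapshots, note that `buff/cache` is memory the kernel can largely reclaim, so the `available` column is usually the better indicator of memory pressure than `free`. A small sketch that extracts it per node from `pdsh ... 'free -m'` output follows; the function names are hypothetical, and the sample is the 24-hour snapshot above.

```python
def parse_pdsh_free(text):
    """Map node name -> dict of the Mem: columns (values in MiB, as printed by free -m)."""
    columns = ("total", "used", "free", "shared", "buff_cache", "available")
    nodes = {}
    for line in text.splitlines():
        parts = line.split()
        # Expected shape: "<node>:", "Mem:", then six integers.
        if len(parts) == 8 and parts[1] == "Mem:":
            node = parts[0].rstrip(":")
            nodes[node] = dict(zip(columns, map(int, parts[2:])))
    return nodes


def available_percent(mem):
    """Share of total memory that free reports as available, in percent."""
    return 100.0 * mem["available"] / mem["total"]


# The snapshot taken after 24 hours, copied from above.
SAMPLE_FREE = """\
stor105:               total        used        free      shared  buff/cache   available
stor105: Mem:         193248      136123        6323           9       50801       10281
stor105: Swap:             0           0           0
stor104:               total        used        free      shared  buff/cache   available
stor104: Mem:         193249      125320        2812           9       65116       17337
stor104: Swap:             0           0           0
stor106:               total        used        free      shared  buff/cache   available
stor106: Mem:         193249      111997       13969           9       67282       19429
stor106: Swap:             0           0           0
"""

if __name__ == "__main__":
    for node, mem in sorted(parse_pdsh_free(SAMPLE_FREE).items()):
        print(f"{node}: {available_percent(mem):.1f}% available")
```

Tracking that percentage across the snapshots gives a single number per node to compare over the run.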
>>
>> # Failed logs
>>
>> ```bash
>> [root@highlander ~]# gluster volume rebalance live status
>>                                    Node Rebalanced-files          size       scanned      failures       skipped               status   run time in secs
>>                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
>>                         192.168.123.104           748812         4.4TB       4160456          1311        156772               failed           63114.00
>>                         192.168.123.106          1187917         3.3TB       6021931         21625       1209503               failed           75243.00
>>                                 stor105                0        0Bytes       2440431            16           196               failed           69658.00
>> volume rebalance: live: success:
>> ```
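To total the failures and skipped files across nodes, the status table can be parsed with a few lines of Python. This is a sketch that assumes the column layout shown above; the function name and dictionary keys are made up for illustration.

```python
def parse_rebalance_status(text):
    """Parse the rows of `gluster volume rebalance <vol> status` into dicts."""
    rows = []
    for line in text.splitlines():
        tokens = line.split()
        # Data rows have at least 8 tokens and a numeric file counter in column 2;
        # this skips the header, the dashed rule and the trailing status line.
        if len(tokens) < 8 or not tokens[1].isdigit():
            continue
        rows.append({
            "node": tokens[0],
            "rebalanced_files": int(tokens[1]),
            "size": tokens[2],
            "scanned": int(tokens[3]),
            "failures": int(tokens[4]),
            "skipped": int(tokens[5]),
            # The status may be more than one word, so take everything
            # between the counters and the run time.
            "status": " ".join(tokens[6:-1]),
            "run_time_secs": float(tokens[-1]),
        })
    return rows


# The status output above, with the leading indentation removed.
SAMPLE_STATUS = """\
Node Rebalanced-files          size       scanned      failures       skipped               status   run time in secs
---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
192.168.123.104           748812         4.4TB       4160456          1311        156772               failed           63114.00
192.168.123.106          1187917         3.3TB       6021931         21625       1209503               failed           75243.00
stor105                0        0Bytes       2440431            16           196               failed           69658.00
volume rebalance: live: success:
"""

if __name__ == "__main__":
    rows = parse_rebalance_status(SAMPLE_STATUS)
    print("failures:", sum(r["failures"] for r in rows))
    print("skipped:", sum(r["skipped"] for r in rows))
```

For the table above, this adds up to 22952 failures and 1366471 skipped files over the three nodes.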
>>
>>
>>
>>
>>
>>> On 19 Aug 2015, at 08:14, Susant Palai <spalai@xxxxxxxxxx> wrote:
>>>
>>> Comments inline.
>>>
>>> ----- Original Message -----
>>>> From: "Christophe TREFOIS" <christophe.trefois@xxxxxx>
>>>> To: "Susant Palai" <spalai@xxxxxxxxxx>
>>>> Cc: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>, "Nithya Balachandran" <nbalacha@xxxxxxxxxx>, "Shyamsundar
>>>> Ranganathan" <srangana@xxxxxxxxxx>, "Mohammed Rafi K C" <rkavunga@xxxxxxxxxx>, "Gluster Devel"
>>>> <gluster-devel@xxxxxxxxxxx>
>>>> Sent: Tuesday, August 18, 2015 8:08:41 PM
>>>> Subject: Re:  Skipped files during rebalance
>>>>
>>>> Hi Susant,
>>>>
>>>> Thank you for the response.
>>>>
>>>>> On 18 Aug 2015, at 10:45, Susant Palai <spalai@xxxxxxxxxx> wrote:
>>>>>
>>>>> Hi Christophe,
>>>>>
>>>>> Need some info regarding the high mem-usage.
>>>>>
>>>>> 1. Top output: To see whether any other process is eating up memory.
>>> I would be interested to know the memory usage of all the gluster processes involved in the high mem-usage. These processes include glusterfsd, glusterd, gluster, any mount process (glusterfs), and rebalance (glusterfs).
>>>
>>>
>>>>> 2. Gluster volume info
>>>> [root@highlander ~]# gluster volume info
>>>>
>>>> Volume Name: live
>>>> Type: Distribute
>>>> Volume ID: 1328637d-7730-4627-8945-bbe43626d527
>>>> Status: Started
>>>> Number of Bricks: 9
>>>> Transport-type: tcp
>>>> Bricks:
>>>> Brick1: stor104:/zfs/brick0/brick
>>>> Brick2: stor104:/zfs/brick1/brick
>>>> Brick3: stor104:/zfs/brick2/brick
>>>> Brick4: stor106:/zfs/brick0/brick
>>>> Brick5: stor106:/zfs/brick1/brick
>>>> Brick6: stor106:/zfs/brick2/brick
>>>> Brick7: stor105:/zfs/brick0/brick
>>>> Brick8: stor105:/zfs/brick1/brick
>>>> Brick9: stor105:/zfs/brick2/brick
>>>> Options Reconfigured:
>>>> diagnostics.count-fop-hits: on
>>>> diagnostics.latency-measurement: on
>>>> server.allow-insecure: on
>>>> cluster.min-free-disk: 1%
>>>> diagnostics.brick-log-level: ERROR
>>>> diagnostics.client-log-level: ERROR
>>>> cluster.data-self-heal-algorithm: full
>>>> performance.cache-max-file-size: 4MB
>>>> performance.cache-refresh-timeout: 60
>>>> performance.cache-size: 1GB
>>>> performance.client-io-threads: on
>>>> performance.io-thread-count: 32
>>>> performance.write-behind-window-size: 4MB
>>>>
>>>>> 3. Is the rebalance process still running? If yes, can you point to the
>>>>> specific memory usage of the rebalance process? Was the high mem-usage
>>>>> seen during the rebalance, or also after it?
>>>> I would like to restart the rebalance process since it failed… But I can’t as
>>>> the volume cannot be stopped (I wanted to reboot the servers to have a clean
>>>> testing grounds).
>>>>
>>>> Here are the logs from the three nodes:
>>>> http://paste.fedoraproject.org/256183/43989079
>>>>
>>>> Maybe you could help me figure out how to stop the volume?
>>>>
>>>> This is what happens
>>>>
>>>> [root@highlander ~]# gluster volume rebalance live stop
>>>> volume rebalance: live: failed: Rebalance not started.
>>> Requesting glusterd team to give input. 
>>>> [root@highlander ~]# ssh stor105 "gluster volume rebalance live stop"
>>>> volume rebalance: live: failed: Rebalance not started.
>>>>
>>>> [root@highlander ~]# ssh stor104 "gluster volume rebalance live stop"
>>>> volume rebalance: live: failed: Rebalance not started.
>>>>
>>>> [root@highlander ~]# ssh stor106 "gluster volume rebalance live stop"
>>>> volume rebalance: live: failed: Rebalance not started.
>>>>
>>>> [root@highlander ~]# gluster volume rebalance live stop
>>>> volume rebalance: live: failed: Rebalance not started.
>>>>
>>>> [root@highlander ~]# gluster volume stop live
>>>> Stopping volume will make its data inaccessible. Do you want to continue?
>>>> (y/n) y
>>>> volume stop: live: failed: Staging failed on stor106. Error: rebalance
>>>> session is in progress for the volume 'live'
>>>> Staging failed on stor104. Error: rebalance session is in progress for the
>>>> volume 'live'
>>> Can you run [ps aux | grep "rebalance"] on all the servers and post the output here? I just want to check whether the rebalance is really running or not. Again, requesting the glusterd team to give inputs.
>>>
>>>>
>>>>> 4. Gluster version
>>>> [root@highlander ~]# pdsh -g live 'rpm -qa | grep gluster'
>>>> stor104: glusterfs-api-3.7.3-1.el7.x86_64
>>>> stor104: glusterfs-server-3.7.3-1.el7.x86_64
>>>> stor104: glusterfs-libs-3.7.3-1.el7.x86_64
>>>> stor104: glusterfs-3.7.3-1.el7.x86_64
>>>> stor104: glusterfs-fuse-3.7.3-1.el7.x86_64
>>>> stor104: glusterfs-client-xlators-3.7.3-1.el7.x86_64
>>>> stor104: glusterfs-cli-3.7.3-1.el7.x86_64
>>>>
>>>> stor105: glusterfs-3.7.3-1.el7.x86_64
>>>> stor105: glusterfs-client-xlators-3.7.3-1.el7.x86_64
>>>> stor105: glusterfs-api-3.7.3-1.el7.x86_64
>>>> stor105: glusterfs-cli-3.7.3-1.el7.x86_64
>>>> stor105: glusterfs-server-3.7.3-1.el7.x86_64
>>>> stor105: glusterfs-libs-3.7.3-1.el7.x86_64
>>>> stor105: glusterfs-fuse-3.7.3-1.el7.x86_64
>>>>
>>>> stor106: glusterfs-libs-3.7.3-1.el7.x86_64
>>>> stor106: glusterfs-fuse-3.7.3-1.el7.x86_64
>>>> stor106: glusterfs-client-xlators-3.7.3-1.el7.x86_64
>>>> stor106: glusterfs-api-3.7.3-1.el7.x86_64
>>>> stor106: glusterfs-cli-3.7.3-1.el7.x86_64
>>>> stor106: glusterfs-server-3.7.3-1.el7.x86_64
>>>> stor106: glusterfs-3.7.3-1.el7.x86_64
>>>>
>>>>> Will ask for more information in case needed.
>>>>>
>>>>> Regards,
>>>>> Susant
>>>>>
>>>>>
>>>>> ----- Original Message -----
>>>>>> From: "Christophe TREFOIS" <christophe.trefois@xxxxxx>
>>>>>> To: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>, "Nithya Balachandran"
>>>>>> <nbalacha@xxxxxxxxxx>, "Susant Palai"
>>>>>> <spalai@xxxxxxxxxx>, "Shyamsundar Ranganathan" <srangana@xxxxxxxxxx>
>>>>>> Cc: "Mohammed Rafi K C" <rkavunga@xxxxxxxxxx>
>>>>>> Sent: Monday, 17 August, 2015 7:03:20 PM
>>>>>> Subject: Fwd:  Skipped files during rebalance
>>>>>>
>>>>>> Hi DHT team,
>>>>>>
>>>>>> This email somehow didn’t get forwarded to you.
>>>>>>
>>>>>> In addition to my problem described below, here is one example of free
>>>>>> memory after everything failed:
>>>>>>
>>>>>> [root@highlander ~]# pdsh -g live 'free -m'
>>>>>> stor106:               total        used        free      shared  buff/cache   available
>>>>>> stor106: Mem:         193249      124784        1347           9       67118       12769
>>>>>> stor106: Swap:             0           0           0
>>>>>> stor104:               total        used        free      shared  buff/cache   available
>>>>>> stor104: Mem:         193249      107617       31323           9       54308       42752
>>>>>> stor104: Swap:             0           0           0
>>>>>> stor105:               total        used        free      shared  buff/cache   available
>>>>>> stor105: Mem:         193248      141804        6736           9       44707        9713
>>>>>> stor105: Swap:             0           0           0
>>>>>>
>>>>>> So after the failed operation, there’s almost no memory free, and it is
>>>>>> also not freed up.
>>>>>>
>>>>>> Thank you for pointing me to any directions,
>>>>>>
>>>>>> Kind regards,
>>>>>>
>>>>>> —
>>>>>> Christophe
>>>>>>
>>>>>>
>>>>>> Begin forwarded message:
>>>>>>
>>>>>> From: Christophe TREFOIS <christophe.trefois@xxxxxx>
>>>>>> Subject: Re:  Skipped files during rebalance
>>>>>> Date: 17 Aug 2015 11:54:32 CEST
>>>>>> To: Mohammed Rafi K C <rkavunga@xxxxxxxxxx>
>>>>>> Cc: "gluster-devel@xxxxxxxxxxx" <gluster-devel@xxxxxxxxxxx>
>>>>>>
>>>>>> Dear Rafi,
>>>>>>
>>>>>> Thanks for submitting a patch.
>>>>>>
>>>>>> @DHT, I have two additional questions / problems.
>>>>>>
>>>>>> 1. When doing a rebalance (with data), RAM consumption on the nodes goes
>>>>>> dramatically high, e.g. out of 196 GB available per node, RAM usage would
>>>>>> fill up to 195.6 GB. This seems quite excessive and strange to me.
>>>>>>
>>>>>> 2. As you can see, the rebalance (with data) failed because one endpoint
>>>>>> became disconnected (even though it is still connected). I’m thinking this
>>>>>> could be due to the high RAM usage?
>>>>>>
>>>>>> Thank you for your help,
>>>>>>
>>>>>> —
>>>>>> Christophe
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 17 Aug 2015, at 11:27, Mohammed Rafi K C <rkavunga@xxxxxxxxxx> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 08/17/2015 01:58 AM, Christophe TREFOIS wrote:
>>>>>> Dear all,
>>>>>>
>>>>>> I have successfully added a new node to our setup, and finally managed to
>>>>>> get a successful fix-layout run as well, with no errors.
>>>>>>
>>>>>> Now, as per the documentation, I started a "gluster volume rebalance live
>>>>>> start" task and I see many skipped files.
>>>>>> The error log then contains entries like the following for each skipped file.
>>>>>>
>>>>>> [2015-08-16 20:23:30.591161] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/004010008.flex lookup failed
>>>>>> [2015-08-16 20:23:30.768391] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/007005003.flex lookup failed
>>>>>> [2015-08-16 20:23:30.804811] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/006005009.flex lookup failed
>>>>>> [2015-08-16 20:23:30.805201] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/005006011.flex lookup failed
>>>>>> [2015-08-16 20:23:30.880037] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/005009012.flex lookup failed
>>>>>> [2015-08-16 20:23:31.038236] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/003008007.flex lookup failed
>>>>>> [2015-08-16 20:23:31.259762] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/004008006.flex lookup failed
>>>>>> [2015-08-16 20:23:31.333764] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/007008001.flex lookup failed
>>>>>> [2015-08-16 20:23:31.340190] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/006007004.flex lookup failed
>>>>>>
>>>>>> Update: one of the rebalance tasks now failed.
>>>>>>
>>>>>> @Rafi, I got the same error as Friday except this time with data.
>>>>>>
>>>>>> Packets carrying the ping request could be waiting in the queue during
>>>>>> the whole time-out period because of heavy traffic on the network. I
>>>>>> have sent a patch for this. You can track the status here:
>>>>>> http://review.gluster.org/11935
>>>>>>
>>>>>>
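The 42-second window can be checked against the timestamps in the log entries that follow: the unwound INODELK frame carries a "called at" time, and the ping-timer message carries the disconnect time. A minimal sketch of that arithmetic, with both timestamps copied from those entries and the variable names chosen for illustration:

```python
from datetime import datetime

# Timestamp layout used in the log lines.
FMT = "%Y-%m-%d %H:%M:%S.%f"


def seconds_between(start, end):
    """Elapsed seconds between two log timestamps."""
    return (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds()


frame_sent = "2015-08-16 20:23:51.305640"    # "called at" time of the INODELK frame
disconnected = "2015-08-16 20:24:34.533167"  # ping timer expiry on 0-live-client-0
elapsed = seconds_between(frame_sent, disconnected)

if __name__ == "__main__":
    print(f"{elapsed:.3f} s between the request and the disconnect")
```

The request had been outstanding for a little over 43 seconds when the client disconnected, which is consistent with the frames sitting unanswered for the whole ping time-out rather than failing immediately.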
>>>>>>
>>>>>> [2015-08-16 20:24:34.533167] C
>>>>>> [rpc-clnt-ping.c:161:rpc_clnt_ping_timer_expired] 0-live-client-0: server
>>>>>> 192.168.123.104:49164 has not responded in the last 42 seconds,
>>>>>> disconnecting.
>>>>>> [2015-08-16 20:24:34.533614] E [rpc-clnt.c:362:saved_frames_unwind] (-->
>>>>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (-->
>>>>>> /lib64/libgfrpc.so.0(saved_frames_unwin
>>>>>> d+0x1de)[0x7fa454bb09be] (-->
>>>>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (-->
>>>>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c]
>>>>>> (-->
>>>>>> /lib64/li
>>>>>> bgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) 0-live-client-0:
>>>>>> forced unwinding frame type(GlusterFS 3.3) op(INODELK(29)) called at
>>>>>> 2015-08-16 20:23:51.305640 (xid=0x5dd4da)
>>>>>> [2015-08-16 20:24:34.533672] E [MSGID: 114031]
>>>>>> [client-rpc-fops.c:1621:client3_3_inodelk_cbk] 0-live-client-0: remote
>>>>>> operation failed [Transport endpoint is not connected]
>>>>>> [2015-08-16 20:24:34.534201] E [rpc-clnt.c:362:saved_frames_unwind] (-->
>>>>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (-->
>>>>>> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (-->
>>>>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (-->
>>>>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c]
>>>>>> (-->
>>>>>> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) 0-live-client-0:
>>>>>> forced unwinding frame type(GlusterFS 3.3) op(READ(12)) called at
>>>>>> 2015-08-16
>>>>>> 20:23:51.303938 (xid=0x5dd4d7)
>>>>>> [2015-08-16 20:24:34.534347] E [MSGID: 109023]
>>>>>> [dht-rebalance.c:1124:dht_migrate_file] 0-live-dht: Migrate file failed:
>>>>>> /hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/007008007.flex: failed to migrate data
>>>>>> [2015-08-16 20:24:34.534413] E [rpc-clnt.c:362:saved_frames_unwind] (-->
>>>>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (-->
>>>>>> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (-->
>>>>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (-->
>>>>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c]
>>>>>> (-->
>>>>>> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] )))))
>>>>>> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(READ(12))
>>>>>> called at 2015-08-16 20:23:51.303969 (xid=0x5dd4d8)
>>>>>> [2015-08-16 20:24:34.534579] E [MSGID: 109023]
>>>>>> [dht-rebalance.c:1124:dht_migrate_file] 0-live-dht: Migrate file failed:
>>>>>> /hcs/hcs/OperaArchiveCol/SK
>>>>>> 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/007009012.flex:
>>>>>> failed to migrate data
>>>>>> [2015-08-16 20:24:34.534676] E [rpc-clnt.c:362:saved_frames_unwind] (-->
>>>>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (-->
>>>>>> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (-->
>>>>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (-->
>>>>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c]
>>>>>> (-->
>>>>>> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] )))))
>>>>>> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(READ(12))
>>>>>> called at 2015-08-16 20:23:51.313548 (xid=0x5dd4db)
>>>>>> [2015-08-16 20:24:34.534745] E [MSGID: 109023]
>>>>>> [dht-rebalance.c:1124:dht_migrate_file] 0-live-dht: Migrate file failed:
>>>>>> /hcs/hcs/OperaArchiveCol/SK
>>>>>> 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/006008011.flex:
>>>>>> failed to migrate data
>>>>>> [2015-08-16 20:24:34.535199] E [rpc-clnt.c:362:saved_frames_unwind] (-->
>>>>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (-->
>>>>>> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (-->
>>>>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (-->
>>>>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c]
>>>>>> (-->
>>>>>> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] )))))
>>>>>> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(READ(12))
>>>>>> called at 2015-08-16 20:23:51.326369 (xid=0x5dd4dc)
>>>>>> [2015-08-16 20:24:34.535232] E [MSGID: 109023]
>>>>>> [dht-rebalance.c:1124:dht_migrate_file] 0-live-dht: Migrate file failed:
>>>>>> /hcs/hcs/OperaArchiveCol/SK
>>>>>> 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/005003001.flex:
>>>>>> failed to migrate data
>>>>>> [2015-08-16 20:24:34.535984] E [rpc-clnt.c:362:saved_frames_unwind] (-->
>>>>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (-->
>>>>>> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (-->
>>>>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (-->
>>>>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c]
>>>>>> (-->
>>>>>> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] )))))
>>>>>> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(READ(12))
>>>>>> called at 2015-08-16 20:23:51.326437 (xid=0x5dd4dd)
>>>>>> [2015-08-16 20:24:34.536069] E [MSGID: 109023]
>>>>>> [dht-rebalance.c:1124:dht_migrate_file] 0-live-dht: Migrate file failed:
>>>>>> /hcs/hcs/OperaArchiveCol/SK
>>>>>> 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/007010012.flex:
>>>>>> failed to migrate data
>>>>>> [2015-08-16 20:24:34.536267] E [rpc-clnt.c:362:saved_frames_unwind] (-->
>>>>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (-->
>>>>>> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (-->
>>>>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (-->
>>>>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c]
>>>>>> (-->
>>>>>> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] )))))
>>>>>> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27))
>>>>>> called at 2015-08-16 20:23:51.337240 (xid=0x5dd4de)
>>>>>> [2015-08-16 20:24:34.536339] E [MSGID: 109023]
>>>>>> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file
>>>>>> failed:/hcs/hcs/OperaArchiveCol/SK
>>>>>> 20131011_Oligo_Rot_lowConc_P1/Meas_08(2013-10-11_20-12-25)/002005012.flex
>>>>>> lookup failed
>>>>>> [2015-08-16 20:24:34.536487] E [rpc-clnt.c:362:saved_frames_unwind] (-->
>>>>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (-->
>>>>>> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (-->
>>>>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (-->
>>>>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c]
>>>>>> (-->
>>>>>> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] )))))
>>>>>> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27))
>>>>>> called at 2015-08-16 20:23:51.425254 (xid=0x5dd4df)
>>>>>> [2015-08-16 20:24:34.536685] E [rpc-clnt.c:362:saved_frames_unwind] (-->
>>>>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (-->
>>>>>> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (-->
>>>>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (-->
>>>>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c]
>>>>>> (-->
>>>>>> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] )))))
>>>>>> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27))
>>>>>> called at 2015-08-16 20:23:51.738907 (xid=0x5dd4e0)
>>>>>> [2015-08-16 20:24:34.536891] E [rpc-clnt.c:362:saved_frames_unwind] (-->
>>>>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (-->
>>>>>> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (-->
>>>>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (-->
>>>>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c]
>>>>>> (-->
>>>>>> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] )))))
>>>>>> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27))
>>>>>> called at 2015-08-16 20:23:51.805096 (xid=0x5dd4e1)
>>>>>> [2015-08-16 20:24:34.537316] E [rpc-clnt.c:362:saved_frames_unwind] (-->
>>>>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (-->
>>>>>> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (-->
>>>>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (-->
>>>>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c]
>>>>>> (-->
>>>>>> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] )))))
>>>>>> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27))
>>>>>> called at 2015-08-16 20:23:51.805977 (xid=0x5dd4e2)
>>>>>> [2015-08-16 20:24:34.537735] E [rpc-clnt.c:362:saved_frames_unwind] (-->
>>>>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (-->
>>>>>> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (-->
>>>>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (-->
>>>>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c]
>>>>>> (-->
>>>>>> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] )))))
>>>>>> 0-live-client-0: forced unwinding frame type(GF-DUMP) op(NULL(2)) called
>>>>>> at
>>>>>> 2015-08-16 20:23:52.530107 (xid=0x5dd4e3)
>>>>>> [2015-08-16 20:24:34.538475] E [MSGID: 114031]
>>>>>> [client-rpc-fops.c:1621:client3_3_inodelk_cbk] 0-live-client-0: remote
>>>>>> operation failed [Transport endpoint is not connected]
>>>>>> The message "E [MSGID: 114031]
>>>>>> [client-rpc-fops.c:1621:client3_3_inodelk_cbk]
>>>>>> 0-live-client-0: remote operation failed [Transport endpoint is not
>>>>>> connected]" repeated 4 times between [2015-08-16 20:24:34.538475] and
>>>>>> [2015-08-16 20:24:34.538535]
>>>>>> [2015-08-16 20:24:34.538584] E [MSGID: 109023]
>>>>>> [dht-rebalance.c:1617:gf_defrag_migrate_single_file] 0-live-dht: Migrate
>>>>>> file failed: 002004003.flex lookup failed
>>>>>> [2015-08-16 20:24:34.538904] E [MSGID: 109023]
>>>>>> [dht-rebalance.c:1617:gf_defrag_migrate_single_file] 0-live-dht: Migrate
>>>>>> file failed: 003009008.flex lookup failed
>>>>>> [2015-08-16 20:24:34.539724] E [MSGID: 109023]
>>>>>> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file
>>>>>> failed:/hcs/hcs/OperaArchiveCol/SK
>>>>>> 20131011_Oligo_Rot_lowConc_P1/Meas_08(2013-10-11_20-12-25)/005009006.flex
>>>>>> lookup failed
>>>>>> [2015-08-16 20:24:34.539820] E [MSGID: 109016]
>>>>>> [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed
>>>>>> for /hcs/hcs/OperaArchiveCol/SK
>>>>>> 20131011_Oligo_Rot_lowConc_P1/Meas_08(2013-10-11_20-12-25)
>>>>>> [2015-08-16 20:24:34.540031] E [MSGID: 109016]
>>>>>> [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed
>>>>>> for /hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1
>>>>>> [2015-08-16 20:24:34.540691] E [MSGID: 114031]
>>>>>> [client-rpc-fops.c:251:client3_3_mknod_cbk] 0-live-client-0: remote
>>>>>> operation failed. Path: /hcs/hcs/OperaArchiveCol/SK
>>>>>> 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/002005008.flex
>>>>>> [Transport endpoint is not connected]
>>>>>> [2015-08-16 20:24:34.541152] E [MSGID: 114031]
>>>>>> [client-rpc-fops.c:251:client3_3_mknod_cbk] 0-live-client-0: remote
>>>>>> operation failed. Path: /hcs/hcs/OperaArchiveCol/SK
>>>>>> 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/005004009.flex
>>>>>> [Transport endpoint is not connected]
>>>>>> [2015-08-16 20:24:34.541331] E [MSGID: 114031]
>>>>>> [client-rpc-fops.c:251:client3_3_mknod_cbk] 0-live-client-0: remote
>>>>>> operation failed. Path: /hcs/hcs/OperaArchiveCol/SK
>>>>>> 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/007005011.flex
>>>>>> [Transport endpoint is not connected]
>>>>>> [2015-08-16 20:24:34.541486] E [MSGID: 109016]
>>>>>> [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed
>>>>>> for /hcs/hcs/OperaArchiveCol
>>>>>> [2015-08-16 20:24:34.541572] E [MSGID: 109016]
>>>>>> [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed
>>>>>> for /hcs/hcs
>>>>>> [2015-08-16 20:24:34.541639] E [MSGID: 109016]
>>>>>> [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed
>>>>>> for /hcs
>>>>>>
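
The unwound frames in the log above can be tallied with a short script. This is a minimal sketch, not a Gluster tool: it assumes each log record has first been re-joined onto a single line (the quoted copy above is hard-wrapped), and the regular expression and the function name are my own.

```python
import re
from collections import Counter
from datetime import datetime

# Captures the leading "[YYYY-MM-DD HH:MM:SS.ffffff]" log timestamp and the
# "op(NAME(n)) called at <timestamp>" part of a saved_frames_unwind record.
UNWIND_RE = re.compile(
    r"^\[(?P<logged>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6})\].*?"
    r"op\((?P<op>\w+)\(\d+\)\) called at "
    r"(?P<called>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6})"
)
TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def summarize_unwinds(lines):
    """Return (count per op, longest wait in seconds per op) for unwound frames."""
    counts = Counter()
    longest = {}
    for line in lines:
        m = UNWIND_RE.search(line)
        if not m:
            continue  # not a "forced unwinding frame" record
        logged = datetime.strptime(m.group("logged"), TS_FORMAT)
        called = datetime.strptime(m.group("called"), TS_FORMAT)
        waited = (logged - called).total_seconds()
        op = m.group("op")
        counts[op] += 1
        longest[op] = max(longest.get(op, 0.0), waited)
    return counts, longest


if __name__ == "__main__":
    # One record from the log above, re-joined; the backtrace is abbreviated as "(...)".
    sample = [
        "[2015-08-16 20:24:34.533614] E [rpc-clnt.c:362:saved_frames_unwind] (...) "
        "0-live-client-0: forced unwinding frame type(GlusterFS 3.3) "
        "op(INODELK(29)) called at 2015-08-16 20:23:51.305640 (xid=0x5dd4da)",
    ]
    counts, longest = summarize_unwinds(sample)
    for op, n in counts.items():
        print(f"{op}: {n} frame(s), longest wait {longest[op]:.3f} s")
```

For the INODELK record this gives a wait of about 43.2 seconds, just past the 42-second ping timeout. That fits the frames being unwound because of the disconnect rather than failing individually.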
>>>>>> Any help would be greatly appreciated.
>>>>>> CCing the DHT team to give you a better idea of why the rebalance failed and
>>>>>> of the huge memory consumption by the rebalance process (200 GB RAM).
>>>>>>
>>>>>> Regards
>>>>>> Rafi KC
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> --
>>>>>> Christophe
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Gluster-devel mailing list
>>>>>> Gluster-devel@xxxxxxxxxxx<mailto:Gluster-devel@xxxxxxxxxxx>
>>>>>> http://www.gluster.org/mailman/listinfo/gluster-devel