Dear Susant,

The rebalance failed again and also showed (in my opinion) excessive RAM usage. Please find a detailed list below.

All logs: http://wikisend.com/download/651948/allstores.tar.gz

Any advice on how I could complete the rebalance successfully would be much appreciated. The fedora pastes are the output of top on each node at that time (more or less).

Please let me know if you need more information,

Best,

—— Start of mem info

# After reboot, before starting glusterd

```bash
[root@highlander ~]# pdsh -g live 'free -m'
stor106:              total        used        free      shared  buff/cache   available
stor106: Mem:        193249        2208      190825           9         215      190772
stor106: Swap:            0           0           0
stor105:              total        used        free      shared  buff/cache   available
stor105: Mem:        193248        2275      190738           9         234      190681
stor105: Swap:            0           0           0
stor104:              total        used        free      shared  buff/cache   available
stor104: Mem:        193249        2221      190811           9         216      190757
stor104: Swap:            0           0           0
[root@highlander ~]#
```

# Gluster Info

```bash
[root@stor106 glusterfs]# gluster volume info

Volume Name: live
Type: Distribute
Volume ID: 1328637d-7730-4627-8945-bbe43626d527
Status: Started
Number of Bricks: 9
Transport-type: tcp
Bricks:
Brick1: stor104:/zfs/brick0/brick
Brick2: stor104:/zfs/brick1/brick
Brick3: stor104:/zfs/brick2/brick
Brick4: stor106:/zfs/brick0/brick
Brick5: stor106:/zfs/brick1/brick
Brick6: stor106:/zfs/brick2/brick
Brick7: stor105:/zfs/brick0/brick
Brick8: stor105:/zfs/brick1/brick
Brick9: stor105:/zfs/brick2/brick
Options Reconfigured:
nfs.disable: true
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
performance.write-behind-window-size: 4MB
performance.io-thread-count: 32
performance.client-io-threads: on
performance.cache-size: 1GB
performance.cache-refresh-timeout: 60
performance.cache-max-file-size: 4MB
cluster.data-self-heal-algorithm: full
diagnostics.client-log-level: ERROR
diagnostics.brick-log-level: ERROR
cluster.min-free-disk: 1%
server.allow-insecure: on
```

# Starting glusterd

```bash
[root@highlander ~]# pdsh -g live 'systemctl start glusterd'
[root@highlander ~]# pdsh -g live 'free -m'
stor106:              total        used        free      shared  buff/cache   available
stor106: Mem:        193249        2290      190569           9         389      190587
stor106: Swap:            0           0           0
stor104:              total        used        free      shared  buff/cache   available
stor104: Mem:        193249        2297      190557           9         394      190571
stor104: Swap:            0           0           0
stor105:              total        used        free      shared  buff/cache   available
stor105: Mem:        193248        2286      190554           9         407      190595
stor105: Swap:            0           0           0
[root@highlander ~]# systemctl start glusterd
[root@highlander ~]# gluster volume start live
volume start: live: success
[root@highlander ~]# gluster volume status
Status of volume: live
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick stor104:/zfs/brick0/brick             49164     0          Y       5945
Brick stor104:/zfs/brick1/brick             49165     0          Y       5963
Brick stor104:/zfs/brick2/brick             49166     0          Y       5981
Brick stor106:/zfs/brick0/brick             49158     0          Y       5256
Brick stor106:/zfs/brick1/brick             49159     0          Y       5274
Brick stor106:/zfs/brick2/brick             49160     0          Y       5292
Brick stor105:/zfs/brick0/brick             49155     0          Y       5284
Brick stor105:/zfs/brick1/brick             49156     0          Y       5302
Brick stor105:/zfs/brick2/brick             49157     0          Y       5320
NFS Server on localhost                     N/A       N/A        N       N/A
NFS Server on 192.168.123.106               N/A       N/A        N       N/A
NFS Server on stor105                       N/A       N/A        N       N/A
NFS Server on 192.168.123.104               N/A       N/A        N       N/A

Task Status of Volume live
------------------------------------------------------------------------------
There are no active volume tasks

[root@highlander ~]#
```
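In case it helps to pin the growth below on specific daemons (you asked for the per-process usage of glusterd, glusterfsd and the glusterfs mount/rebalance processes), a per-process snapshot could be captured next to each `free -m`. A minimal sketch, assuming GNU ps on the storage nodes and the same pdsh "live" group used above:

```bash
# Sketch: resident/virtual memory of the gluster daemons on every node, largest first.
# Assumes the pdsh group "live" defined above and GNU ps on the storage nodes.
# -C matches glusterd, the brick processes (glusterfsd) and any glusterfs
# client or rebalance processes by command name.
pdsh -g live 'ps -C glusterd,glusterfsd,glusterfs -o pid,rss,vsz,etime,comm --sort=-rss'
```

Running that at the same intervals as the snapshots below would show whether the usage sits in the rebalance glusterfs process or in the brick glusterfsd processes.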
# Memory usage of each node after 5 minutes

Output of top: pdsh -g live 'top -n 1 -b' | fpaste
http://paste.fedoraproject.org/256710/14399886/

```bash
[root@highlander ~]# pdsh -g live 'free -m'
stor106:              total        used        free      shared  buff/cache   available
stor106: Mem:        193249        6877      184154           9        2218      184250
stor106: Swap:            0           0           0
stor105:              total        used        free      shared  buff/cache   available
stor105: Mem:        193248       22126      169351           9        1771      169403
stor105: Swap:            0           0           0
stor104:              total        used        free      shared  buff/cache   available
stor104: Mem:        193249        2708      188638           9        1902      188687
stor104: Swap:            0           0           0
```

# Memory usage of each node after 45 minutes

```bash
[root@highlander ~]# pdsh -g live 'free -m'
stor104:              total        used        free      shared  buff/cache   available
stor104: Mem:        193249        3131      184168           9        5949      184524
stor104: Swap:            0           0           0
stor106:              total        used        free      shared  buff/cache   available
stor106: Mem:        193249       27919      158176           9        7153      158894
stor106: Swap:            0           0           0
stor105:              total        used        free      shared  buff/cache   available
stor105: Mem:        193248      117096       70621           9        5530       70891
stor105: Swap:            0           0           0
```

http://paste.fedoraproject.org/256726/43999119

# Memory usage of each node after 90 minutes

```bash
[root@highlander ~]# pdsh -g live 'free -m'
stor104:              total        used        free      shared  buff/cache   available
stor104: Mem:        193249        3390      181034           9        8825      181661
stor104: Swap:            0           0           0
stor106:              total        used        free      shared  buff/cache   available
stor106: Mem:        193249       45780      136424           9       11044      137759
stor106: Swap:            0           0           0
stor105:              total        used        free      shared  buff/cache   available
stor105: Mem:        193248      151483       33492           9        8272       33972
stor105: Swap:            0           0           0
```

http://paste.fedoraproject.org/256745/14399937

# Memory usage after 5 hours

```bash
[root@highlander ~]# pdsh -g live 'free -m'
stor104:              total        used        free      shared  buff/cache   available
stor104: Mem:        193249        4645      163186           9       25417      165473
stor104: Swap:            0           0           0
stor105:              total        used        free      shared  buff/cache   available
stor105: Mem:        193248      155094       14784           9       23369       16640
stor105: Swap:            0           0           0
stor106:              total        used        free      shared  buff/cache   available
stor106: Mem:        193249      141379       16515           9       35355       23714
stor106: Swap:            0           0           0
```

http://paste.fedoraproject.org/256879/44001235

# Memory usage after 6 hours

```bash
[root@highlander ~]# pdsh -g live 'free -m'
stor106:              total        used        free      shared  buff/cache   available
stor106: Mem:        193249      140526       12207           9       40516       21612
stor106: Swap:            0           0           0
stor104:              total        used        free      shared  buff/cache   available
stor104: Mem:        193249      102923       58748           9       31578       63632
stor104: Swap:            0           0           0
stor105:              total        used        free      shared  buff/cache   available
stor105: Mem:        193248      155394       10876           9       26977       13154
stor105: Swap:            0           0           0
```

http://paste.fedoraproject.org/256905/00168781

# Memory after 24 hours + Failed

```bash
[root@highlander ~]# pdsh -g live 'free -m'
stor105:              total        used        free      shared  buff/cache   available
stor105: Mem:        193248      136123        6323           9       50801       10281
stor105: Swap:            0           0           0
stor104:              total        used        free      shared  buff/cache   available
stor104: Mem:        193249      125320        2812           9       65116       17337
stor104: Swap:            0           0           0
stor106:              total        used        free      shared  buff/cache   available
stor106: Mem:        193249      111997       13969           9       67282       19429
stor106: Swap:            0           0           0
[root@highlander ~]#
```

http://paste.fedoraproject.org/257254/14400880

# Failed logs

```bash
[root@highlander ~]# gluster volume rebalance live status
           Node  Rebalanced-files       size     scanned   failures     skipped     status  run time in secs
---------------  ----------------  ---------  ----------  ---------  ----------  ---------  ----------------
192.168.123.104            748812      4.4TB     4160456       1311      156772     failed          63114.00
192.168.123.106           1187917      3.3TB     6021931      21625     1209503     failed          75243.00
        stor105                 0     0Bytes     2440431         16         196     failed          69658.00
volume rebalance: live: success:
```

Dr Christophe Trefois, Dipl.-Ing.
Technical Specialist / Post-Doc UNIVERSITÉ DU LUXEMBOURG LUXEMBOURG CENTRE FOR SYSTEMS BIOMEDICINE Campus Belval | House of Biomedicine 6, avenue du Swing L-4367 Belvaux T: +352 46 66 44 6124 F: +352 46 66 44 6949 http://www.uni.lu/lcsb ---- This message is confidential and may contain privileged information. It is intended for the named recipient only. If you receive it in error please notify me and permanently delete the original message and any copies. ---- > On 19 Aug 2015, at 08:14, Susant Palai <spalai@xxxxxxxxxx> wrote: > > Comments inline. > > ----- Original Message ----- >> From: "Christophe TREFOIS" <christophe.trefois@xxxxxx> >> To: "Susant Palai" <spalai@xxxxxxxxxx> >> Cc: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>, "Nithya Balachandran" <nbalacha@xxxxxxxxxx>, "Shyamsundar >> Ranganathan" <srangana@xxxxxxxxxx>, "Mohammed Rafi K C" <rkavunga@xxxxxxxxxx>, "Gluster Devel" >> <gluster-devel@xxxxxxxxxxx> >> Sent: Tuesday, August 18, 2015 8:08:41 PM >> Subject: Re: Skipped files during rebalance >> >> Hi Susan, >> >> Thank you for the response. >> >>> On 18 Aug 2015, at 10:45, Susant Palai <spalai@xxxxxxxxxx> wrote: >>> >>> Hi Christophe, >>> >>> Need some info regarding the high mem-usage. >>> >>> 1. Top output: To see whether any other process eating up memory. > > I will be interested to know the memory usage of all the gluster process referring to the high mem-usage. These process includes glusterfsd, glusterd, gluster, any mount process (glusterfs), and rebalance(glusterfs). > > >>> 2. Gluster volume info >> >> root@highlander ~]# gluster volume info >> >> Volume Name: live >> Type: Distribute >> Volume ID: 1328637d-7730-4627-8945-bbe43626d527 >> Status: Started >> Number of Bricks: 9 >> Transport-type: tcp >> Bricks: >> Brick1: stor104:/zfs/brick0/brick >> Brick2: stor104:/zfs/brick1/brick >> Brick3: stor104:/zfs/brick2/brick >> Brick4: stor106:/zfs/brick0/brick >> Brick5: stor106:/zfs/brick1/brick >> Brick6: stor106:/zfs/brick2/brick >> Brick7: stor105:/zfs/brick0/brick >> Brick8: stor105:/zfs/brick1/brick >> Brick9: stor105:/zfs/brick2/brick >> Options Reconfigured: >> diagnostics.count-fop-hits: on >> diagnostics.latency-measurement: on >> server.allow-insecure: on >> cluster.min-free-disk: 1% >> diagnostics.brick-log-level: ERROR >> diagnostics.client-log-level: ERROR >> cluster.data-self-heal-algorithm: full >> performance.cache-max-file-size: 4MB >> performance.cache-refresh-timeout: 60 >> performance.cache-size: 1GB >> performance.client-io-threads: on >> performance.io-thread-count: 32 >> performance.write-behind-window-size: 4MB >> >>> 3. Is rebalance process still running? If yes can you point to specific mem >>> usage by rebalance process? The high mem-usage was seen during rebalance >>> or even post rebalance? >> >> I would like to restart the rebalance process since it failed… But I can’t as >> the volume cannot be stopped (I wanted to reboot the servers to have a clean >> testing grounds). >> >> Here are the logs from the three nodes: >> http://paste.fedoraproject.org/256183/43989079 >> >> Maybe you could help me figure out how to stop the volume? >> >> This is what happens >> >> [root@highlander ~]# gluster volume rebalance live stop >> volume rebalance: live: failed: Rebalance not started. > > Requesting glusterd team to give input. >> >> [root@highlander ~]# ssh stor105 "gluster volume rebalance live stop" >> volume rebalance: live: failed: Rebalance not started. 
>> >> [root@highlander ~]# ssh stor104 "gluster volume rebalance live stop" >> volume rebalance: live: failed: Rebalance not started. >> >> [root@highlander ~]# ssh stor106 "gluster volume rebalance live stop" >> volume rebalance: live: failed: Rebalance not started. >> >> [root@highlander ~]# gluster volume rebalance live stop >> volume rebalance: live: failed: Rebalance not started. >> >> [root@highlander ~]# gluster volume stop live >> Stopping volume will make its data inaccessible. Do you want to continue? >> (y/n) y >> volume stop: live: failed: Staging failed on stor106. Error: rebalance >> session is in progress for the volume 'live' >> Staging failed on stor104. Error: rebalance session is in progress for the >> volume ‘live' > Can you run [ps aux | grep "rebalance"] on all the servers and post here? Just want to check whether rebalance is really running or not. Again requesting glusterd team to give inputs. > >> >> >>> 4. Gluster version >> >> [root@highlander ~]# pdsh -g live 'rpm -qa | grep gluster' >> stor104: glusterfs-api-3.7.3-1.el7.x86_64 >> stor104: glusterfs-server-3.7.3-1.el7.x86_64 >> stor104: glusterfs-libs-3.7.3-1.el7.x86_64 >> stor104: glusterfs-3.7.3-1.el7.x86_64 >> stor104: glusterfs-fuse-3.7.3-1.el7.x86_64 >> stor104: glusterfs-client-xlators-3.7.3-1.el7.x86_64 >> stor104: glusterfs-cli-3.7.3-1.el7.x86_64 >> >> stor105: glusterfs-3.7.3-1.el7.x86_64 >> stor105: glusterfs-client-xlators-3.7.3-1.el7.x86_64 >> stor105: glusterfs-api-3.7.3-1.el7.x86_64 >> stor105: glusterfs-cli-3.7.3-1.el7.x86_64 >> stor105: glusterfs-server-3.7.3-1.el7.x86_64 >> stor105: glusterfs-libs-3.7.3-1.el7.x86_64 >> stor105: glusterfs-fuse-3.7.3-1.el7.x86_64 >> >> stor106: glusterfs-libs-3.7.3-1.el7.x86_64 >> stor106: glusterfs-fuse-3.7.3-1.el7.x86_64 >> stor106: glusterfs-client-xlators-3.7.3-1.el7.x86_64 >> stor106: glusterfs-api-3.7.3-1.el7.x86_64 >> stor106: glusterfs-cli-3.7.3-1.el7.x86_64 >> stor106: glusterfs-server-3.7.3-1.el7.x86_64 >> stor106: glusterfs-3.7.3-1.el7.x86_64 >> >>> >>> Will ask for more information in case needed. >>> >>> Regards, >>> Susant >>> >>> >>> ----- Original Message ----- >>>> From: "Christophe TREFOIS" <christophe.trefois@xxxxxx> >>>> To: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>, "Nithya Balachandran" >>>> <nbalacha@xxxxxxxxxx>, "Susant Palai" >>>> <spalai@xxxxxxxxxx>, "Shyamsundar Ranganathan" <srangana@xxxxxxxxxx> >>>> Cc: "Mohammed Rafi K C" <rkavunga@xxxxxxxxxx> >>>> Sent: Monday, 17 August, 2015 7:03:20 PM >>>> Subject: Fwd: Skipped files during rebalance >>>> >>>> Hi DHT team, >>>> >>>> This email somehow didn’t get forwarded to you. >>>> >>>> In addition to my problem described below, here is one example of free >>>> memory >>>> after everything failed >>>> >>>> [root@highlander ~]# pdsh -g live 'free -m' >>>> stor106: total used free shared >>>> buff/cache >>>> available >>>> stor106: Mem: 193249 124784 1347 9 >>>> 67118 >>>> 12769 >>>> stor106: Swap: 0 0 0 >>>> stor104: total used free shared >>>> buff/cache >>>> available >>>> stor104: Mem: 193249 107617 31323 9 >>>> 54308 >>>> 42752 >>>> stor104: Swap: 0 0 0 >>>> stor105: total used free shared >>>> buff/cache >>>> available >>>> stor105: Mem: 193248 141804 6736 9 >>>> 44707 >>>> 9713 >>>> stor105: Swap: 0 0 0 >>>> >>>> So after the failed operation, there’s almost no memory free, and it is >>>> also >>>> not freed up. 
>>>> >>>> Thank you for pointing me to any directions, >>>> >>>> Kind regards, >>>> >>>> — >>>> Christophe >>>> >>>> >>>> Begin forwarded message: >>>> >>>> From: Christophe TREFOIS >>>> <christophe.trefois@xxxxxx<mailto:christophe.trefois@xxxxxx>> >>>> Subject: Re: Skipped files during rebalance >>>> Date: 17 Aug 2015 11:54:32 CEST >>>> To: Mohammed Rafi K C <rkavunga@xxxxxxxxxx<mailto:rkavunga@xxxxxxxxxx>> >>>> Cc: "gluster-devel@xxxxxxxxxxx<mailto:gluster-devel@xxxxxxxxxxx>" >>>> <gluster-devel@xxxxxxxxxxx<mailto:gluster-devel@xxxxxxxxxxx>> >>>> >>>> Dear Rafi, >>>> >>>> Thanks for submitting a patch. >>>> >>>> @DHT, I have two additional questions / problems. >>>> >>>> 1. When doing a rebalance (with data) RAM consumption on the nodes goes >>>> dramatically high, eg out of 196 GB available per node, RAM usage would >>>> fill >>>> up to 195.6 GB. This seems quite excessive and strange to me. >>>> >>>> 2. As you can see, the rebalance (with data) failed as one endpoint >>>> becomes >>>> unconnected (even though it still is connected). I’m thinking this could >>>> be >>>> due to the high RAM usage? >>>> >>>> Thank you for your help, >>>> >>>> — >>>> Christophe >>>> >>>> Dr Christophe Trefois, Dipl.-Ing. >>>> Technical Specialist / Post-Doc >>>> >>>> UNIVERSITÉ DU LUXEMBOURG >>>> >>>> LUXEMBOURG CENTRE FOR SYSTEMS BIOMEDICINE >>>> Campus Belval | House of Biomedicine >>>> 6, avenue du Swing >>>> L-4367 Belvaux >>>> T: +352 46 66 44 6124 >>>> F: +352 46 66 44 6949 >>>> http://www.uni.lu/lcsb >>>> >>>> [Facebook]<https://www.facebook.com/trefex> [Twitter] >>>> <https://twitter.com/Trefex> [Google Plus] >>>> <https://plus.google.com/+ChristopheTrefois/> [Linkedin] >>>> <https://www.linkedin.com/in/trefoischristophe> [skype] >>>> <http://skype:Trefex?call> >>>> >>>> >>>> ---- >>>> This message is confidential and may contain privileged information. >>>> It is intended for the named recipient only. >>>> If you receive it in error please notify me and permanently delete the >>>> original message and any copies. >>>> ---- >>>> >>>> >>>> >>>> On 17 Aug 2015, at 11:27, Mohammed Rafi K C >>>> <rkavunga@xxxxxxxxxx<mailto:rkavunga@xxxxxxxxxx>> wrote: >>>> >>>> >>>> >>>> On 08/17/2015 01:58 AM, Christophe TREFOIS wrote: >>>> Dear all, >>>> >>>> I have successfully added a new node to our setup, and finally managed to >>>> get >>>> a successful fix-layout run as well with no errors. >>>> >>>> Now, as per the documentation, I started a gluster volume rebalance live >>>> start task and I see many skipped files. >>>> The error log contains then entires as follows for each skipped file. 
>>>> >>>> [2015-08-16 20:23:30.591161] E [MSGID: 109023] >>>> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file >>>> failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Mea >>>> s_05(2013-10-11_17-12-02)/004010008.flex lookup failed >>>> [2015-08-16 20:23:30.768391] E [MSGID: 109023] >>>> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file >>>> failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Mea >>>> s_05(2013-10-11_17-12-02)/007005003.flex lookup failed >>>> [2015-08-16 20:23:30.804811] E [MSGID: 109023] >>>> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file >>>> failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Mea >>>> s_05(2013-10-11_17-12-02)/006005009.flex lookup failed >>>> [2015-08-16 20:23:30.805201] E [MSGID: 109023] >>>> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file >>>> failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Mea >>>> s_05(2013-10-11_17-12-02)/005006011.flex lookup failed >>>> [2015-08-16 20:23:30.880037] E [MSGID: 109023] >>>> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file >>>> failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Mea >>>> s_05(2013-10-11_17-12-02)/005009012.flex lookup failed >>>> [2015-08-16 20:23:31.038236] E [MSGID: 109023] >>>> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file >>>> failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Mea >>>> s_05(2013-10-11_17-12-02)/003008007.flex lookup failed >>>> [2015-08-16 20:23:31.259762] E [MSGID: 109023] >>>> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file >>>> failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Mea >>>> s_05(2013-10-11_17-12-02)/004008006.flex lookup failed >>>> [2015-08-16 20:23:31.333764] E [MSGID: 109023] >>>> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file >>>> failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Mea >>>> s_05(2013-10-11_17-12-02)/007008001.flex lookup failed >>>> [2015-08-16 20:23:31.340190] E [MSGID: 109023] >>>> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file >>>> failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Mea >>>> s_05(2013-10-11_17-12-02)/006007004.flex lookup failed >>>> >>>> Update: one of the rebalance tasks now failed. >>>> >>>> @Rafi, I got the same error as Friday except this time with data. >>>> >>>> Packets that carrying the ping request could be waiting in the queue >>>> during >>>> the whole time-out period, because of the heavy traffic in the network. I >>>> have sent a patch for this. You can track the status here : >>>> http://review.gluster.org/11935 >>>> >>>> >>>> >>>> [2015-08-16 20:24:34.533167] C >>>> [rpc-clnt-ping.c:161:rpc_clnt_ping_timer_expired] 0-live-client-0: server >>>> 192.168.123.104:49164 has not responded in the last 42 seconds, >>>> disconnecting. 
>>>> [2015-08-16 20:24:34.533614] E [rpc-clnt.c:362:saved_frames_unwind] (--> >>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> >>>> /lib64/libgfrpc.so.0(saved_frames_unwin >>>> d+0x1de)[0x7fa454bb09be] (--> >>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> >>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] >>>> (--> >>>> /lib64/li >>>> bgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) 0-live-client-0: >>>> forced unwinding frame type(GlusterFS 3.3) op(INODELK(29)) called at >>>> 2015-08-16 20:23:51.305640 (xid=0x5dd4da) >>>> [2015-08-16 20:24:34.533672] E [MSGID: 114031] >>>> [client-rpc-fops.c:1621:client3_3_inodelk_cbk] 0-live-client-0: remote >>>> operation failed [Transport endpoint is not connected] >>>> [2015-08-16 20:24:34.534201] E [rpc-clnt.c:362:saved_frames_unwind] (--> >>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> >>>> /lib64/libgfrpc.so.0(saved_frames_unwin >>>> d+0x1de)[0x7fa454bb09be] (--> >>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> >>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] >>>> (--> >>>> /lib64/li >>>> bgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) 0-live-client-0: >>>> forced unwinding frame type(GlusterFS 3.3) op(READ(12)) called at >>>> 2015-08-16 >>>> 20:23:51.303938 (xid=0x5dd4d7) >>>> [2015-08-16 20:24:34.534347] E [MSGID: 109023] >>>> [dht-rebalance.c:1124:dht_migrate_file] 0-live-dht: Migrate file failed: >>>> /hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_ >>>> 12(2013-10-12_00-12-55)/007008007.flex: failed to migrate data >>>> [2015-08-16 20:24:34.534413] E [rpc-clnt.c:362:saved_frames_unwind] (--> >>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> >>>> /lib64/libgfrpc.so.0(saved_frames_unwin >>>> d+0x1de)[0x7fa454bb09be] (--> >>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> >>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] >>>> (--> >>>> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) >>>> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(READ(12)) >>>> called at 2015-08-16 20:23:51.303969 (xid=0x5dd4d8) >>>> [2015-08-16 20:24:34.534579] E [MSGID: 109023] >>>> [dht-rebalance.c:1124:dht_migrate_file] 0-live-dht: Migrate file failed: >>>> /hcs/hcs/OperaArchiveCol/SK >>>> 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/007009012.flex: >>>> failed to migrate data >>>> [2015-08-16 20:24:34.534676] E [rpc-clnt.c:362:saved_frames_unwind] (--> >>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> >>>> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> >>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> >>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] >>>> (--> >>>> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) >>>> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(READ(12)) >>>> called at 2015-08-16 20:23:51.313548 (xid=0x5dd4db) >>>> [2015-08-16 20:24:34.534745] E [MSGID: 109023] >>>> [dht-rebalance.c:1124:dht_migrate_file] 0-live-dht: Migrate file failed: >>>> /hcs/hcs/OperaArchiveCol/SK >>>> 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/006008011.flex: >>>> failed to migrate data >>>> [2015-08-16 20:24:34.535199] E [rpc-clnt.c:362:saved_frames_unwind] (--> >>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] 
(--> >>>> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> >>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> >>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] >>>> (--> >>>> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) >>>> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(READ(12)) >>>> called at 2015-08-16 20:23:51.326369 (xid=0x5dd4dc) >>>> [2015-08-16 20:24:34.535232] E [MSGID: 109023] >>>> [dht-rebalance.c:1124:dht_migrate_file] 0-live-dht: Migrate file failed: >>>> /hcs/hcs/OperaArchiveCol/SK >>>> 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/005003001.flex: >>>> failed to migrate data >>>> [2015-08-16 20:24:34.535984] E [rpc-clnt.c:362:saved_frames_unwind] (--> >>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> >>>> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> >>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> >>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] >>>> (--> >>>> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) >>>> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(READ(12)) >>>> called at 2015-08-16 20:23:51.326437 (xid=0x5dd4dd) >>>> [2015-08-16 20:24:34.536069] E [MSGID: 109023] >>>> [dht-rebalance.c:1124:dht_migrate_file] 0-live-dht: Migrate file failed: >>>> /hcs/hcs/OperaArchiveCol/SK >>>> 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/007010012.flex: >>>> failed to migrate data >>>> [2015-08-16 20:24:34.536267] E [rpc-clnt.c:362:saved_frames_unwind] (--> >>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> >>>> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> >>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> >>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] >>>> (--> >>>> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) >>>> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) >>>> called at 2015-08-16 20:23:51.337240 (xid=0x5dd4de) >>>> [2015-08-16 20:24:34.536339] E [MSGID: 109023] >>>> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file >>>> failed:/hcs/hcs/OperaArchiveCol/SK >>>> 20131011_Oligo_Rot_lowConc_P1/Meas_08(2013-10-11_20-12-25)/002005012.flex >>>> lookup failed >>>> [2015-08-16 20:24:34.536487] E [rpc-clnt.c:362:saved_frames_unwind] (--> >>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> >>>> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> >>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> >>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] >>>> (--> >>>> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) >>>> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) >>>> called at 2015-08-16 20:23:51.425254 (xid=0x5dd4df) >>>> [2015-08-16 20:24:34.536685] E [rpc-clnt.c:362:saved_frames_unwind] (--> >>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> >>>> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> >>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> >>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] >>>> (--> >>>> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) >>>> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) 
op(LOOKUP(27)) >>>> called at 2015-08-16 20:23:51.738907 (xid=0x5dd4e0) >>>> [2015-08-16 20:24:34.536891] E [rpc-clnt.c:362:saved_frames_unwind] (--> >>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> >>>> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> >>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> >>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] >>>> (--> >>>> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) >>>> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) >>>> called at 2015-08-16 20:23:51.805096 (xid=0x5dd4e1) >>>> [2015-08-16 20:24:34.537316] E [rpc-clnt.c:362:saved_frames_unwind] (--> >>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> >>>> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> >>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> >>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] >>>> (--> >>>> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) >>>> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) >>>> called at 2015-08-16 20:23:51.805977 (xid=0x5dd4e2) >>>> [2015-08-16 20:24:34.537735] E [rpc-clnt.c:362:saved_frames_unwind] (--> >>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> >>>> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> >>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> >>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] >>>> (--> >>>> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) >>>> 0-live-client-0: forced unwinding frame type(GF-DUMP) op(NULL(2)) called >>>> at >>>> 2015-08-16 20:23:52.530107 (xid=0x5dd4e3) >>>> [2015-08-16 20:24:34.538475] E [MSGID: 114031] >>>> [client-rpc-fops.c:1621:client3_3_inodelk_cbk] 0-live-client-0: remote >>>> operation failed [Transport endpoint is not connected] >>>> The message "E [MSGID: 114031] >>>> [client-rpc-fops.c:1621:client3_3_inodelk_cbk] >>>> 0-live-client-0: remote operation failed [Transport endpoint is not >>>> connected]" repeated 4 times between [2015-08-16 20:24:34.538475] and >>>> [2015-08-16 20:24:34.538535] >>>> [2015-08-16 20:24:34.538584] E [MSGID: 109023] >>>> [dht-rebalance.c:1617:gf_defrag_migrate_single_file] 0-live-dht: Migrate >>>> file failed: 002004003.flex lookup failed >>>> [2015-08-16 20:24:34.538904] E [MSGID: 109023] >>>> [dht-rebalance.c:1617:gf_defrag_migrate_single_file] 0-live-dht: Migrate >>>> file failed: 003009008.flex lookup failed >>>> [2015-08-16 20:24:34.539724] E [MSGID: 109023] >>>> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file >>>> failed:/hcs/hcs/OperaArchiveCol/SK >>>> 20131011_Oligo_Rot_lowConc_P1/Meas_08(2013-10-11_20-12-25)/005009006.flex >>>> lookup failed >>>> [2015-08-16 20:24:34.539820] E [MSGID: 109016] >>>> [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed >>>> for /hcs/hcs/OperaArchiveCol/SK >>>> 20131011_Oligo_Rot_lowConc_P1/Meas_08(2013-10-11_20-12-25) >>>> [2015-08-16 20:24:34.540031] E [MSGID: 109016] >>>> [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed >>>> for /hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1 >>>> [2015-08-16 20:24:34.540691] E [MSGID: 114031] >>>> [client-rpc-fops.c:251:client3_3_mknod_cbk] 0-live-client-0: remote >>>> operation failed. 
Path: /hcs/hcs/OperaArchiveCol/SK >>>> 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/002005008.flex >>>> [Transport endpoint is not connected] >>>> [2015-08-16 20:24:34.541152] E [MSGID: 114031] >>>> [client-rpc-fops.c:251:client3_3_mknod_cbk] 0-live-client-0: remote >>>> operation failed. Path: /hcs/hcs/OperaArchiveCol/SK >>>> 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/005004009.flex >>>> [Transport endpoint is not connected] >>>> [2015-08-16 20:24:34.541331] E [MSGID: 114031] >>>> [client-rpc-fops.c:251:client3_3_mknod_cbk] 0-live-client-0: remote >>>> operation failed. Path: /hcs/hcs/OperaArchiveCol/SK >>>> 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/007005011.flex >>>> [Transport endpoint is not connected] >>>> [2015-08-16 20:24:34.541486] E [MSGID: 109016] >>>> [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed >>>> for /hcs/hcs/OperaArchiveCol >>>> [2015-08-16 20:24:34.541572] E [MSGID: 109016] >>>> [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed >>>> for /hcs/hcs >>>> [2015-08-16 20:24:34.541639] E [MSGID: 109016] >>>> [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed >>>> for /hcs >>>> >>>> Any help would be greatly appreciated. >>>> CCing dht teams to give you better idea about why rebalance failed/ and >>>> about >>>> huge memory consumption by rebalance process (200GB RAM) . >>>> >>>> Regards >>>> Rafi KC >>>> >>>> >>>> >>>> >>>> Thanks, >>>> >>>> -- >>>> Christophe >>>> >>>> Dr Christophe Trefois, Dipl.-Ing. >>>> Technical Specialist / Post-Doc >>>> >>>> UNIVERSITÉ DU LUXEMBOURG >>>> >>>> LUXEMBOURG CENTRE FOR SYSTEMS BIOMEDICINE >>>> Campus Belval | House of Biomedicine >>>> 6, avenue du Swing >>>> L-4367 Belvaux >>>> T: +352 46 66 44 6124 >>>> F: +352 46 66 44 6949 >>>> http://www.uni.lu/lcsb >>>> >>>> ---- >>>> This message is confidential and may contain privileged information. >>>> It is intended for the named recipient only. >>>> If you receive it in error please notify me and permanently delete the >>>> original message and any copies. >>>> ---- >>>> >>>> >>>> >>>> >>>> >>>> >>>> _______________________________________________ >>>> Gluster-devel mailing list >>>> Gluster-devel@xxxxxxxxxxx<mailto:Gluster-devel@xxxxxxxxxxx> >>>> http://www.gluster.org/mailman/listinfo/gluster-devel >> >> _______________________________________________ Gluster-devel mailing list Gluster-devel@xxxxxxxxxxx http://www.gluster.org/mailman/listinfo/gluster-devel