Hi,
I found this bug report: https://bugzilla.redhat.com/show_bug.cgi?id=1261234 . My version is 3.7.4, which is older than the fixed version, 3.7.5.
I'll upgrade my gluster version and try again later.
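For reference, this is roughly how I plan to verify the packages and restart the task after upgrading (a minimal sketch; the volume name FastVol is taken from the earlier messages):

============== shell ==============
# confirm the installed version before and after the upgrade
gluster --version
rpm -q glusterfs-server

# once every node runs the fixed release, kick off the rebalance again
gluster volume rebalance FastVol start
gluster volume rebalance FastVol status
================================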
Thank you.
PuYun
From: PuYun
Date: 2015-12-15 21:58
To: gluster-users
Subject: Re: How to diagnose volume rebalance failure?

Hi Susant,

I found errors in /var/log/messages. The glusterfs process was killed because of an OOM error. According to my previously posted logs, the rebalance task stopped at 2015-12-14 21:46:53.

================== FastVol-rebalance.log ============
...
[2015-12-14 21:46:53.423719] I [MSGID: 109022] [dht-rebalance.c:1290:dht_migrate_file] 0-FastVol-dht: completed migration of /for_ybest_fsdir/user/ji/ay/up/a19640529/linkwrap/129836/07.jpg from subvolume FastVol-client-0 to FastVol-client-1
[2015-12-14 21:46:53.423976] I [MSGID: 109022] [dht-rebalance.c:1290:dht_migrate_file] 0-FastVol-dht: completed migration of /for_ybest_fsdir/user/ji/ay/up/a19640529/linkwrap/126724/1d0ca0de913c4e50f85f2b29694e4e64.html from subvolume FastVol-client-0 to FastVol-client-1
[2015-12-14 21:46:53.436268] I [dht-rebalance.c:1010:dht_migrate_file] 0-FastVol-dht: /for_ybest_fsdir/user/ji/ay/up/a19640529/linkwrap/129836/0.jpg: attempting to move from FastVol-client-0 to FastVol-client-1
[2015-12-14 21:46:53.436597] I [dht-rebalance.c:1010:dht_migrate_file] 0-FastVol-dht: /for_ybest_fsdir/user/ji/ay/up/a19640529/linkwrap/129836/icon_loading_white22c04a.gif: attempting to move from FastVol-client-0 to FastVol-client-1
<EOF>
==============================================

My timezone is GMT+8:00, so the same moment in local time is 2015-12-15 05:46:53. Here is the corresponding part of /var/log/messages:

================== /var/log/messages ===============
Dec 15 05:46:53 d001 kernel: QThread invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
Dec 15 05:46:53 d001 kernel: QThread cpuset=/ mems_allowed=0
Dec 15 05:46:53 d001 kernel: Pid: 1656, comm: QThread Not tainted 2.6.32-431.23.3.el6.x86_64 #1
Dec 15 05:46:53 d001 kernel: Call Trace:
Dec 15 05:46:53 d001 kernel: [<ffffffff810d0431>] ? cpuset_print_task_mems_allowed+0x91/0xb0
Dec 15 05:46:53 d001 kernel: [<ffffffff81122810>] ? dump_header+0x90/0x1b0
Dec 15 05:46:53 d001 kernel: [<ffffffff8122833c>] ? security_real_capable_noaudit+0x3c/0x70
Dec 15 05:46:53 d001 kernel: [<ffffffff81122c92>] ? oom_kill_process+0x82/0x2a0
Dec 15 05:46:53 d001 kernel: [<ffffffff81122bd1>] ? select_bad_process+0xe1/0x120
Dec 15 05:46:53 d001 kernel: [<ffffffff811230d0>] ? out_of_memory+0x220/0x3c0
Dec 15 05:46:53 d001 kernel: [<ffffffff8112f9ef>] ? __alloc_pages_nodemask+0x89f/0x8d0
Dec 15 05:46:53 d001 kernel: [<ffffffff811678ea>] ? alloc_pages_current+0xaa/0x110
Dec 15 05:46:53 d001 kernel: [<ffffffff8111fc07>] ? __page_cache_alloc+0x87/0x90
Dec 15 05:46:53 d001 kernel: [<ffffffff8111f5ee>] ? find_get_page+0x1e/0xa0
Dec 15 05:46:53 d001 kernel: [<ffffffff81120ba7>] ? filemap_fault+0x1a7/0x500
Dec 15 05:46:53 d001 kernel: [<ffffffff81149ed4>] ? __do_fault+0x54/0x530
Dec 15 05:46:53 d001 kernel: [<ffffffff8114a4a7>] ? handle_pte_fault+0xf7/0xb00
Dec 15 05:46:53 d001 kernel: [<ffffffff81060aa3>] ? perf_event_task_sched_out+0x33/0x70
Dec 15 05:46:53 d001 kernel: [<ffffffff8114b0da>] ? handle_mm_fault+0x22a/0x300
Dec 15 05:46:53 d001 kernel: [<ffffffff8104a8d8>] ? __do_page_fault+0x138/0x480
Dec 15 05:46:53 d001 kernel: [<ffffffff810a011f>] ? hrtimer_try_to_cancel+0x3f/0xd0
Dec 15 05:46:53 d001 kernel: [<ffffffff810a01d2>] ? hrtimer_cancel+0x22/0x30
Dec 15 05:46:53 d001 kernel: [<ffffffff8152a7e3>] ? do_nanosleep+0x93/0xc0
Dec 15 05:46:53 d001 kernel: [<ffffffff810a02a4>] ? hrtimer_nanosleep+0xc4/0x180
Dec 15 05:46:53 d001 kernel: [<ffffffff8109f100>] ? hrtimer_wakeup+0x0/0x30
Dec 15 05:46:53 d001 kernel: [<ffffffff8152e74e>] ? do_page_fault+0x3e/0xa0
Dec 15 05:46:53 d001 kernel: [<ffffffff8152bb05>] ? page_fault+0x25/0x30
Dec 15 05:46:53 d001 kernel: Mem-Info:
Dec 15 05:46:53 d001 kernel: Node 0 DMA per-cpu:
Dec 15 05:46:53 d001 kernel: CPU 0: hi: 0, btch: 1 usd: 0
Dec 15 05:46:53 d001 kernel: CPU 1: hi: 0, btch: 1 usd: 0
Dec 15 05:46:53 d001 kernel: CPU 2: hi: 0, btch: 1 usd: 0
Dec 15 05:46:53 d001 kernel: CPU 3: hi: 0, btch: 1 usd: 0
Dec 15 05:46:53 d001 kernel: CPU 4: hi: 0, btch: 1 usd: 0
Dec 15 05:46:53 d001 kernel: CPU 5: hi: 0, btch: 1 usd: 0
Dec 15 05:46:53 d001 kernel: CPU 6: hi: 0, btch: 1 usd: 0
Dec 15 05:46:53 d001 kernel: CPU 7: hi: 0, btch: 1 usd: 0
Dec 15 05:46:53 d001 kernel: Node 0 DMA32 per-cpu:
Dec 15 05:46:53 d001 kernel: CPU 0: hi: 186, btch: 31 usd: 199
Dec 15 05:46:53 d001 kernel: CPU 1: hi: 186, btch: 31 usd: 58
Dec 15 05:46:53 d001 kernel: CPU 2: hi: 186, btch: 31 usd: 24
Dec 15 05:46:53 d001 kernel: CPU 3: hi: 186, btch: 31 usd: 75
Dec 15 05:46:53 d001 kernel: CPU 4: hi: 186, btch: 31 usd: 156
Dec 15 05:46:53 d001 kernel: CPU 5: hi: 186, btch: 31 usd: 60
Dec 15 05:46:53 d001 kernel: CPU 6: hi: 186, btch: 31 usd: 32
Dec 15 05:46:53 d001 kernel: CPU 7: hi: 186, btch: 31 usd: 7
Dec 15 05:46:53 d001 kernel: Node 0 Normal per-cpu:
Dec 15 05:46:53 d001 kernel: CPU 0: hi: 186, btch: 31 usd: 155
Dec 15 05:46:53 d001 kernel: CPU 1: hi: 186, btch: 31 usd: 0
Dec 15 05:46:53 d001 kernel: CPU 2: hi: 186, btch: 31 usd: 0
Dec 15 05:46:53 d001 kernel: CPU 3: hi: 186, btch: 31 usd: 40
Dec 15 05:46:53 d001 kernel: CPU 4: hi: 186, btch: 31 usd: 167
Dec 15 05:46:53 d001 kernel: CPU 5: hi: 186, btch: 31 usd: 0
Dec 15 05:46:53 d001 kernel: CPU 6: hi: 186, btch: 31 usd: 0
Dec 15 05:46:53 d001 kernel: CPU 7: hi: 186, btch: 31 usd: 2
Dec 15 05:46:53 d001 kernel: active_anon:1955292 inactive_anon:38 isolated_anon:0
Dec 15 05:46:53 d001 kernel: active_file:358 inactive_file:1728 isolated_file:0
Dec 15 05:46:53 d001 kernel: unevictable:0 dirty:14 writeback:0 unstable:0
Dec 15 05:46:53 d001 kernel: free:25714 slab_reclaimable:2453 slab_unreclaimable:7498
Dec 15 05:46:53 d001 kernel: mapped:39 shmem:43 pagetables:4664 bounce:0
Dec 15 05:46:53 d001 kernel: Node 0 DMA free:15752kB min:124kB low:152kB high:184kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15364kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Dec 15 05:46:53 d001 kernel: lowmem_reserve[]: 0 3000 8050 8050
Dec 15 05:46:53 d001 kernel: Node 0 DMA32 free:44860kB min:25140kB low:31424kB high:37708kB active_anon:2737816kB inactive_anon:0kB active_file:732kB inactive_file:5856kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3072096kB mlocked:0kB dirty:32kB writeback:0kB mapped:160kB shmem:0kB slab_reclaimable:1696kB slab_unreclaimable:1352kB kernel_stack:72kB pagetables:5460kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:2447 all_unreclaimable? yes
Dec 15 05:46:53 d001 kernel: lowmem_reserve[]: 0 0 5050 5050
Dec 15 05:46:53 d001 kernel: Node 0 Normal free:42244kB min:42316kB low:52892kB high:63472kB active_anon:5083352kB inactive_anon:152kB active_file:700kB inactive_file:1056kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:5171200kB mlocked:0kB dirty:24kB writeback:4kB mapped:0kB shmem:172kB slab_reclaimable:8116kB slab_unreclaimable:28640kB kernel_stack:2176kB pagetables:13196kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:2817 all_unreclaimable? yes
Dec 15 05:46:53 d001 kernel: lowmem_reserve[]: 0 0 0 0
Dec 15 05:46:53 d001 kernel: Node 0 DMA: 2*4kB 2*8kB 1*16kB 1*32kB 1*64kB 0*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15752kB
Dec 15 05:46:53 d001 kernel: Node 0 DMA32: 11186*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 44744kB
Dec 15 05:46:53 d001 kernel: Node 0 Normal: 10729*4kB 3*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 42940kB
Dec 15 05:46:53 d001 kernel: 2634 total pagecache pages
Dec 15 05:46:53 d001 kernel: 0 pages in swap cache
Dec 15 05:46:53 d001 kernel: Swap cache stats: add 0, delete 0, find 0/0
Dec 15 05:46:53 d001 kernel: Free swap = 0kB
Dec 15 05:46:53 d001 kernel: Total swap = 0kB
Dec 15 05:46:53 d001 kernel: 2097151 pages RAM
Dec 15 05:46:53 d001 kernel: 81926 pages reserved
Dec 15 05:46:53 d001 kernel: 1752 pages shared
Dec 15 05:46:53 d001 kernel: 1983480 pages non-shared
Dec 15 05:46:53 d001 kernel: [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
Dec 15 05:46:53 d001 kernel: [ 477] 0 477 2662 105 0 -17 -1000 udevd
Dec 15 05:46:53 d001 kernel: [ 916] 0 916 374 21 0 0 0 aliyun-service
Dec 15 05:46:53 d001 kernel: [ 1156] 0 1156 62798 149 0 0 0 rsyslogd
Dec 15 05:46:53 d001 kernel: [ 1178] 32 1178 4744 62 0 0 0 rpcbind
Dec 15 05:46:53 d001 kernel: [ 1198] 29 1198 5837 112 1 0 0 rpc.statd
Dec 15 05:46:53 d001 kernel: [ 1226] 0 1226 201835 4673 0 0 0 glusterd
Dec 15 05:46:53 d001 kernel: [ 1382] 28 1382 157544 115 0 0 0 nscd
Dec 15 05:46:53 d001 kernel: [ 1414] 0 1414 118751 673 1 0 0 AliYunDunUpdate
Dec 15 05:46:53 d001 kernel: [ 1448] 0 1448 16657 178 4 -17 -1000 sshd
Dec 15 05:46:53 d001 kernel: [ 1463] 38 1463 6683 151 0 0 0 ntpd
Dec 15 05:46:53 d001 kernel: [ 1473] 0 1473 29325 154 1 0 0 crond
Dec 15 05:46:53 d001 kernel: [ 1516] 0 1516 1016 19 1 0 0 mingetty
Dec 15 05:46:53 d001 kernel: [ 1518] 0 1518 1016 17 3 0 0 mingetty
Dec 15 05:46:53 d001 kernel: [ 1520] 0 1520 1016 18 5 0 0 mingetty
Dec 15 05:46:53 d001 kernel: [ 1522] 0 1522 2661 104 4 -17 -1000 udevd
Dec 15 05:46:53 d001 kernel: [ 1523] 0 1523 2661 104 4 -17 -1000 udevd
Dec 15 05:46:53 d001 kernel: [ 1524] 0 1524 1016 18 2 0 0 mingetty
Dec 15 05:46:53 d001 kernel: [ 1526] 0 1526 1016 19 4 0 0 mingetty
Dec 15 05:46:53 d001 kernel: [ 1528] 0 1528 1016 19 1 0 0 mingetty
Dec 15 05:46:53 d001 kernel: [ 1581] 0 1581 362551 34027 0 0 0 glusterfsd
Dec 15 05:46:53 d001 kernel: [ 1599] 0 1599 919605 653045 0 0 0 glusterfsd
Dec 15 05:46:53 d001 kernel: [ 1652] 0 1652 191799 1125 5 0 0 AliYunDun
Dec 15 05:46:53 d001 kernel: [ 1670] 0 1670 230065 974 0 0 0 AliHids
Dec 15 05:46:53 d001 kernel: [ 1800] 0 1800 1455594 1259392 0 0 0 glusterfs
Dec 15 05:46:53 d001 kernel: [ 2238] 0 2238 4872 55 1 0 0 anacron
Dec 15 05:46:53 d001 kernel: [ 2242] 0 2242 26524 43 1 0 0 run-parts
Dec 15 05:46:53 d001 kernel: [ 2413] 0 2413 26524 39 5 0 0 mlocate.cron
Dec 15 05:46:53 d001 kernel: [ 2414] 0 2414 26488 35 4 0 0 awk
Dec 15 05:46:53 d001 kernel: [ 2419] 0 2419 25272 117 2 0 0 updatedb
Dec 15 05:46:53 d001 kernel: Out of memory: Kill process 1800 (glusterfs) score 596 or sacrifice child
Dec 15 05:46:53 d001 kernel: Killed process 1800, UID 0, (glusterfs) total-vm:5822376kB, anon-rss:5037552kB, file-rss:16kB
==============================================

However, I can also find an OOM kill in /var/log/messages matching every previous rebalance failure. Is this a memory leak bug?

Thank you.
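(For reference, this is roughly how the OOM events can be matched against the failure times; a minimal sketch, where the default CentOS syslog path and Asia/Shanghai as the GMT+8 zone are assumptions:)

============== shell ==============
# list every OOM-killer event recorded by the kernel
grep -iE 'invoked oom-killer|Out of memory' /var/log/messages

# convert the UTC timestamp from the rebalance log to local (GMT+8) time
TZ='Asia/Shanghai' date -d '2015-12-14 21:46:53 UTC'
================================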
PuYun

From: PuYun
Date: 2015-12-15 21:08
To: gluster-users
Subject: Re: How to diagnose volume rebalance failure?

Hi Susant,

Thank you for your advice. I have set that parameter and started again in lazy mode. I think maybe it will not fail this time, because every time it failed, it happened after migration had started. Another important point is that a fix-layout rebalance, which does no data migration, was able to complete after running for 25 hours.

Also, I can't grep any errors in the rebalance log file using "grep ' E ' FastVol-rebalance.log".

Thank you.
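(For reference, roughly the commands for this step; a minimal sketch, where the default log directory /var/log/glusterfs is an assumption:)

============== shell ==============
# throttle rebalance down to a single migration thread, then restart it
gluster volume set FastVol rebal-throttle lazy
gluster volume rebalance FastVol start

# look for error and warning entries in the rebalance log
grep ' E ' /var/log/glusterfs/FastVol-rebalance.log
grep ' W ' /var/log/glusterfs/FastVol-rebalance.log
================================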
PuYun

From: Susant Palai
Date: 2015-12-15 15:21
To: PuYun
CC: gluster-users
Subject: Re: How to diagnose volume rebalance failure?

Hi PuYun,

We need to figure out some mechanism to get the huge log files. Until then, here is something I can think of that can affect performance. Rebalance normally starts at the medium throttle level [performance wise], which in your case means two migration threads that can hog those 2 cores. If you run rebalance again, run it in lazy mode. Here is the command:

"gluster v set <VOLUME-NAME> rebal-throttle lazy"

This should spawn just one thread for migration.

For logs: can you grep for errors in the rebalance log file and upload them? <till we figure out a method to get full logs>

Thanks,
Susant

----- Original Message -----
From: "PuYun" <cloudor@xxxxxxx>
To: "gluster-users" <gluster-users@xxxxxxxxxxx>
Sent: Tuesday, 15 December, 2015 5:51:00 AM
Subject: Re: How to diagnose volume rebalance failure?

Hi,

Another weird piece of information may be useful. The failed task had actually been running for hours, but the status command reported only 3621 seconds.

============== shell ==============
[root@d001 glusterfs]# gluster volume rebalance FastVol status
Node       Rebalanced-files        size     scanned    failures     skipped      status   run time in secs
---------       -----------  ----------  ----------  ----------  ----------  ----------  ----------------
localhost                 0      0Bytes      952767           0           0      failed           3621.00
volume rebalance: FastVol: success:
================================

As you can see, I started the rebalance task only once.

======== cmd_history.log-20151215 ======
[2015-12-14 12:50:41.443937] : volume start FastVol : SUCCESS
[2015-12-14 12:55:01.367519] : volume rebalance FastVol start : SUCCESS
[2015-12-14 13:55:22.132199] : volume rebalance FastVol status : SUCCESS
[2015-12-14 23:04:01.780885] : volume rebalance FastVol status : SUCCESS
[2015-12-14 23:35:56.708077] : volume rebalance FastVol status : SUCCESS
=================================

Because the task failed at [2015-12-14 21:46:54.179xx], something might have gone wrong 3621 seconds earlier, at [2015-12-14 20:46:33.179xx].
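(For reference, the 3621-second offset above can be computed directly with GNU date; a minimal sketch:)

============== shell ==============
# subtract the reported run time from the failure timestamp (UTC)
date -u -d '2015-12-14 21:46:54 UTC 3621 seconds ago' '+%Y-%m-%d %H:%M:%S'
# -> 2015-12-14 20:46:33
================================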
I checked the logs at that time and found nothing special.

========== FastVol-rebalance.log ========
[2015-12-14 20:46:33.166748] I [dht-rebalance.c:1010:dht_migrate_file] 0-FastVol-dht: /for_ybest_fsdir/user/Weixin.oClDcjjJ/rH/wV/mNv6sX94lypFWdvM/userPoint: attempting to move from FastVol-client-0 to FastVol-client-1
[2015-12-14 20:46:33.171009] I [MSGID: 109022] [dht-rebalance.c:1290:dht_migrate_file] 0-FastVol-dht: completed migration of /for_ybest_fsdir/user/Weixin.oClDcjjJ/t2/n1/VSXZlm65KjfhbgoM/flag_finished from subvolume FastVol-client-0 to FastVol-client-1
[2015-12-14 20:46:33.174851] I [dht-rebalance.c:1010:dht_migrate_file] 0-FastVol-dht: /for_ybest_fsdir/user/Weixin.oClDcjjJ/rH/wV/mNv6sX94lypFWdvM/portrait_origin.jpg: attempting to move from FastVol-client-0 to FastVol-client-1
[2015-12-14 20:46:33.181448] I [MSGID: 109022] [dht-rebalance.c:1290:dht_migrate_file] 0-FastVol-dht: completed migration of /for_ybest_fsdir/user/Weixin.oClDcjjJ/rH/wV/mNv6sX94lypFWdvM/userPoint from subvolume FastVol-client-0 to FastVol-client-1
[2015-12-14 20:46:33.184996] I [dht-rebalance.c:1010:dht_migrate_file] 0-FastVol-dht: /for_ybest_fsdir/user/Weixin.oClDcjjJ/rH/wV/mNv6sX94lypFWdvM/portrait_small.jpg: attempting to move from FastVol-client-0 to FastVol-client-1
[2015-12-14 20:46:33.191681] I [MSGID: 109022] [dht-rebalance.c:1290:dht_migrate_file] 0-FastVol-dht: completed migration of /for_ybest_fsdir/user/Weixin.oClDcjjJ/rH/wV/mNv6sX94lypFWdvM/portrait_origin.jpg from subvolume FastVol-client-0 to FastVol-client-1
[2015-12-14 20:46:33.195396] I [dht-rebalance.c:1010:dht_migrate_file] 0-FastVol-dht: /for_ybest_fsdir/user/Weixin.oClDcjjJ/rH/wV/mNv6sX94lypFWdvM/portrait_big_square.jpg: attempting to move from FastVol-client-0 to FastVol-client-1
==================================

And there are no logs around [2015-12-14 20:46:33.179xx] in mnt-b1-brick.log, mnt-c1-brick.log, etc-glusterfs-glusterd.vol.log and so on.

PuYun

From: PuYun
Date: 2015-12-15 07:30
To: gluster-users
Subject: Re: How to diagnose volume rebalance failure?

Hi,

Failed again.
I can see disconnections in the logs, but no more details.

=========== mnt-b1-brick.log ===========
[2015-12-14 21:46:54.179662] I [MSGID: 115036] [server.c:552:server_rpc_notify] 0-FastVol-server: disconnecting connection from d001-1799-2015/12/14-12:54:56:347561-FastVol-client-1-0-0
[2015-12-14 21:46:54.181764] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /
[2015-12-14 21:46:54.181815] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /for_ybest_fsdir
[2015-12-14 21:46:54.181856] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /for_ybest_fsdir/user
[2015-12-14 21:46:54.181918] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /for_ybest_fsdir/user/ji/ay/up/a19640529/linkwrap/129836/0.jpg
[2015-12-14 21:46:54.181961] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /for_ybest_fsdir/user/ji/ay/an
[2015-12-14 21:46:54.182003] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /for_ybest_fsdir/user/ji/ay/up/a19640529/linkwrap/129836/icon_loading_white22c04a.gif
[2015-12-14 21:46:54.182036] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /for_ybest_fsdir/user/ji
[2015-12-14 21:46:54.182076] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /for_ybest_fsdir/user/ji/ay
[2015-12-14 21:46:54.182110] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /for_ybest_fsdir/user/ji/ay/an/ling00
[2015-12-14 21:46:54.182203] I [MSGID: 101055] [client_t.c:419:gf_client_unref] 0-FastVol-server: Shutting down connection d001-1799-2015/12/14-12:54:56:347561-FastVol-client-1-0-0
==================================

=========== mnt-c1-brick.log ===========
[2015-12-14 21:46:54.179597] I [MSGID: 115036] [server.c:552:server_rpc_notify] 0-FastVol-server: disconnecting connection from d001-1799-2015/12/14-12:54:56:347561-FastVol-client-0-0-0
[2015-12-14 21:46:54.180428] W [inodelk.c:404:pl_inodelk_log_cleanup] 0-FastVol-server: releasing lock on 5e300cdb-7298-44c0-90eb-5b50018daed6 held by {client=0x7effc810cce0, pid=-3 lk-owner=fdffffff}
[2015-12-14 21:46:54.180454] W [inodelk.c:404:pl_inodelk_log_cleanup] 0-FastVol-server: releasing lock on 3c9a1cd5-84c8-4967-98d5-e75a402b1f74 held by {client=0x7effc810cce0, pid=-3 lk-owner=fdffffff}
[2015-12-14 21:46:54.180483] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /
[2015-12-14 21:46:54.180525] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /for_ybest_fsdir
[2015-12-14 21:46:54.180570] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /for_ybest_fsdir/user
[2015-12-14 21:46:54.180604] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /for_ybest_fsdir/user/ji/ay/up/a19640529/linkwrap/129836/0.jpg
[2015-12-14 21:46:54.180634] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /for_ybest_fsdir/user/ji
[2015-12-14 21:46:54.180678] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /for_ybest_fsdir/user/ji/ay
[2015-12-14 21:46:54.180725] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /for_ybest_fsdir/user/ji/ay/an/ling00
[2015-12-14 21:46:54.180779] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /for_ybest_fsdir/user/ji/ay/up/a19640529/linkwrap/129836/icon_loading_white22c04a.gif
[2015-12-14 21:46:54.180820] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /for_ybest_fsdir/user/ji/ay/an
[2015-12-14 21:46:54.180859] I [MSGID: 101055] [client_t.c:419:gf_client_unref] 0-FastVol-server: Shutting down connection d001-1799-2015/12/14-12:54:56:347561-FastVol-client-0-0-0
==================================

========== etc-glusterfs-glusterd.vol.log ==========
[2015-12-14 21:46:54.179819] W [socket.c:588:__socket_rwv] 0-management: readv on /var/run/gluster/gluster-rebalance-dbee250a-e3fe-4448-b905-b76c5ba80b25.sock failed (No data available)
[2015-12-14 21:46:54.209586] I [MSGID: 106007] [glusterd-rebalance.c:162:__glusterd_defrag_notify] 0-management: Rebalance process for volume FastVol has disconnected.
[2015-12-14 21:46:54.209627] I [MSGID: 101053] [mem-pool.c:616:mem_pool_destroy] 0-management: size=588 max=1 total=1
[2015-12-14 21:46:54.209640] I [MSGID: 101053] [mem-pool.c:616:mem_pool_destroy] 0-management: size=124 max=1 total=1
==================================

============ FastVol-rebalance.log ============
...
[2015-12-14 21:46:53.423719] I [MSGID: 109022] [dht-rebalance.c:1290:dht_migrate_file] 0-FastVol-dht: completed migration of /for_ybest_fsdir/user/ji/ay/up/a19640529/linkwrap/129836/07.jpg from subvolume FastVol-client-0 to FastVol-client-1
[2015-12-14 21:46:53.423976] I [MSGID: 109022] [dht-rebalance.c:1290:dht_migrate_file] 0-FastVol-dht: completed migration of /for_ybest_fsdir/user/ji/ay/up/a19640529/linkwrap/126724/1d0ca0de913c4e50f85f2b29694e4e64.html from subvolume FastVol-client-0 to FastVol-client-1
[2015-12-14 21:46:53.436268] I [dht-rebalance.c:1010:dht_migrate_file] 0-FastVol-dht: /for_ybest_fsdir/user/ji/ay/up/a19640529/linkwrap/129836/0.jpg: attempting to move from FastVol-client-0 to FastVol-client-1
[2015-12-14 21:46:53.436597] I [dht-rebalance.c:1010:dht_migrate_file] 0-FastVol-dht: /for_ybest_fsdir/user/ji/ay/up/a19640529/linkwrap/129836/icon_loading_white22c04a.gif: attempting to move from FastVol-client-0 to FastVol-client-1
<EOF>
==============================================

PuYun

From: PuYun
Date: 2015-12-14 21:51
To: gluster-users
Subject: Re: How to diagnose volume rebalance failure?

Hi,

Thank you for your reply. I don't know how to send you the rebalance log file, which is huge (about 2 GB).

However, I might have found out why the task failed. My gluster server has only 2 CPU cores and carries 2 SSD bricks. When the rebalance task began, the top 3 processes were at 70%~80%, 30%~40% and 30%~40% CPU usage, and everything else was below 1%. But after a while both CPU cores were completely used up, and I couldn't even log in until the rebalance task failed. It seems 2 bricks need at least 4 CPU cores. I have now upgraded the virtual server to 8 CPU cores and started the rebalance task again.
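(For reference, a minimal way to watch per-process CPU usage while the rebalance runs, assuming the standard procps tools on CentOS 6:)

============== shell ==============
# snapshot the busiest processes; the rebalance process shows up as "glusterfs"
ps aux --sort=-%cpu | head -5
================================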
Everything goes well for now. I will report again when the current task completes or fails.

PuYun

From: Nithya Balachandran
Date: 2015-12-14 18:57
To: PuYun
CC: gluster-users
Subject: Re: How to diagnose volume rebalance failure?

Hi,

Can you send us the rebalance log?

Regards,
Nithya

----- Original Message -----
> From: "PuYun" <cloudor@xxxxxxx>
> To: "gluster-users" <gluster-users@xxxxxxxxxxx>
> Sent: Monday, December 14, 2015 11:33:40 AM
> Subject: Re: How to diagnose volume rebalance failure?
>
> Here is the tail of the failed rebalance log, any clue?
>
> [2015-12-13 21:30:31.527493] I [dht-rebalance.c:2340:gf_defrag_process_dir] 0-FastVol-dht: Migration operation on dir /for_ybest_fsdir/user/Weixin.oClDcjhe/Ny/5F/1MsH5--BcoGRAJPI took 20.95 secs
> [2015-12-13 21:30:31.528704] I [dht-rebalance.c:1010:dht_migrate_file] 0-FastVol-dht: /for_ybest_fsdir/user/Weixin.oClDcjhe/Kn/hM/oHcPMp4hKq5Tq2ZQ/flag_finished: attempting to move from FastVol-client-0 to FastVol-client-1
> [2015-12-13 21:30:31.543901] I [dht-rebalance.c:1010:dht_migrate_file] 0-FastVol-dht: /for_ybest_fsdir/user/Weixin.oClDcjhe/PU/ps/qUa-n38i8QBgeMdI/userPoint: attempting to move from FastVol-client-0 to FastVol-client-1
> [2015-12-13 21:31:37.210496] I [MSGID: 109081] [dht-common.c:3780:dht_setxattr] 0-FastVol-dht: fixing the layout of /for_ybest_fsdir/user/Weixin.oClDcjhe/Ny/7Q
> [2015-12-13 21:31:37.722825] I [MSGID: 109045] [dht-selfheal.c:1508:dht_fix_layout_of_directory] 0-FastVol-dht: subvolume 0 (FastVol-client-0): 1032124 chunks
> [2015-12-13 21:31:37.722837] I [MSGID: 109045] [dht-selfheal.c:1508:dht_fix_layout_of_directory] 0-FastVol-dht: subvolume 1 (FastVol-client-1): 1032124 chunks
> [2015-12-13 21:33:03.955539] I [MSGID: 109064] [dht-layout.c:808:dht_layout_dir_mismatch] 0-FastVol-dht: subvol: FastVol-client-0; inode layout - 0 - 2146817919 - 1; disk layout - 2146817920 - 4294967295 - 1
> [2015-12-13 21:33:04.069859] I [MSGID: 109018] [dht-common.c:806:dht_revalidate_cbk] 0-FastVol-dht: Mismatching layouts for /for_ybest_fsdir/user/Weixin.oClDcjhe/Ny/7Q, gfid = f38c4ed2-a26a-4d83-adfd-6b0331831738
> [2015-12-13 21:33:04.118800] I [MSGID: 109064] [dht-layout.c:808:dht_layout_dir_mismatch] 0-FastVol-dht: subvol: FastVol-client-1; inode layout - 2146817920 - 4294967295 - 1; disk layout - 0 - 2146817919 - 1
> [2015-12-13 21:33:19.979507] I [MSGID: 109022] [dht-rebalance.c:1290:dht_migrate_file] 0-FastVol-dht: completed migration of /for_ybest_fsdir/user/Weixin.oClDcjhe/Kn/hM/oHcPMp4hKq5Tq2ZQ/flag_finished from subvolume FastVol-client-0 to FastVol-client-1
> [2015-12-13 21:33:19.979459] I [MSGID: 109022] [dht-rebalance.c:1290:dht_migrate_file] 0-FastVol-dht: completed migration of /for_ybest_fsdir/user/Weixin.oClDcjhe/PU/ps/qUa-n38i8QBgeMdI/userPoint from subvolume FastVol-client-0 to FastVol-client-1
> [2015-12-13 21:33:25.543941] I [dht-rebalance.c:1010:dht_migrate_file] 0-FastVol-dht: /for_ybest_fsdir/user/Weixin.oClDcjhe/PU/ps/qUa-n38i8QBgeMdI/portrait_origin.jpg: attempting to move from FastVol-client-0 to FastVol-client-1
> [2015-12-13 21:33:25.962547] I [dht-rebalance.c:1010:dht_migrate_file] 0-FastVol-dht: /for_ybest_fsdir/user/Weixin.oClDcjhe/PU/ps/qUa-n38i8QBgeMdI/portrait_small.jpg: attempting to move from FastVol-client-0 to FastVol-client-1
>
> Cloudor
>
> From: Sakshi Bansal
> Date: 2015-12-12 13:02
> To: 蒲云
> CC: gluster-users
> Subject: Re: How to diagnose volume rebalance failure?
> In the rebalance log file you can check the file/directory for which the rebalance has failed.
> It can mention what was the fop for which the failure happened.
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users@xxxxxxxxxxx
> http://www.gluster.org/mailman/listinfo/gluster-users

_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-users