Hi,
I found this bug report: https://bugzilla.redhat.com/show_bug.cgi?id=1261234 . My version is 3.7.4, which is older than the fixed version, 3.7.5.
I'll upgrade my gluster version and try again later.
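For reference, this is roughly how I plan to verify the packages and restart the task after upgrading (a minimal sketch; the volume name FastVol is taken from the earlier messages):

============== shell ==============
# confirm the installed version before and after the upgrade
gluster --version
rpm -q glusterfs-server

# once every node runs the fixed release, kick off the rebalance again
gluster volume rebalance FastVol start
gluster volume rebalance FastVol status
================================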
Thank you.
PuYun
From: PuYun
Date: 2015-12-15 21:58
To: gluster-users
Subject: Re: How to diagnose volume rebalance failure?

Hi Susant,

I found errors in /var/log/messages. The glusterfs process was killed because of an OOM error. According to my previously posted logs, the rebalance task stopped at 2015-12-14 21:46:53.

================== FastVol-rebalance.log ============
...
[2015-12-14 21:46:53.423719] I [MSGID: 109022] [dht-rebalance.c:1290:dht_migrate_file] 0-FastVol-dht: completed migration of /for_ybest_fsdir/user/ji/ay/up/a19640529/linkwrap/129836/07.jpg from subvolume FastVol-client-0 to FastVol-client-1
[2015-12-14 21:46:53.423976] I [MSGID: 109022] [dht-rebalance.c:1290:dht_migrate_file] 0-FastVol-dht: completed migration of /for_ybest_fsdir/user/ji/ay/up/a19640529/linkwrap/126724/1d0ca0de913c4e50f85f2b29694e4e64.html from subvolume FastVol-client-0 to FastVol-client-1
[2015-12-14 21:46:53.436268] I [dht-rebalance.c:1010:dht_migrate_file] 0-FastVol-dht: /for_ybest_fsdir/user/ji/ay/up/a19640529/linkwrap/129836/0.jpg: attempting to move from FastVol-client-0 to FastVol-client-1
[2015-12-14 21:46:53.436597] I [dht-rebalance.c:1010:dht_migrate_file] 0-FastVol-dht: /for_ybest_fsdir/user/ji/ay/up/a19640529/linkwrap/129836/icon_loading_white22c04a.gif: attempting to move from FastVol-client-0 to FastVol-client-1
<EOF>
==============================================

My timezone is GMT+8:00, so the same moment in local time is 2015-12-15 05:46:53. Here is the corresponding part of /var/log/messages:

================== /var/log/messages ===============
Dec 15 05:46:53 d001 kernel: QThread invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
Dec 15 05:46:53 d001 kernel: QThread cpuset=/ mems_allowed=0
Dec 15 05:46:53 d001 kernel: Pid: 1656, comm: QThread Not tainted 2.6.32-431.23.3.el6.x86_64 #1
Dec 15 05:46:53 d001 kernel: Call Trace:
Dec 15 05:46:53 d001 kernel: [<ffffffff810d0431>] ? cpuset_print_task_mems_allowed+0x91/0xb0
Dec 15 05:46:53 d001 kernel: [<ffffffff81122810>] ? dump_header+0x90/0x1b0
Dec 15 05:46:53 d001 kernel: [<ffffffff8122833c>] ? security_real_capable_noaudit+0x3c/0x70
Dec 15 05:46:53 d001 kernel: [<ffffffff81122c92>] ? oom_kill_process+0x82/0x2a0
Dec 15 05:46:53 d001 kernel: [<ffffffff81122bd1>] ? select_bad_process+0xe1/0x120
Dec 15 05:46:53 d001 kernel: [<ffffffff811230d0>] ? out_of_memory+0x220/0x3c0
Dec 15 05:46:53 d001 kernel: [<ffffffff8112f9ef>] ? __alloc_pages_nodemask+0x89f/0x8d0
Dec 15 05:46:53 d001 kernel: [<ffffffff811678ea>] ? alloc_pages_current+0xaa/0x110
Dec 15 05:46:53 d001 kernel: [<ffffffff8111fc07>] ? __page_cache_alloc+0x87/0x90
Dec 15 05:46:53 d001 kernel: [<ffffffff8111f5ee>] ? find_get_page+0x1e/0xa0
Dec 15 05:46:53 d001 kernel: [<ffffffff81120ba7>] ? filemap_fault+0x1a7/0x500
Dec 15 05:46:53 d001 kernel: [<ffffffff81149ed4>] ? __do_fault+0x54/0x530
Dec 15 05:46:53 d001 kernel: [<ffffffff8114a4a7>] ? handle_pte_fault+0xf7/0xb00
Dec 15 05:46:53 d001 kernel: [<ffffffff81060aa3>] ? perf_event_task_sched_out+0x33/0x70
Dec 15 05:46:53 d001 kernel: [<ffffffff8114b0da>] ? handle_mm_fault+0x22a/0x300
Dec 15 05:46:53 d001 kernel: [<ffffffff8104a8d8>] ? __do_page_fault+0x138/0x480
Dec 15 05:46:53 d001 kernel: [<ffffffff810a011f>] ? hrtimer_try_to_cancel+0x3f/0xd0
Dec 15 05:46:53 d001 kernel: [<ffffffff810a01d2>] ? hrtimer_cancel+0x22/0x30
Dec 15 05:46:53 d001 kernel: [<ffffffff8152a7e3>] ? do_nanosleep+0x93/0xc0
Dec 15 05:46:53 d001 kernel: [<ffffffff810a02a4>] ? hrtimer_nanosleep+0xc4/0x180
Dec 15 05:46:53 d001 kernel: [<ffffffff8109f100>] ? hrtimer_wakeup+0x0/0x30
Dec 15 05:46:53 d001 kernel: [<ffffffff8152e74e>] ? do_page_fault+0x3e/0xa0
Dec 15 05:46:53 d001 kernel: [<ffffffff8152bb05>] ? page_fault+0x25/0x30
Dec 15 05:46:53 d001 kernel: Mem-Info:
Dec 15 05:46:53 d001 kernel: Node 0 DMA per-cpu:
Dec 15 05:46:53 d001 kernel: CPU 0: hi: 0, btch: 1 usd: 0
Dec 15 05:46:53 d001 kernel: CPU 1: hi: 0, btch: 1 usd: 0
Dec 15 05:46:53 d001 kernel: CPU 2: hi: 0, btch: 1 usd: 0
Dec 15 05:46:53 d001 kernel: CPU 3: hi: 0, btch: 1 usd: 0
Dec 15 05:46:53 d001 kernel: CPU 4: hi: 0, btch: 1 usd: 0
Dec 15 05:46:53 d001 kernel: CPU 5: hi: 0, btch: 1 usd: 0
Dec 15 05:46:53 d001 kernel: CPU 6: hi: 0, btch: 1 usd: 0
Dec 15 05:46:53 d001 kernel: CPU 7: hi: 0, btch: 1 usd: 0
Dec 15 05:46:53 d001 kernel: Node 0 DMA32 per-cpu:
Dec 15 05:46:53 d001 kernel: CPU 0: hi: 186, btch: 31 usd: 199
Dec 15 05:46:53 d001 kernel: CPU 1: hi: 186, btch: 31 usd: 58
Dec 15 05:46:53 d001 kernel: CPU 2: hi: 186, btch: 31 usd: 24
Dec 15 05:46:53 d001 kernel: CPU 3: hi: 186, btch: 31 usd: 75
Dec 15 05:46:53 d001 kernel: CPU 4: hi: 186, btch: 31 usd: 156
Dec 15 05:46:53 d001 kernel: CPU 5: hi: 186, btch: 31 usd: 60
Dec 15 05:46:53 d001 kernel: CPU 6: hi: 186, btch: 31 usd: 32
Dec 15 05:46:53 d001 kernel: CPU 7: hi: 186, btch: 31 usd: 7
Dec 15 05:46:53 d001 kernel: Node 0 Normal per-cpu:
Dec 15 05:46:53 d001 kernel: CPU 0: hi: 186, btch: 31 usd: 155
Dec 15 05:46:53 d001 kernel: CPU 1: hi: 186, btch: 31 usd: 0
Dec 15 05:46:53 d001 kernel: CPU 2: hi: 186, btch: 31 usd: 0
Dec 15 05:46:53 d001 kernel: CPU 3: hi: 186, btch: 31 usd: 40
Dec 15 05:46:53 d001 kernel: CPU 4: hi: 186, btch: 31 usd: 167
Dec 15 05:46:53 d001 kernel: CPU 5: hi: 186, btch: 31 usd: 0
Dec 15 05:46:53 d001 kernel: CPU 6: hi: 186, btch: 31 usd: 0
Dec 15 05:46:53 d001 kernel: CPU 7: hi: 186, btch: 31 usd: 2
Dec 15 05:46:53 d001 kernel: active_anon:1955292 inactive_anon:38 isolated_anon:0
Dec 15 05:46:53 d001 kernel: active_file:358 inactive_file:1728 isolated_file:0
Dec 15 05:46:53 d001 kernel: unevictable:0 dirty:14 writeback:0 unstable:0
Dec 15 05:46:53 d001 kernel: free:25714 slab_reclaimable:2453 slab_unreclaimable:7498
Dec 15 05:46:53 d001 kernel: mapped:39 shmem:43 pagetables:4664 bounce:0
Dec 15 05:46:53 d001 kernel: Node 0 DMA free:15752kB min:124kB low:152kB high:184kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15364kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Dec 15 05:46:53 d001 kernel: lowmem_reserve[]: 0 3000 8050 8050
Dec 15 05:46:53 d001 kernel: Node 0 DMA32 free:44860kB min:25140kB low:31424kB high:37708kB active_anon:2737816kB inactive_anon:0kB active_file:732kB inactive_file:5856kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3072096kB mlocked:0kB dirty:32kB writeback:0kB mapped:160kB shmem:0kB slab_reclaimable:1696kB slab_unreclaimable:1352kB kernel_stack:72kB pagetables:5460kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:2447 all_unreclaimable? yes
Dec 15 05:46:53 d001 kernel: lowmem_reserve[]: 0 0 5050 5050
Dec 15 05:46:53 d001 kernel: Node 0 Normal free:42244kB min:42316kB low:52892kB high:63472kB active_anon:5083352kB inactive_anon:152kB active_file:700kB inactive_file:1056kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:5171200kB mlocked:0kB dirty:24kB writeback:4kB mapped:0kB shmem:172kB slab_reclaimable:8116kB slab_unreclaimable:28640kB kernel_stack:2176kB pagetables:13196kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:2817 all_unreclaimable? yes
Dec 15 05:46:53 d001 kernel: lowmem_reserve[]: 0 0 0 0
Dec 15 05:46:53 d001 kernel: Node 0 DMA: 2*4kB 2*8kB 1*16kB 1*32kB 1*64kB 0*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15752kB
Dec 15 05:46:53 d001 kernel: Node 0 DMA32: 11186*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 44744kB
Dec 15 05:46:53 d001 kernel: Node 0 Normal: 10729*4kB 3*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 42940kB
Dec 15 05:46:53 d001 kernel: 2634 total pagecache pages
Dec 15 05:46:53 d001 kernel: 0 pages in swap cache
Dec 15 05:46:53 d001 kernel: Swap cache stats: add 0, delete 0, find 0/0
Dec 15 05:46:53 d001 kernel: Free swap = 0kB
Dec 15 05:46:53 d001 kernel: Total swap = 0kB
Dec 15 05:46:53 d001 kernel: 2097151 pages RAM
Dec 15 05:46:53 d001 kernel: 81926 pages reserved
Dec 15 05:46:53 d001 kernel: 1752 pages shared
Dec 15 05:46:53 d001 kernel: 1983480 pages non-shared
Dec 15 05:46:53 d001 kernel: [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
Dec 15 05:46:53 d001 kernel: [ 477] 0 477 2662 105 0 -17 -1000 udevd
Dec 15 05:46:53 d001 kernel: [ 916] 0 916 374 21 0 0 0 aliyun-service
Dec 15 05:46:53 d001 kernel: [ 1156] 0 1156 62798 149 0 0 0 rsyslogd
Dec 15 05:46:53 d001 kernel: [ 1178] 32 1178 4744 62 0 0 0 rpcbind
Dec 15 05:46:53 d001 kernel: [ 1198] 29 1198 5837 112 1 0 0 rpc.statd
Dec 15 05:46:53 d001 kernel: [ 1226] 0 1226 201835 4673 0 0 0 glusterd
Dec 15 05:46:53 d001 kernel: [ 1382] 28 1382 157544 115 0 0 0 nscd
Dec 15 05:46:53 d001 kernel: [ 1414] 0 1414 118751 673 1 0 0 AliYunDunUpdate
Dec 15 05:46:53 d001 kernel: [ 1448] 0 1448 16657 178 4 -17 -1000 sshd
Dec 15 05:46:53 d001 kernel: [ 1463] 38 1463 6683 151 0 0 0 ntpd
Dec 15 05:46:53 d001 kernel: [ 1473] 0 1473 29325 154 1 0 0 crond
Dec 15 05:46:53 d001 kernel: [ 1516] 0 1516 1016 19 1 0 0 mingetty
Dec 15 05:46:53 d001 kernel: [ 1518] 0 1518 1016 17 3 0 0 mingetty
Dec 15 05:46:53 d001 kernel: [ 1520] 0 1520 1016 18 5 0 0 mingetty
Dec 15 05:46:53 d001 kernel: [ 1522] 0 1522 2661 104 4 -17 -1000 udevd
Dec 15 05:46:53 d001 kernel: [ 1523] 0 1523 2661 104 4 -17 -1000 udevd
Dec 15 05:46:53 d001 kernel: [ 1524] 0 1524 1016 18 2 0 0 mingetty
Dec 15 05:46:53 d001 kernel: [ 1526] 0 1526 1016 19 4 0 0 mingetty
Dec 15 05:46:53 d001 kernel: [ 1528] 0 1528 1016 19 1 0 0 mingetty
Dec 15 05:46:53 d001 kernel: [ 1581] 0 1581 362551 34027 0 0 0 glusterfsd
Dec 15 05:46:53 d001 kernel: [ 1599] 0 1599 919605 653045 0 0 0 glusterfsd
Dec 15 05:46:53 d001 kernel: [ 1652] 0 1652 191799 1125 5 0 0 AliYunDun
Dec 15 05:46:53 d001 kernel: [ 1670] 0 1670 230065 974 0 0 0 AliHids
Dec 15 05:46:53 d001 kernel: [ 1800] 0 1800 1455594 1259392 0 0 0 glusterfs
Dec 15 05:46:53 d001 kernel: [ 2238] 0 2238 4872 55 1 0 0 anacron
Dec 15 05:46:53 d001 kernel: [ 2242] 0 2242 26524 43 1 0 0 run-parts
Dec 15 05:46:53 d001 kernel: [ 2413] 0 2413 26524 39 5 0 0 mlocate.cron
Dec 15 05:46:53 d001 kernel: [ 2414] 0 2414 26488 35 4 0 0 awk
Dec 15 05:46:53 d001 kernel: [ 2419] 0 2419 25272 117 2 0 0 updatedb
Dec 15 05:46:53 d001 kernel: Out of memory: Kill process 1800 (glusterfs) score 596 or sacrifice child
Dec 15 05:46:53 d001 kernel: Killed process 1800, UID 0, (glusterfs) total-vm:5822376kB, anon-rss:5037552kB, file-rss:16kB
==============================================

However, I can also find an OOM kill in /var/log/messages matching every previous rebalance failure. Is this a memory leak bug?

Thank you.
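(For reference, this is roughly how the OOM events can be matched against the failure times; a minimal sketch, where the default CentOS syslog path and Asia/Shanghai as the GMT+8 zone are assumptions:)

============== shell ==============
# list every OOM-killer event recorded by the kernel
grep -iE 'invoked oom-killer|Out of memory' /var/log/messages

# convert the UTC timestamp from the rebalance log to local (GMT+8) time
TZ='Asia/Shanghai' date -d '2015-12-14 21:46:53 UTC'
================================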
PuYun

From: PuYun
Date: 2015-12-15 21:08
To: gluster-users
Subject: Re: How to diagnose volume rebalance failure?

Hi Susant,

Thank you for your advice. I have set that parameter and started again in lazy mode. I think maybe it will not fail this time, because every time it failed, it happened after migration had started. Another important point is that a fix-layout rebalance, which does no data migration, was able to complete after running for 25 hours.

Also, I can't grep any errors in the rebalance log file using "grep ' E ' FastVol-rebalance.log".

Thank you.
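(For reference, roughly the commands for this step; a minimal sketch, where the default log directory /var/log/glusterfs is an assumption:)

============== shell ==============
# throttle rebalance down to a single migration thread, then restart it
gluster volume set FastVol rebal-throttle lazy
gluster volume rebalance FastVol start

# look for error and warning entries in the rebalance log
grep ' E ' /var/log/glusterfs/FastVol-rebalance.log
grep ' W ' /var/log/glusterfs/FastVol-rebalance.log
================================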
PuYun

From: Susant Palai
Date: 2015-12-15 15:21
To: PuYun
CC: gluster-users
Subject: Re: How to diagnose volume rebalance failure?

Hi PuYun,

We need to figure out some mechanism to get the huge log files. Until then, here is something I can think of that can affect performance. Rebalance normally starts at the medium throttle level [performance wise], which in your case means two migration threads that can hog those 2 cores. If you run rebalance again, run it in lazy mode. Here is the command:

"gluster v set <VOLUME-NAME> rebal-throttle lazy"

This should spawn just one thread for migration.

For logs: can you grep for errors in the rebalance log file and upload them? <till we figure out a method to get full logs>

Thanks,
Susant

----- Original Message -----
From: "PuYun" <cloudor@xxxxxxx>
To: "gluster-users" <gluster-users@xxxxxxxxxxx>
Sent: Tuesday, 15 December, 2015 5:51:00 AM
Subject: Re: How to diagnose volume rebalance failure?

Hi,

Another weird piece of information may be useful. The failed task had actually been running for hours, but the status command reported only 3621 seconds.

============== shell ==============
[root@d001 glusterfs]# gluster volume rebalance FastVol status
Node       Rebalanced-files        size     scanned    failures     skipped      status   run time in secs
---------       -----------  ----------  ----------  ----------  ----------  ----------  ----------------
localhost                 0      0Bytes      952767           0           0      failed           3621.00
volume rebalance: FastVol: success:
================================

As you can see, I started the rebalance task only once.

======== cmd_history.log-20151215 ======
[2015-12-14 12:50:41.443937] : volume start FastVol : SUCCESS
[2015-12-14 12:55:01.367519] : volume rebalance FastVol start : SUCCESS
[2015-12-14 13:55:22.132199] : volume rebalance FastVol status : SUCCESS
[2015-12-14 23:04:01.780885] : volume rebalance FastVol status : SUCCESS
[2015-12-14 23:35:56.708077] : volume rebalance FastVol status : SUCCESS
=================================

Because the task failed at [2015-12-14 21:46:54.179xx], something might have gone wrong 3621 seconds earlier, at [2015-12-14 20:46:33.179xx].
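(For reference, the 3621-second offset above can be computed directly with GNU date; a minimal sketch:)

============== shell ==============
# subtract the reported run time from the failure timestamp (UTC)
date -u -d '2015-12-14 21:46:54 UTC 3621 seconds ago' '+%Y-%m-%d %H:%M:%S'
# -> 2015-12-14 20:46:33
================================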
I checked the logs at that time and found nothing special.

========== FastVol-rebalance.log ========
[2015-12-14 20:46:33.166748] I [dht-rebalance.c:1010:dht_migrate_file] 0-FastVol-dht: /for_ybest_fsdir/user/Weixin.oClDcjjJ/rH/wV/mNv6sX94lypFWdvM/userPoint: attempting to move from FastVol-client-0 to FastVol-client-1
[2015-12-14 20:46:33.171009] I [MSGID: 109022] [dht-rebalance.c:1290:dht_migrate_file] 0-FastVol-dht: completed migration of /for_ybest_fsdir/user/Weixin.oClDcjjJ/t2/n1/VSXZlm65KjfhbgoM/flag_finished from subvolume FastVol-client-0 to FastVol-client-1
[2015-12-14 20:46:33.174851] I [dht-rebalance.c:1010:dht_migrate_file] 0-FastVol-dht: /for_ybest_fsdir/user/Weixin.oClDcjjJ/rH/wV/mNv6sX94lypFWdvM/portrait_origin.jpg: attempting to move from FastVol-client-0 to FastVol-client-1
[2015-12-14 20:46:33.181448] I [MSGID: 109022] [dht-rebalance.c:1290:dht_migrate_file] 0-FastVol-dht: completed migration of /for_ybest_fsdir/user/Weixin.oClDcjjJ/rH/wV/mNv6sX94lypFWdvM/userPoint from subvolume FastVol-client-0 to FastVol-client-1
[2015-12-14 20:46:33.184996] I [dht-rebalance.c:1010:dht_migrate_file] 0-FastVol-dht: /for_ybest_fsdir/user/Weixin.oClDcjjJ/rH/wV/mNv6sX94lypFWdvM/portrait_small.jpg: attempting to move from FastVol-client-0 to FastVol-client-1
[2015-12-14 20:46:33.191681] I [MSGID: 109022] [dht-rebalance.c:1290:dht_migrate_file] 0-FastVol-dht: completed migration of /for_ybest_fsdir/user/Weixin.oClDcjjJ/rH/wV/mNv6sX94lypFWdvM/portrait_origin.jpg from subvolume FastVol-client-0 to FastVol-client-1
[2015-12-14 20:46:33.195396] I [dht-rebalance.c:1010:dht_migrate_file] 0-FastVol-dht: /for_ybest_fsdir/user/Weixin.oClDcjjJ/rH/wV/mNv6sX94lypFWdvM/portrait_big_square.jpg: attempting to move from FastVol-client-0 to FastVol-client-1
==================================

And there are no logs around [2015-12-14 20:46:33.179xx] in mnt-b1-brick.log, mnt-c1-brick.log, etc-glusterfs-glusterd.vol.log and so on.

PuYun

From: PuYun
Date: 2015-12-15 07:30
To: gluster-users
Subject: Re: How to diagnose volume rebalance failure?

Hi,

Failed again.
I can see disconnections in the logs, but no more details.

=========== mnt-b1-brick.log ===========
[2015-12-14 21:46:54.179662] I [MSGID: 115036] [server.c:552:server_rpc_notify] 0-FastVol-server: disconnecting connection from d001-1799-2015/12/14-12:54:56:347561-FastVol-client-1-0-0
[2015-12-14 21:46:54.181764] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /
[2015-12-14 21:46:54.181815] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /for_ybest_fsdir
[2015-12-14 21:46:54.181856] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /for_ybest_fsdir/user
[2015-12-14 21:46:54.181918] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /for_ybest_fsdir/user/ji/ay/up/a19640529/linkwrap/129836/0.jpg
[2015-12-14 21:46:54.181961] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /for_ybest_fsdir/user/ji/ay/an
[2015-12-14 21:46:54.182003] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /for_ybest_fsdir/user/ji/ay/up/a19640529/linkwrap/129836/icon_loading_white22c04a.gif
[2015-12-14 21:46:54.182036] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /for_ybest_fsdir/user/ji
[2015-12-14 21:46:54.182076] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /for_ybest_fsdir/user/ji/ay
[2015-12-14 21:46:54.182110] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /for_ybest_fsdir/user/ji/ay/an/ling00
[2015-12-14 21:46:54.182203] I [MSGID: 101055] [client_t.c:419:gf_client_unref] 0-FastVol-server: Shutting down connection d001-1799-2015/12/14-12:54:56:347561-FastVol-client-1-0-0
==================================

=========== mnt-c1-brick.log ===========
[2015-12-14 21:46:54.179597] I [MSGID: 115036] [server.c:552:server_rpc_notify] 0-FastVol-server: disconnecting connection from d001-1799-2015/12/14-12:54:56:347561-FastVol-client-0-0-0
[2015-12-14 21:46:54.180428] W [inodelk.c:404:pl_inodelk_log_cleanup] 0-FastVol-server: releasing lock on 5e300cdb-7298-44c0-90eb-5b50018daed6 held by {client=0x7effc810cce0, pid=-3 lk-owner=fdffffff}
[2015-12-14 21:46:54.180454] W [inodelk.c:404:pl_inodelk_log_cleanup] 0-FastVol-server: releasing lock on 3c9a1cd5-84c8-4967-98d5-e75a402b1f74 held by {client=0x7effc810cce0, pid=-3 lk-owner=fdffffff}
[2015-12-14 21:46:54.180483] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /
[2015-12-14 21:46:54.180525] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /for_ybest_fsdir
[2015-12-14 21:46:54.180570] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /for_ybest_fsdir/user
[2015-12-14 21:46:54.180604] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /for_ybest_fsdir/user/ji/ay/up/a19640529/linkwrap/129836/0.jpg
[2015-12-14 21:46:54.180634] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /for_ybest_fsdir/user/ji
[2015-12-14 21:46:54.180678] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /for_ybest_fsdir/user/ji/ay
[2015-12-14 21:46:54.180725] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /for_ybest_fsdir/user/ji/ay/an/ling00
[2015-12-14 21:46:54.180779] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /for_ybest_fsdir/user/ji/ay/up/a19640529/linkwrap/129836/icon_loading_white22c04a.gif
[2015-12-14 21:46:54.180820] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /for_ybest_fsdir/user/ji/ay/an
[2015-12-14 21:46:54.180859] I [MSGID: 101055] [client_t.c:419:gf_client_unref] 0-FastVol-server: Shutting down connection d001-1799-2015/12/14-12:54:56:347561-FastVol-client-0-0-0
==================================

========== etc-glusterfs-glusterd.vol.log ==========
[2015-12-14 21:46:54.179819] W [socket.c:588:__socket_rwv] 0-management: readv on /var/run/gluster/gluster-rebalance-dbee250a-e3fe-4448-b905-b76c5ba80b25.sock failed (No data available)
[2015-12-14 21:46:54.209586] I [MSGID: 106007] [glusterd-rebalance.c:162:__glusterd_defrag_notify] 0-management: Rebalance process for volume FastVol has disconnected.
[2015-12-14 21:46:54.209627] I [MSGID: 101053] [mem-pool.c:616:mem_pool_destroy] 0-management: size=588 max=1 total=1
[2015-12-14 21:46:54.209640] I [MSGID: 101053] [mem-pool.c:616:mem_pool_destroy] 0-management: size=124 max=1 total=1
==================================

============ FastVol-rebalance.log ============
...
[2015-12-14 21:46:53.423719] I [MSGID: 109022] [dht-rebalance.c:1290:dht_migrate_file] 0-FastVol-dht: completed migration of /for_ybest_fsdir/user/ji/ay/up/a19640529/linkwrap/129836/07.jpg from subvolume FastVol-client-0 to FastVol-client-1
[2015-12-14 21:46:53.423976] I [MSGID: 109022] [dht-rebalance.c:1290:dht_migrate_file] 0-FastVol-dht: completed migration of /for_ybest_fsdir/user/ji/ay/up/a19640529/linkwrap/126724/1d0ca0de913c4e50f85f2b29694e4e64.html from subvolume FastVol-client-0 to FastVol-client-1
[2015-12-14 21:46:53.436268] I [dht-rebalance.c:1010:dht_migrate_file] 0-FastVol-dht: /for_ybest_fsdir/user/ji/ay/up/a19640529/linkwrap/129836/0.jpg: attempting to move from FastVol-client-0 to FastVol-client-1
[2015-12-14 21:46:53.436597] I [dht-rebalance.c:1010:dht_migrate_file] 0-FastVol-dht: /for_ybest_fsdir/user/ji/ay/up/a19640529/linkwrap/129836/icon_loading_white22c04a.gif: attempting to move from FastVol-client-0 to FastVol-client-1
<EOF>
==============================================

PuYun

From: PuYun
Date: 2015-12-14 21:51
To: gluster-users
Subject: Re: How to diagnose volume rebalance failure?

Hi,

Thank you for your reply. I don't know how to send you the rebalance log file, which is huge (about 2 GB).

However, I might have found out why the task failed. My gluster server has only 2 CPU cores and carries 2 SSD bricks. When the rebalance task began, the top 3 processes were at 70%~80%, 30%~40% and 30%~40% CPU usage, and everything else was below 1%. But after a while both CPU cores were completely used up, and I couldn't even log in until the rebalance task failed. It seems 2 bricks need at least 4 CPU cores. I have now upgraded the virtual server to 8 CPU cores and started the rebalance task again.
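(For reference, a minimal way to watch per-process CPU usage while the rebalance runs, assuming the standard procps tools on CentOS 6:)

============== shell ==============
# snapshot the busiest processes; the rebalance process shows up as "glusterfs"
ps aux --sort=-%cpu | head -5
================================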
Everything goes well for now. I will report again when the current task completes or fails.

PuYun

From: Nithya Balachandran
Date: 2015-12-14 18:57
To: PuYun
CC: gluster-users
Subject: Re: How to diagnose volume rebalance failure?

Hi,

Can you send us the rebalance log?

Regards,
Nithya

----- Original Message -----
> From: "PuYun" <cloudor@xxxxxxx>
> To: "gluster-users" <gluster-users@xxxxxxxxxxx>
> Sent: Monday, December 14, 2015 11:33:40 AM
> Subject: Re: How to diagnose volume rebalance failure?
>
> Here is the tail of the failed rebalance log, any clue?
>
> [2015-12-13 21:30:31.527493] I [dht-rebalance.c:2340:gf_defrag_process_dir] 0-FastVol-dht: Migration operation on dir /for_ybest_fsdir/user/Weixin.oClDcjhe/Ny/5F/1MsH5--BcoGRAJPI took 20.95 secs
> [2015-12-13 21:30:31.528704] I [dht-rebalance.c:1010:dht_migrate_file] 0-FastVol-dht: /for_ybest_fsdir/user/Weixin.oClDcjhe/Kn/hM/oHcPMp4hKq5Tq2ZQ/flag_finished: attempting to move from FastVol-client-0 to FastVol-client-1
> [2015-12-13 21:30:31.543901] I [dht-rebalance.c:1010:dht_migrate_file] 0-FastVol-dht: /for_ybest_fsdir/user/Weixin.oClDcjhe/PU/ps/qUa-n38i8QBgeMdI/userPoint: attempting to move from FastVol-client-0 to FastVol-client-1
> [2015-12-13 21:31:37.210496] I [MSGID: 109081] [dht-common.c:3780:dht_setxattr] 0-FastVol-dht: fixing the layout of /for_ybest_fsdir/user/Weixin.oClDcjhe/Ny/7Q
> [2015-12-13 21:31:37.722825] I [MSGID: 109045] [dht-selfheal.c:1508:dht_fix_layout_of_directory] 0-FastVol-dht: subvolume 0 (FastVol-client-0): 1032124 chunks
> [2015-12-13 21:31:37.722837] I [MSGID: 109045] [dht-selfheal.c:1508:dht_fix_layout_of_directory] 0-FastVol-dht: subvolume 1 (FastVol-client-1): 1032124 chunks
> [2015-12-13 21:33:03.955539] I [MSGID: 109064] [dht-layout.c:808:dht_layout_dir_mismatch] 0-FastVol-dht: subvol: FastVol-client-0; inode layout - 0 - 2146817919 - 1; disk layout - 2146817920 - 4294967295 - 1
> [2015-12-13 21:33:04.069859] I [MSGID: 109018] [dht-common.c:806:dht_revalidate_cbk] 0-FastVol-dht: Mismatching layouts for /for_ybest_fsdir/user/Weixin.oClDcjhe/Ny/7Q, gfid = f38c4ed2-a26a-4d83-adfd-6b0331831738
> [2015-12-13 21:33:04.118800] I [MSGID: 109064] [dht-layout.c:808:dht_layout_dir_mismatch] 0-FastVol-dht: subvol: FastVol-client-1; inode layout - 2146817920 - 4294967295 - 1; disk layout - 0 - 2146817919 - 1
> [2015-12-13 21:33:19.979507] I [MSGID: 109022] [dht-rebalance.c:1290:dht_migrate_file] 0-FastVol-dht: completed migration of /for_ybest_fsdir/user/Weixin.oClDcjhe/Kn/hM/oHcPMp4hKq5Tq2ZQ/flag_finished from subvolume FastVol-client-0 to FastVol-client-1
> [2015-12-13 21:33:19.979459] I [MSGID: 109022] [dht-rebalance.c:1290:dht_migrate_file] 0-FastVol-dht: completed migration of /for_ybest_fsdir/user/Weixin.oClDcjhe/PU/ps/qUa-n38i8QBgeMdI/userPoint from subvolume FastVol-client-0 to FastVol-client-1
> [2015-12-13 21:33:25.543941] I [dht-rebalance.c:1010:dht_migrate_file] 0-FastVol-dht: /for_ybest_fsdir/user/Weixin.oClDcjhe/PU/ps/qUa-n38i8QBgeMdI/portrait_origin.jpg: attempting to move from FastVol-client-0 to FastVol-client-1
> [2015-12-13 21:33:25.962547] I [dht-rebalance.c:1010:dht_migrate_file] 0-FastVol-dht: /for_ybest_fsdir/user/Weixin.oClDcjhe/PU/ps/qUa-n38i8QBgeMdI/portrait_small.jpg: attempting to move from FastVol-client-0 to FastVol-client-1
>
> Cloudor
>
> From: Sakshi Bansal
> Date: 2015-12-12 13:02
> To: 蒲云
> CC: gluster-users
> Subject: Re: How to diagnose volume rebalance failure?
> In the rebalance log file you can check the file/directory for which the rebalance has failed.
> It can mention what was the fop for which the failure happened.
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users@xxxxxxxxxxx
> http://www.gluster.org/mailman/listinfo/gluster-users

_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-users