Re: How to diagnose volume rebalance failure?

PuYun <cloudor@xxxxxxx> · Fri, 18 Dec 2015 09:46:40 +0800

Hi Susant,

You are right, the rebalance process itself is running normally now. However, the memory used by the brick that is being written to keeps growing during the rebalance. The current task has been running for 16 hours; here is the top output.

===================== top ===========================
top - 08:58:27 up 3 days, 12:08,  1 user,  load average: 1.33, 1.18, 1.21
Tasks: 173 total,   1 running, 172 sleeping,   0 stopped,   0 zombie
Cpu(s): 13.0%us, 16.9%sy,  0.0%ni, 65.7%id,  2.7%wa,  0.0%hi,  1.8%si,  0.0%st
Mem:   8060900k total,  7923204k used,   137696k free,  4528380k buffers
Swap:        0k total,        0k used,        0k free,   393444k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 8555 root      20   0  950m 143m 1728 S 154.7  1.8 875:01.07 glusterfs
 8479 root      20   0 1284m 139m 1892 S 69.8  1.8 443:25.88 glusterfsd
 8497 root      20   0 2628m 1.8g 1892 S 68.2 23.0 485:31.42 glusterfsd
  874 root      20   0     0    0    0 S  2.3  0.0  65:34.68 jbd2/vdb1-8
   58 root      20   0     0    0    0 S  0.7  0.0  44:44.37 kblockd/0
   99 root      20   0     0    0    0 S  0.7  0.0  39:17.63 kswapd0
   39 root      20   0     0    0    0 S  0.3  0.0   0:16.90 events/4
=====================================================
As you can see, PID 8497 is using 1.8g of memory now.

I have taken some state dumps. The later dumps are much bigger than the first one.
================ ls -lh /var/run/gluster/*dump* ================
-rw------- 1 root root 4.1M Dec 17 17:52 mnt-b1-brick.8497.dump.1450345948
-rw------- 1 root root 292M Dec 18 09:08 mnt-b1-brick.8497.dump.1450400909
-rw------- 1 root root 297M Dec 18 09:15 mnt-b1-brick.8497.dump.1450401273
=====================================================
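
One way to compare two of these dumps is sketched below. It assumes the dumps contain memory accounting sections whose header ends in "memusage" followed by a "size=<bytes>" line; that layout is an assumption about the state dump format and should be checked against the actual files. The function names parse_memusage and top_growth are made up for this illustration.

```python
import re

# Assumed header shape: "[<translator> - usage-type <type> memusage]"
SECTION_RE = re.compile(r"^\[(.+ memusage)\]$")


def parse_memusage(dump_text):
    """Map each '[... memusage]' section name to its 'size=' value in bytes."""
    sizes = {}
    section = None
    for line in dump_text.splitlines():
        line = line.strip()
        if line.startswith("["):
            # Any new header ends the previous section; keep only memusage ones.
            match = SECTION_RE.match(line)
            section = match.group(1) if match else None
        elif section is not None and line.startswith("size="):
            sizes[section] = int(line.split("=", 1)[1])
    return sizes


def top_growth(old_text, new_text, limit=10):
    """Return the sections whose size grew the most between two dumps."""
    old = parse_memusage(old_text)
    new = parse_memusage(new_text)
    growth = {name: size - old.get(name, 0) for name, size in new.items()}
    return sorted(growth.items(), key=lambda item: item[1], reverse=True)[:limit]
```

Reading an early and a late dump into strings and passing them to top_growth() would list the allocation types that grew the most, which may help narrow down where the memory is going.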

You can download these state dumps (gzipped) from this URL:
http://pan.baidu.com/s/1jHuZCMU

PuYun

From: Susant Palai
Date: 2015-12-17 20:23
To: PuYun
CC: gluster-users
Subject: Re:  How to diagnose volume rebalance failure?
OK, from your reply the rebalance seems to be fine.
So what you can do is check whether the memory usage of the brick process keeps increasing. If it does, take several state dumps at intervals.
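
A minimal sketch of that check, assuming a Linux host where /proc/<pid>/status exposes a VmRSS line. The helper names (parse_rss_kb, read_rss_kb, keeps_increasing, watch) and the default interval and sample count are made up for this illustration.

```python
import re
import time


def parse_rss_kb(status_text):
    """Return the VmRSS value in kB from the text of /proc/<pid>/status."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    if match is None:
        raise ValueError("VmRSS not found in status text")
    return int(match.group(1))


def read_rss_kb(pid):
    """Read the current resident set size of a process, in kB."""
    with open(f"/proc/{pid}/status") as status_file:
        return parse_rss_kb(status_file.read())


def keeps_increasing(samples, min_growth_kb=0):
    """True if every sample exceeds the previous one by more than min_growth_kb."""
    if len(samples) < 2:
        return False
    return all(b - a > min_growth_kb for a, b in zip(samples, samples[1:]))


def watch(pid, interval_s=600, count=6):
    """Collect `count` RSS samples, sleeping interval_s seconds between them."""
    samples = []
    for i in range(count):
        samples.append(read_rss_kb(pid))
        if i < count - 1:
            time.sleep(interval_s)
    return samples
```

If keeps_increasing(watch(pid)) returns True for the brick's glusterfsd PID, that is the point at which to take a state dump, for example with `gluster volume statedump <VOLNAME>`, and to repeat it after some more growth.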

Regards,
Susant 

----- Original Message -----
From: "PuYun" <cloudor@xxxxxxx>
To: "gluster-users" <gluster-users@xxxxxxxxxxx>
Cc: "gluster-users" <gluster-users@xxxxxxxxxxx>
Sent: Thursday, 17 December, 2015 3:57:12 PM
Subject: Re:  How to diagnose volume rebalance failure?

Hi Susant, 

Thank you for your instructions. I'll do that. 

My volume contains more than 2 million leaf subdirectories. Most of them contain 10 to 30 small files. The current total size is about 900G. There are two bricks, each 1T. The machine has 8G of RAM.

Previously I saw 3 processes: one glusterfs for the rebalance and 2 glusterfsd for the bricks. Only 1 glusterfsd used a very large amount of memory, and it belongs to the newly added brick. The other 2 processes seemed normal. If that happens again, I will send you the state dump.

Thank you. 

PuYun 
