Hi Carlos,

Thanks for coming back to me. In response to your queries:

The PIDs are low: 1153 for glusterd, 1168 for glusterfsd, and 1318 and 1319 for the two glusterfs processes. So I'd agree, it doesn't seem that glusterd is crashing and being restarted.

As of today, Monday morning, top is reporting 1398 glusterd zombie processes.

I have this problem on all 4 of my gluster nodes, and all four are being monitored by the attached nagios plugin.

In terms of testing, I've prevented nagios from running the attached check script and restarted the glusterd process using "service glusterd restart". I've let it run for a few hours and haven't yet seen any zombie processes created. I think this is good as, for whatever reason, it appears to point at the nagios check script being the problem.

My next check was to run the nagios check once to see if it created a zombie process. It did, so I started looking at the script. I forced the script to exit after the first command, "gluster volume heal audio info", and no zombie process was created. That pointed me to the second command, which takes the form below. I'm no expert on HERE documents in shell, but I think it may be causing the issue:

    while read -r line; do
        field=($(echo $line))
        case ${field[0]} in
        Brick)
            brick=${field[@]:2}
            ;;
        Disk)
            key=${field[@]:0:3}
            if [ "${key}" = "Disk Space Free" ]; then
                freeunit=${field[@]:4}
                unit=${freeunit: -2}
                free=${freeunit%$unit}
                if [ "$unit" != "GB" ]; then
                    Exit UNKNOWN "Unknown disk space size $freeunit\n"
                fi
                if (( $(bc <<< "${free} < ${freegb}") == 1 )); then
                    freegb=$free
                fi
            fi
            ;;
        Online)
            online=${field[@]:2}
            if [ "${online}" = "Y" ]; then
                let $((bricksfound++))
            else
                errors=("${errors[@]}" "$brick offline")
            fi
            ;;
        esac
    done < <( sudo gluster volume status ${VOLUME} detail)

Can anyone spot why this would be an issue?

Thanks,
Steve

From: Carlos Capriotti [mailto:capriotti.carlos@xxxxxxxxx]

ok, let's see if we can gather more info. I am not a specialist, but you know... another pair of eyes.
My system has a single glusterd process and it has a pretty low PID, meaning it has not crashed. What is the PID of your glusterd? How many zombie processes does top report?

I've been running my preliminary tests with gluster for a little over a month now and have never seen this. My platform is CentOS 6.5, so I'd say it is pretty similar.

From my perspective, even making gluster sweat, running some intense rsync jobs in parallel, and seeing glusterd AND glusterfs take 120% of processing time in top (each on one core), they never crashed. My zombie count, from top, is zero.

On the other hand, the other day one of my nodes was crashing a process every time I started a demanding task. It turned out I had (and still have) a hardware problem on one of the processors (or the main board; still undiagnosed).

Do you have this problem on one node only? Any chance you have something special compiled into your kernel? Any particularly memory-hungry tweak in your sysctl?

Sounds like the system, not gluster.

KR,
Carlos

On Fri, Mar 21, 2014 at 10:29 PM, Steve Thomas <sthomas@xxxxxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
Attachment: check_glusterfs.sh
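For anyone wanting to repeat the "run it once and see if a zombie appears" test described above, here is a minimal sketch that counts zombies before and after a single command. The helper names count_zombies and zombie_delta are made up for illustration and are not part of check_glusterfs.sh. It assumes a procps-style ps and awk are available, and other activity on the host can skew the number, so treat the result as a hint rather than proof.

```shell
#!/bin/bash
# Sketch only: count zombie processes before and after running a command.

# Count input lines whose process state starts with Z (zombie).
# Expects "STAT PPID PID COMMAND" lines on stdin.
count_zombies() {
    awk '$1 ~ /^Z/ { n++ } END { print n + 0 }'
}

# Run the given command and print how many more zombies exist afterwards.
# Hypothetical helper; unrelated processes may change the count too.
zombie_delta() {
    local before after
    before=$(ps -e -o stat= -o ppid= -o pid= -o comm= | count_zombies)
    "$@" >/dev/null 2>&1
    after=$(ps -e -o stat= -o ppid= -o pid= -o comm= | count_zombies)
    echo $(( after - before ))
}

# Example, using the volume name "audio" from the message above:
#   zombie_delta sudo gluster volume status audio detail
#
# To see which parent PID owns the zombies (for comparison with the
# glusterd PID), group them by PPID:
#   ps -e -o stat= -o ppid= | awk '$1 ~ /^Z/ { print $2 }' | sort | uniq -c
```

A zombie is a child that has exited but has not yet been waited on by its parent, so the PPID grouping shows which process is failing to reap its children. Note also that the `< <( ... )` construct at the end of the loop is bash process substitution, not a here-document.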
_______________________________________________ Gluster-users mailing list Gluster-users@xxxxxxxxxxx http://supercolony.gluster.org/mailman/listinfo/gluster-users