----- Original Message -----
> From: "Joe Julian" <joe@xxxxxxxxxxxxxxxx>
> To: "Anuradha Talur" <atalur@xxxxxxxxxx>, "Vince Loschiavo" <vloschiavo@xxxxxxxxx>
> Cc: "gluster-users@xxxxxxxxxxx" <Gluster-users@xxxxxxxxxxx>
> Sent: Friday, November 21, 2014 12:06:27 PM
> Subject: Re: v3.6.1 vs v3.5.2 self heal - help (Nagios related)
>
> On November 20, 2014 10:01:45 PM PST, Anuradha Talur <atalur@xxxxxxxxxx> wrote:
> >
> >----- Original Message -----
> >> From: "Vince Loschiavo" <vloschiavo@xxxxxxxxx>
> >> To: "gluster-users@xxxxxxxxxxx" <Gluster-users@xxxxxxxxxxx>
> >> Sent: Wednesday, November 19, 2014 9:50:50 PM
> >> Subject: v3.6.1 vs v3.5.2 self heal - help (Nagios related)
> >>
> >> Hello Gluster Community,
> >>
> >> I have been using the Nagios monitoring scripts mentioned in the
> >> thread below on 3.5.2 with great success. The most useful of these
> >> is the self-heal check.
> >>
> >> However, I've just upgraded to 3.6.1 in the lab, and the self-heal
> >> daemon has become quite aggressive. On 3.6.1 I continually get
> >> alerts/warnings that virt disk images need self-heal, which then
> >> clear. This is not the case on 3.5.2.
> >>
> >> Configuration:
> >> A 2-node, 2-brick replicated volume with a 2x1Gb LAG network between
> >> the peers, used as a QEMU/KVM virt image store through the FUSE
> >> mount on CentOS 6.5.
> >>
> >> Example:
> >> On 3.5.2:
> >> "gluster volume heal volumename info" shows the bricks and the
> >> number of entries to be healed: 0
> >>
> >> On v3.5.2, during normal gluster operations, I can run this command
> >> over and over again, 2-4 times per second, and it will always show
> >> 0 entries to be healed. I've used this as an indicator that the
> >> bricks are synchronized.
> >>
> >> Last night I upgraded to 3.6.1 in the lab and I'm seeing different
> >> behavior. Running "gluster volume heal volumename info" during
> >> normal operations will show a file out of sync, seemingly between
> >> every block written to disk and then synced to the peer. I can run
> >> the command over and over again, 2-4 times per second, and it will
> >> almost always show something out of sync. The individual files
> >> change, meaning:
> >>
> >> Example:
> >> 1st run: shows file1 out of sync
> >> 2nd run: shows file2 and file3 out of sync, but file1 is now in
> >> sync (not in the list)
> >> 3rd run: shows file3 and file4 out of sync, but file1 and file2 are
> >> in sync (not in the list)
> >> ...
> >> nth run: shows 0 files out of sync
> >> nth+1 run: shows file3 and file12 out of sync
> >>
> >> From looking at the virtual machines running off this gluster
> >> volume, it's obvious that gluster is working well. However, this
> >> plays havoc with Nagios alerting: Nagios runs heal info, gets
> >> different and non-useful results each time, and sends alerts.
> >>
> >> Is this behavior change (3.5.2 vs 3.6.1) expected? Is there a way
> >> to tune the settings or change the monitoring method to get better
> >> results into Nagios?
> >>
> >In 3.6.1, the heal info command works differently from the way it did
> >in 3.5.2: it is now the self-heal daemon that gathers the entries
> >that might need healing. Currently, in 3.6.1, there is no way to
> >distinguish, while listing, between a file that is being healed and a
> >file with ongoing I/O. Hence files under normal operation are also
> >listed in the output of the heal info command.
>
> How did that regression pass?!
Test cases to check this condition were not written in the regression
tests.

--
Thanks,
Anuradha.

_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://supercolony.gluster.org/mailman/listinfo/gluster-users
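A possible interim workaround on the monitoring side: entries produced
by in-flight I/O churn from one "heal info" run to the next, while
files that genuinely need healing stay listed, so a check can sample
the command several times and alert only on entries present in every
sample. The following is a minimal, untested sketch of such a wrapper;
the volume name, sample count, interval, and the grep patterns used to
strip header lines from the heal info output are all assumptions to
adapt to your environment:

#!/bin/bash
# Sketch of a Nagios-style check for GlusterFS 3.6.1: sample
# "gluster volume heal <volume> info" repeatedly and report only the
# entries that appear in every sample. Transient entries (assumed to
# be in-flight I/O, per the behavior described above) are ignored.

VOLUME="${1:-volumename}"   # volume name is illustrative
SAMPLES=5                   # number of consecutive samples
INTERVAL=2                  # seconds between samples

persistent=""
for i in $(seq 1 "$SAMPLES"); do
    # Keep only entry lines (paths/gfids); these header patterns are
    # assumptions about the heal info output format -- verify locally.
    current=$(gluster volume heal "$VOLUME" info 2>/dev/null \
        | grep -v -e '^Brick' -e '^Number of entries' -e '^Status' -e '^$' \
        | sort -u)

    if [ "$i" -eq 1 ]; then
        persistent="$current"
    else
        # Intersect with the earlier samples: keep only entries listed
        # in every run so far (comm requires sorted input).
        persistent=$(comm -12 <(echo "$persistent") <(echo "$current"))
    fi

    # Nothing persists -> nothing to alert on; stop sampling early.
    [ -z "$persistent" ] && break
    sleep "$INTERVAL"
done

if [ -n "$persistent" ]; then
    echo "WARNING: entries still pending heal on $VOLUME after $SAMPLES samples:"
    echo "$persistent"
    exit 1   # Nagios WARNING
fi
echo "OK: no persistently unhealed entries on $VOLUME"
exit 0

Hooked into NRPE as, say, "check_gluster_heal.sh volumename", the
intersection across samples filters out the transient entries, while a
file that actually sits in the heal queue keeps appearing in every
sample and still raises the alert.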