On November 20, 2014 10:01:45 PM PST, Anuradha Talur <atalur@xxxxxxxxxx> wrote:
>
>
>----- Original Message -----
>> From: "Vince Loschiavo" <vloschiavo@xxxxxxxxx>
>> To: "gluster-users@xxxxxxxxxxx" <Gluster-users@xxxxxxxxxxx>
>> Sent: Wednesday, November 19, 2014 9:50:50 PM
>> Subject: v3.6.1 vs v3.5.2 self heal - help (Nagios related)
>>
>>
>> Hello Gluster Community,
>>
>> I have been using the Nagios monitoring scripts, mentioned in the below
>> thread, on 3.5.2 with great success. The most useful of these is the self
>> heal.
>>
>> However, I've just upgraded to 3.6.1 on the lab and the self heal daemon has
>> become quite aggressive. I continually get alerts/warnings on 3.6.1 that
>> virt disk images need self heal, then they clear. This is not the case on
>> 3.5.2.
>>
>> Configuration:
>> 2 node, 2 brick replicated volume with 2x1GB LAG network between the peers
>> using this volume as a QEMU/KVM virt image store through the FUSE mount on
>> CentOS 6.5.
>>
>> Example:
>> on 3.5.2:
>> gluster volume heal volumename info: shows the bricks and number of entries
>> to be healed: 0
>>
>> On v3.5.2 - During normal gluster operations, I can run this command over and
>> over again, 2-4 times per second, and it will always show 0 entries to be
>> healed. I've used this as an indicator that the bricks are synchronized.
>>
>> Last night, I upgraded to 3.6.1 in lab and I'm seeing different behavior.
>> Running gluster volume heal volumename info, during normal operations, will
>> show a file out-of-sync, seemingly between every block written to disk then
>> synced to the peer. I can run the command over and over again, 2-4 times per
>> second, and it will almost always show something out of sync.
>> The individual files change, meaning:
>>
>> Example:
>> 1st run: shows file 1 out of sync
>> 2nd run: shows file 2 and file 3 out of sync but file 1 is now in sync (not
>> in the list)
>> 3rd run: shows file 3 and file 4 out of sync but file 1 and 2 are in sync
>> (not in the list).
>> ...
>> nth run: shows 0 files out of sync
>> nth+1 run: shows file 3 and 12 out of sync.
>>
>> From looking at the virtual machines running off this gluster volume, it's
>> obvious that gluster is working well. However, this obviously plays havoc
>> with Nagios and alerts. Nagios will run the heal info and get different and
>> non-useful results each time, and will send alerts.
>>
>> Is this behavior change (3.5.2 vs 3.6.1) expected? Is there a way to tune the
>> settings or change the monitoring method to get better results into Nagios?
>>
>In 3.6.1 the way the heal info command works is different from that in
>3.5.2. In 3.6.1, it is the self-heal daemon that gathers the entries that
>might need healing. Currently, in 3.6.1, there isn't a method to
>distinguish between a file that is being healed and a file with
>on-going I/O while listing. Hence you also see files under normal
>operation listed in the output of the heal info command.

How did that regression pass?!
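
Until heal info can tell the two cases apart, one possible workaround on the Nagios side is to sample the command several times and only alert on entries that show up in every sample, since the transient ones described above drop out of the list between runs. What follows is a rough sketch, not a tested plugin. It assumes the output has "Brick <host>:<path>" lines followed by one entry per line (a path starting with "/" or a "<gfid:...>" token) and a "Number of entries: N" line. The names parse_heal_entries, persistent_entries and check_heal, and the runs/interval defaults, are made up for illustration.

```python
import subprocess
import time


def parse_heal_entries(output):
    """Return the set of (brick, entry) pairs found in one heal-info output.

    ASSUMPTION: each brick section starts with "Brick <host>:<path>" and
    entries are lines starting with "/" or "<gfid:". Anything else
    (e.g. "Number of entries: N") is ignored.
    """
    entries = set()
    brick = None
    for raw in output.splitlines():
        line = raw.strip()
        if line.startswith("Brick "):
            brick = line[len("Brick "):]
        elif brick is not None and line.startswith(("/", "<gfid:")):
            entries.add((brick, line))
    return entries


def persistent_entries(samples):
    """Entries present in every sample, i.e. never observed to clear."""
    sets = [parse_heal_entries(s) for s in samples]
    if not sets:
        return set()
    return set.intersection(*sets)


def check_heal(volume, runs=5, interval=2.0):
    """Hypothetical Nagios-style check returning (exit_code, message)."""
    samples = []
    for i in range(runs):
        result = subprocess.run(
            ["gluster", "volume", "heal", volume, "info"],
            capture_output=True, text=True, check=True,
        )
        samples.append(result.stdout)
        if i < runs - 1:
            time.sleep(interval)
    stuck = persistent_entries(samples)
    if stuck:
        return 1, "WARNING: %d entries listed in all %d samples" % (len(stuck), runs)
    return 0, "OK: no entry stayed listed across %d samples" % runs
```

Exit code 1 maps to WARNING under the usual Nagios plugin convention. One caveat: an image with constant I/O could still be listed in every sample, so runs and interval would need tuning against the actual workload.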