Thank you for that information.
Are there plans to restore the previous functionality in a later release of 3.6.x? Or is this what we should expect going forward?
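
In the meantime, the workaround I'm considering on the monitoring side is to
sample the heal info output a few times and only alert on entries that stay
listed across every sample, so files that only appear because of in-flight
I/O are ignored. A rough, untested sketch (the volume name, sample count, and
interval below are just placeholders, and it assumes heal info prints entries
under each "Brick ..." header followed by a "Number of entries:" line):

#!/usr/bin/env python
# Debounced Nagios check sketch: only warn about heal entries that persist
# across several consecutive "gluster volume heal <vol> info" samples.
import subprocess
import sys
import time

VOLUME = "volumename"   # placeholder volume name
SAMPLES = 3             # number of heal info samples to take
INTERVAL = 5            # seconds between samples

def heal_entries(volume):
    """Return the set of entries currently listed by heal info."""
    out = subprocess.check_output(
        ["gluster", "volume", "heal", volume, "info"])
    entries = set()
    for line in out.decode("utf-8", "replace").splitlines():
        line = line.strip()
        # Skip blank lines, brick headers and the per-brick counters;
        # anything else is treated as a file path or gfid entry.
        if (not line or line.startswith("Brick ")
                or line.startswith("Number of entries")):
            continue
        entries.add(line)
    return entries

def main():
    persistent = heal_entries(VOLUME)
    for _ in range(SAMPLES - 1):
        time.sleep(INTERVAL)
        persistent &= heal_entries(VOLUME)
    if persistent:
        print("WARNING: %d entries pending heal on %s"
              % (len(persistent), VOLUME))
        sys.exit(1)   # Nagios WARNING
    print("OK: no persistent heal entries on %s" % VOLUME)
    sys.exit(0)

if __name__ == "__main__":
    main()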
On Thu, Nov 20, 2014 at 11:24 PM, Anuradha Talur <atalur@xxxxxxxxxx> wrote:
----- Original Message -----
> From: "Joe Julian" <joe@xxxxxxxxxxxxxxxx>
> To: "Anuradha Talur" <atalur@xxxxxxxxxx>, "Vince Loschiavo" <vloschiavo@xxxxxxxxx>
> Cc: "gluster-users@xxxxxxxxxxx" <Gluster-users@xxxxxxxxxxx>
> Sent: Friday, November 21, 2014 12:06:27 PM
> Subject: Re: v3.6.1 vs v3.5.2 self heal - help (Nagios related)
>
>
>
> On November 20, 2014 10:01:45 PM PST, Anuradha Talur <atalur@xxxxxxxxxx>
> wrote:
> >
> >
> >----- Original Message -----
> >> From: "Vince Loschiavo" <vloschiavo@xxxxxxxxx>
> >> To: "gluster-users@xxxxxxxxxxx" <Gluster-users@xxxxxxxxxxx>
> >> Sent: Wednesday, November 19, 2014 9:50:50 PM
> >> Subject: v3.6.1 vs v3.5.2 self heal - help (Nagios
> >related)
> >>
> >>
> >> Hello Gluster Community,
> >>
> >> I have been using the Nagios monitoring scripts, mentioned in the below
> >> thread, on 3.5.2 with great success. The most useful of these is the self
> >> heal.
> >>
> >> However, I've just upgraded to 3.6.1 in the lab and the self heal daemon
> >> has become quite aggressive. I continually get alerts/warnings on 3.6.1
> >> that virt disk images need self heal, then they clear. This is not the
> >> case on 3.5.2.
> >>
> >> Configuration:
> >> 2 node, 2 brick replicated volume with a 2x1GB LAG network between the
> >> peers, using this volume as a QEMU/KVM virt image store through the FUSE
> >> mount on CentOS 6.5.
> >>
> >> Example:
> >> on 3.5.2:
> >> gluster volume heal volumename info: shows the bricks and number of
> >> entries to be healed: 0
> >>
> >> On v3.5.2 - During normal gluster operations, I can run this command over
> >> and over again, 2-4 times per second, and it will always show 0 entries to
> >> be healed. I've used this as an indicator that the bricks are synchronized.
> >>
> >> Last night, I upgraded to 3.6.1 in the lab and I'm seeing different
> >> behavior. Running gluster volume heal volumename info, during normal
> >> operations, will show a file out-of-sync, seemingly between every block
> >> written to disk and then synced to the peer. I can run the command over
> >> and over again, 2-4 times per second, and it will almost always show
> >> something out of sync. The individual files change, meaning:
> >>
> >> Example:
> >> 1st run: shows file 1 out of sync
> >> 2nd run: shows file 2 and file 3 out of sync, but file 1 is now in sync
> >> (not in the list)
> >> 3rd run: shows file 3 and file 4 out of sync, but file 1 and file 2 are in
> >> sync (not in the list)
> >> ...
> >> nth run: shows 0 files out of sync
> >> nth+1 run: shows file 3 and file 12 out of sync.
> >>
> >> From looking at the virtual machines running off this gluster volume, it's
> >> obvious that gluster is working well. However, this plays havoc with Nagios
> >> alerting: Nagios will run heal info, get different and non-useful results
> >> each time, and send alerts.
> >>
> >> Is this behavior change (3.5.2 vs 3.6.1) expected? Is there a way to tune
> >> the settings or change the monitoring method to get better results into
> >> Nagios?
> >>
> >In 3.6.1, the way the heal info command works is different from 3.5.2. In
> >3.6.1, it is the self-heal daemon that gathers the entries that might need
> >healing. Currently, in 3.6.1, there is no way to distinguish, while listing,
> >between a file that is being healed and a file with ongoing I/O. Hence files
> >undergoing normal I/O are also listed in the output of the heal info command.
>
> How did that regression pass?!
>
Test cases to check this condition were not written in the regression tests.
--
Thanks,
Anuradha.
-Vince Loschiavo
_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://supercolony.gluster.org/mailman/listinfo/gluster-users