Re: Re: SMP and GFS

DeadManMoving <sequel@xxxxxxxxxxxx> · Sun, 02 Oct 2005 11:06:07 -0400

I'm running a two-node cluster without GFS (only using clurgmgrd to
export an NFS share) on IBM x346 servers (Pentium 4 Xeon (Foster)) with
the SMP kernel (2.6.9-11smp) and, while I do not see those errors in my
logs, I do see them in /proc/cluster/dlm_debug:

Magma send einval to 1
Magma send einval to 1
Magma send einval to 1
Magma send einval to 1
Magma send einval to 1
Magma (3055) req reply einval 440255 fr 1 r 1 usrm::rg="home_ma
Magma (3055) req reply einval 4b0262 fr 1 r 1 usrm::rg="home_ma
Magma send einval to 1
Magma (11923) req reply einval 5300f1 fr 1 r 1 usrm::vf
Magma send einval to 1
Magma (3055) req reply einval 530338 fr 1 r 1 usrm::vf
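
To compare how often these messages show up per lockspace, the lines above can be tallied with a short script. This is only a rough sketch: the two regular expressions are inferred from the sample lines quoted in this thread, not from any documented DLM log format, and the names tally_einval, SEND_RE, REPLY_RE and SAMPLE are hypothetical helpers made up for illustration.

```python
import re
from collections import Counter

# Patterns inferred only from the sample lines in this thread (an assumption,
# not a documented format). They also match when a syslog prefix precedes them.
SEND_RE = re.compile(r"(?P<space>\S+) send einval to (?P<node>\d+)")
REPLY_RE = re.compile(
    r"(?P<space>\S+) \((?P<pid>\d+)\) req reply einval "
    r"(?P<lkid>[0-9a-f]+) fr (?P<node>\d+)"
)


def tally_einval(lines):
    """Count einval messages per (lockspace, kind) over an iterable of lines."""
    counts = Counter()
    for line in lines:
        match = REPLY_RE.search(line)
        if match:
            counts[(match.group("space"), "req reply")] += 1
            continue
        match = SEND_RE.search(line)
        if match:
            counts[(match.group("space"), "send")] += 1
    return counts


# A few of the lines from the excerpt above, used as sample input.
SAMPLE = """\
Magma send einval to 1
Magma (3055) req reply einval 440255 fr 1 r 1 usrm::rg="home_ma
Magma send einval to 1
Magma (11923) req reply einval 5300f1 fr 1 r 1 usrm::vf
"""

if __name__ == "__main__":
    for (space, kind), n in sorted(tally_einval(SAMPLE.splitlines()).items()):
        print(f"{space} {kind}: {n}")
```

On a live node the same function could presumably be fed the debug file directly, e.g. tally_einval(open("/proc/cluster/dlm_debug")), assuming the file is readable by the user running the script.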

My cluster is highly unstable; just this morning I realized that the
clurgmgrd daemon was dead...

Can someone at Red Hat shed some light on this?

Thanks, Tony Lapointe.

On Sun, 2005-10-02 at 12:23 +0200, Axel Thimm wrote:
> On Thu, Jul 14, 2005 at 04:57:51PM -0400, Manuel Bujan wrote:
> > Is there any issue I should be aware of if SMP is enabled in
> > my kernel? What if I compile my kernel to be pre-emptible? Is there
> > any problem with that and GFS?
> > 
> > I am running GFS in a dual Xeon server from DELL.
> 
> > After running my GFS setup for a long time, I got the following error
> > on one of our cluster servers, and I had to reboot it in order to
> > re-establish the service:
> 
> > 
> > #################################################################################
> > Jul 14 14:19:35 atmail-2 kernel:  2
> > Jul 14 14:19:35 atmail-2 kernel: gfs001 (18044) req reply einval ae2c0092 fr 1 r 1        2
> > Jul 14 14:19:35 atmail-2 kernel: gfs001 (31381) req reply einval bf9901e7 fr 1 r 1        2
> > Jul 14 14:19:35 atmail-2 kernel: gfs001 (2023) req reply einval d6c30333 fr 1 r 1        2
> > Jul 14 14:19:35 atmail-2 kernel: gfs001 send einval to 1
> > Jul 14 14:19:35 atmail-2 last message repeated 2 times
> 
> I found similar log snippets on a RHEL4U1 machine with dual Xeons (HP
> ProLiant). The machine crashed with a kernel panic shortly after
> telling the other nodes to leave the cluster (sorry, the staff was
> under pressure and no one wrote down the panic's output):
> 
> Sep 30 05:08:11 zs01 kernel: nval to 1 (P:kernel)
> Sep 30 05:08:11 zs01 kernel: data send einval to 1 (P:kernel)
> Sep 30 05:08:11 zs01 kernel: Magma send einval to 1 (P:kernel)
> Sep 30 05:08:11 zs01 kernel: data send einval to 1 (P:kernel)
> Sep 30 05:08:11 zs01 kernel: Magma send einval to 1 (P:kernel)
> Sep 30 05:08:33 zs03 kernel: CMAN: removing node zs02 from the cluster : Missed too many heartbeats (P:kernel)
> Sep 30 05:08:39 zs03 kernel: CMAN: removing node zs01 from the cluster : No response to messages (P:kernel)
> Sep 30 05:08:45 zs03 kernel: CMAN: quorum lost, blocking activity (P:kernel)
> 
> Searching for the einval messages, I found only this post here. So it
> doesn't seem to happen that often. OTOH it's the same hardware,
> perhaps dual Xeons are not good for GFS and/or the cluster
> infrastructure?
> 
> In my case kernel and GFS bits are all from Red Hat, no self built
> components other than a qla2xxx driver, but the issue is on the
> cluster communication side.
> --
> 
> Linux-cluster@xxxxxxxxxx
> https://www.redhat.com/mailman/listinfo/linux-cluster
