Hi all,
I've got a three-node CentOS 5 x86-64 CS/GFS cluster running kernel
2.6.18-53.el5. Last night, I tried to grow two of the file systems on it.
I ran lvextend and then gfs_grow on node3, with node2 serving the file
systems out to the local network. While gfs_grow was running, the service
on node2 failed and I couldn't get it to restart. It looked to me like
neither node1 nor node2 was aware of the lvextend I had run on node3, and
I had to reboot the whole cluster to bring everything back online.
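For reference, the sequence I ran on node3 looked roughly like this (the
VG, LV, and mount point names below are placeholders rather than the exact
ones I used; gfs_grow is run against the mount point of the mounted file
system):

[root@node3 ~]# lvextend -L +50G /dev/vg_cluster/lv_fs1
[root@node3 ~]# gfs_grow /mnt/fs1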
This afternoon, node2 fenced node3. Nothing migrated, and the entire
cluster needed to be rebooted again to recover. What I noticed after the
full reboot is that I seem to be getting initial ARP responses from the wrong
nodes, as below:
[root@workstation ~]# arping cluster-fs1
ARPING 10.1.1.142 from 10.1.1.101 eth0
Unicast reply from 10.1.1.142 [00:1B:78:D1:88:C2] 0.624ms
Unicast reply from 10.1.1.142 [00:1C:C4:81:9F:66] 0.666ms
Unicast reply from 10.1.1.142 [00:1C:C4:81:9F:66] 0.621ms
Sent 2 probes (1 broadcast(s))
Received 3 response(s)
[root@workstation ~]# arping cluster-fs2
ARPING 10.1.1.143 from 10.1.1.101 eth0
Unicast reply from 10.1.1.143 [00:1B:78:D1:88:C2] 0.695ms
Unicast reply from 10.1.1.143 [00:1C:C4:81:9F:66] 0.734ms
Unicast reply from 10.1.1.143 [00:1C:C4:81:9F:66] 0.680ms
Sent 2 probes (1 broadcast(s))
Received 3 response(s)
[root@workstation ~]# arping cluster-fs3
ARPING 10.1.1.144 from 10.1.1.101 eth0
Unicast reply from 10.1.1.144 [00:1C:C4:81:9F:66] 0.734ms
Unicast reply from 10.1.1.144 [00:1B:78:D1:88:C2] 0.913ms
Unicast reply from 10.1.1.144 [00:1B:78:D1:88:C2] 0.640ms
Sent 2 probes (1 broadcast(s))
Received 3 response(s)
[root@workstation ~]# arping node1
ARPING 10.1.1.131 from 10.1.1.101 eth0
Unicast reply from 10.1.1.1 [00:1B:78:D1:88:C2] 0.771ms
[...]
[root@workstation ~]# arping node2
ARPING 10.1.1.132 from 10.1.1.101 eth0
Unicast reply from 10.1.1.2 [00:1C:C4:81:AD:72] 0.681ms
[...]
[root@workstation ~]# arping node3
ARPING 10.1.1.133 from 10.1.1.101 eth0
Unicast reply from 10.1.1.3 [00:1C:C4:81:9F:66] 0.631ms
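The two MACs answering for the fs addresses are the same ones that show up
when I arping node1 and node3. I suppose the next step is to check on each
node which of the service IPs is actually configured locally; something
along these lines should show it (eth0 is just my guess at the right
interface name):

[root@node1 ~]# ip -o link show eth0
[root@node1 ~]# ip -o -4 addr show eth0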
At the time, node1 was supposed to be serving fs1, fs2, and fs3. I'll note
that I did forget to run "lvmconf --enable-cluster" when I first set up the
volume group, though I did make that change before putting the cluster into
production.
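I haven't yet confirmed that clustered locking actually took effect on
every node after that change. I assume something like the following on each
node would show it (a 'c' in the sixth VG attribute position should
indicate the clustered flag):

[root@node1 ~]# grep locking_type /etc/lvm/lvm.conf
[root@node1 ~]# service clvmd status
[root@node1 ~]# vgs -o vg_name,vg_attr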
Anyone have any thoughts on what's going on and what to do about it?
Thanks,
James