[Linux-cluster] lock_gulm heartbeat

Everyone,


I have a GFS cluster that has been running fine for about three months now.
I am only using two machines, and the storage is a FireWire drive. Over the
weekend I started to get:
lock_gulmd_core[1100]: Failed to receive a timely heartbeat reply from
Master. (t:1108583425506998 mb:1)

and after 2 misses, which is what I had allowed for missed heartbeats, the
GFS slave would die. I raised the allowed misses to 5 and have since seen it
miss as many as 4 in a row. The only thing that has changed on the machine
is that I add new clients to process files once they are placed on the
machine. I am using FAM to notify my app that a new file is present.
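For reference, this is roughly where the heartbeat tuning lives. A minimal sketch of the lock_gulm section of cluster.ccs, assuming the GULM-style parameter names; the node names and values here are illustrative, not my actual config:

```
# cluster.ccs -- lock_gulm section (illustrative sketch)
cluster {
    name = "mycluster"
    lock_gulm {
        servers = ["node1", "node2"]
        heartbeat_rate = 15.0   # seconds between heartbeat probes
        allowed_misses = 5      # misses tolerated before the node is expired
    }
}
```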

Any ideas on what I should look at? How can I diagnose this problem? The
communication between the two machines seems fine; I can ping both
hosts. I am really at a loss as to what to look for.
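One thing I have been trying: a single ping only shows reachability at one instant, so I log round-trip times over a long window and compare the worst-case RTT against the heartbeat interval. A sketch of the idea (the sample ping output below is canned; in practice you would pipe in something like `ping -c 1000 master-node`):

```shell
#!/bin/sh
# Pull the worst round-trip time (in ms) out of ping output, so a
# long-running ping can be checked against the heartbeat interval.
max_rtt() {
    # ping lines look like:
    #   64 bytes from ...: icmp_seq=1 ttl=64 time=0.311 ms
    sed -n 's/.*time=\([0-9.]*\) ms.*/\1/p' | sort -n | tail -1
}

# Canned sample input; replace with live ping output in practice.
max_rtt <<'EOF'
64 bytes from 10.0.0.2: icmp_seq=1 ttl=64 time=0.311 ms
64 bytes from 10.0.0.2: icmp_seq=2 ttl=64 time=42.7 ms
64 bytes from 10.0.0.2: icmp_seq=3 ttl=64 time=0.298 ms
EOF
```

If the worst RTT ever approaches the heartbeat interval, the network is suspect; if it stays tiny while heartbeats still go missing, the delay is probably in the daemons themselves (e.g. the box being too busy with the new clients to schedule lock_gulmd in time).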


Thanks for the help.
