Everyone,
I have a GFS cluster that has been running fine for about 3 months now. I am only using 2 machines, and the storage is a FireWire drive. Over the weekend I started to get: lock_gulmd_core[1100]: Failed to receive a timely heartbeat reply from Master. (t:1108583425506998 mb:1)
After 2 missed heartbeats, which is what I had allowed, the GFS slave would die. I raised the allowed misses to 5 and have since seen it miss as many as 4 in a row. The only thing that has changed on the machine is that I added new clients that process files once they are placed on the machine. I am using FAM to notify my app when a new file is present.
Any ideas on what I should look at? How can I diagnose this problem? The communication between the two machines seems fine; I can ping both hosts. I am really at a loss as to what to look for.
It sounds like it might be a load problem: the node is busy doing other work and doesn't give the gulm core process enough time to do heartbeats. You should try increasing the heartbeat_rate some, and maybe nicing the lock_gulmd processes.
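Roughly the kind of change I mean, as a sketch only (the hostnames and values below are made up, and the exact cluster.ccs layout varies between GFS releases, so check the syntax against the docs for your version):

    # cluster.ccs -- lock_gulm section (illustrative values)
    lock_gulm {
        servers = ["node1", "node2"]   # your existing lock server list
        heartbeat_rate = 30            # seconds between heartbeats; raise from the default
        allowed_misses = 5             # misses tolerated before a node is expired
    }

    # and/or give the running lock_gulmd processes a higher scheduling priority:
    renice -n -5 -p $(pidof lock_gulmd)

After editing cluster.ccs you would need to redeploy the config archive and restart the lock servers for the new rate to take effect; the renice is only a stopgap until the next restart.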
If you want to try tracking all of the heartbeat messages, set the 'heartbeat' verbosity flag.
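Something along these lines, assuming your version takes the verbosity setting from the lock_gulm section of cluster.ccs (that parameter name is my assumption here, so double-check it for your release):

    lock_gulm {
        ...
        verbosity = "Default,Heartbeat"   # add heartbeat tracing on top of the normal messages
    }

The extra heartbeat messages go to syslog, so you can see exactly when replies are sent and when they are missed relative to the load on the node.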
--
Michael Conrad Tadpol Tilstra
AH! Get off my leg! You're Not my type!!!!