On Wed, 2005-01-12 at 00:58, Patrick Caulfield wrote:
> On Tue, Jan 11, 2005 at 05:00:46PM -0800, Daniel McNeil wrote:
> > On Tue, 2005-01-11 at 00:56, Patrick Caulfield wrote:
> > > On Wed, Dec 22, 2004 at 09:33:39AM -0800, Daniel McNeil wrote:
> > > > How long does cman stay up in your testing?
> > >
> > > With the higher priority on the heartbeat thread I got 5 days before iSCSI died
> > > on me again... This isn't quite the same load as yours but it is on 8 busy nodes.
> >
> > I have not seen 5 days yet on my set. See my email from yesterday.
> > Is the code to have higher priority for the heartbeat thread
> > already checked in? I restarted my test yesterday and it is
> > still going, but it usually has trouble after 50 hours or so.
> >
>
> It's rev 1.45 of membership.c checked in on the 7th Jan. If that hasn't fixed it
> I'll have to dabble with realtime things as it does seem now that the threads
> are not being woken up, even though the timer is firing.

I'm running from code as of Jan 4th, so I do not have that change.
I'll update my code.

2 nodes died last night running my tests with

echo "9" > /proc/cluster/config/cman/max_retries
echo "1" > /proc/cluster/config/cman/hello_timer

Here's the output on the console from the 3 nodes:

cl030:
CMAN: no HELLO from cl031a, removing from the cluster
CMAN: node cl032a is not responding - removing from the cluster
CMAN: quorum lost, blocking activity

cl031:
CMAN: node cl030a is not responding - removing from the cluster
CMAN: node cl032a is not responding - removing from the cluster
SM: Assertion failed on line 67 of file /Views/redhat-cluster/cluster/cman-kernel/src/sm_membership.c
SM: assertion: "node"
SM: time = 115176056

Kernel panic - not syncing: SM: Record message above and reboot.

Message from syslogd@cl031 at Wed Jan 12 01:17:57 2005 ...
Record message above and reboot. syncing: SM:

cl032:
CMAN: too many transition restarts - will die

Daniel