On Mon, May 07, 2007 at 01:54:56PM -0400, rhurst@xxxxxxxxxxxxxxxxx wrote:
> What could cause clurgmgrd fail like this? If clurgmgrd has a hiccup
> like this, is it supposed to shutdown its services? Is there something
> in our implementation that could have prevented this from shutting down?
>
> For unexplained reasons, we just had our CS service (WATSON) go down on
> its own, and the syslog entry details the event as:
>
> May 7 13:18:39 db1 clurgmgrd[17888]: <err> #48: Unable to obtain
> cluster lock: Connection timed out
> May 7 13:18:41 db1 kernel: dlm: Magma: reply from 2 no lock
> May 7 13:18:41 db1 kernel: dlm: reply
> May 7 13:18:41 db1 kernel: rh_cmd 5
> May 7 13:18:41 db1 kernel: rh_lkid 200242
> May 7 13:18:41 db1 kernel: lockstate 2
> May 7 13:18:41 db1 kernel: nodeid 0
> May 7 13:18:41 db1 kernel: status 0
> May 7 13:18:41 db1 kernel: lkid ee0388
> May 7 13:18:41 db1 clurgmgrd[17888]: <notice> Stopping service WATSON

This is usually a DLM bug. Once the DLM gets into this state, rgmanager
blows up. What rgmanager version are you using?

(There's only one lock per service; the complexity of the service doesn't
matter...)

--
Lon Hohberger - Software Engineer - Red Hat, Inc.

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster