On Thu, 2005-06-23 at 11:23 -0400, Olivier Crete wrote:

> > If you use CMAN's service manager, you will be able to tell if the app
> > has crashed (all nodes in that service group will be notified of the
> > state change).
>
> In the RHEL4 branch there does not seem to be a userspace API for the
> Service manager.. apart from the ioctl and libmagma. Is libmagma your
> long term api ? Also, can libmagma be used in non-GPL apps? I saw some
> scary comments in magmamsg.h...

Correct.  libmagma is LGPL.  The theory is that you could write any app
and link it dynamically against it.  Furthermore, the idea was that, on
the back-end, you could make it load a non-Free plugin to talk to a
non-Free cluster infrastructure.

libmagmamsg is GPL, due to having code chunks from an older GPL project.
The code is quite awful, but it works.  I hope it gets replaced with a
nice cluster-agnostic message system at some point which can be used for
more than just cluster stuff.

> > Internal deadlocks are harder to detect from the cluster infrastructure
> > perspective. I'd consider using the kernel watchdog timer.
>
> An easy way would be to have a cluster watchdog (ie.. the app must
> "ping" the cman daemon at least one in X seconds and if it isnt its
> considered deadlocked..)

As it stands now, kernel-mode cman doesn't have this kind of capability.
I could be mistaken, of course.

> > First off: Generally, an application crashing shouldn't generally cause
> > an eviction of the node from the cluster. There should be other
> > cleanup/coordination mechanisms in place. Ok, that said:
>
> Our application uses semi-shared storage, and if it crashes.. it may
> leave it in an unknown state.. and the easiest way is just to reboot the
> machine and have another machine take over the storage..

I'd definitely wire watchdog timer stuff into your app.  It solves both
the "crash" and "hang" cases.
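
A minimal sketch of what that wiring might look like, assuming the
standard Linux /dev/watchdog character device (for example the softdog
module) and a driver that honours the usual "magic close" convention.
The names Watchdog, run and step are made up for illustration; nothing
here is a cman or magma API.  The app pets the timer only after a unit
of work completes, so both a crash (device closed without the magic
character) and a hang (no more writes) should let the timer expire and
reboot the node.

```python
import os


class Watchdog:
    """Thin wrapper around a Linux watchdog character device (hypothetical helper)."""

    def __init__(self, device="/dev/watchdog"):
        # On typical drivers, opening the device arms the timer.
        self.fd = os.open(device, os.O_WRONLY)

    def pet(self):
        # Any write resets the countdown.
        os.write(self.fd, b"\0")

    def disarm(self):
        # "Magic close": writing 'V' just before close asks the driver to
        # stop the timer. Drivers built with nowayout ignore this.
        os.write(self.fd, b"V")
        os.close(self.fd)
        self.fd = None


def run(step, device="/dev/watchdog"):
    """Call step() until it returns False, petting the watchdog after each
    successful step. An exception or a hang inside step() means no more
    pets and no magic close, so the timer is left to expire."""
    wd = Watchdog(device)
    done = 0
    while step():
        wd.pet()
        done += 1
    wd.disarm()  # reached only on a clean, requested shutdown
    return done
```

Here step() stands in for one iteration of the app's main loop.  The
timeout itself is whatever the driver is configured with; it needs to be
comfortably longer than the slowest legitimate step.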

> > * With libgulm, you can register as an "important" service: "If this
> > process dies, evict & fence me."
>
> But gulm is going away, right ?

Maybe ;)

> > The other caveat was that you didn't want to be controlled by resource
> > scripts / managers, right?
>
> Ideally, I'd want to reduce the amount of forking... Especially when a
> critical event happens.

Ok.

-- Lon

--
Linux-cluster@xxxxxxxxxx
http://www.redhat.com/mailman/listinfo/linux-cluster