> only issue.
>
> > > > A P.S. here, I just looked over the agent list rebuilding code, and
> > > > my race detector is beeping full on. I'll have a ponder before
> > > > offering specifics though.
> > > >
> > > That's a question I have. The ast() function gets called by
> > > dlm_dispatch(), right? If so, I don't see the race. If not, there is
> > > one hell of a dangerous race. If the agent_list is changing when an
> > > agent is trying to contact the other agents, bad things will most
> > > likely happen.
> >
> > If the list changes and we don't know about the changes while waiting to
> > get answers back from other agents, we're dead in the water. So the
> > recovery algorithm must handle membership changes that happen in
> > parallel. After much pondering, I think I've got a reasonably simple
> > algorithm, I'll write it up now.
>
> Um... but since we wait for agent responses in the same poll loop that we
> wait for membership change notifications, these two things already do
> happen in parallel... well... mostly. The only issue I see is that we could
> get the event from magma, and then block trying to get the member_list.
> But since that's a local call, if that hangs forever, then cman is in
> trouble, and there isn't much we can do anyway. But there is no chance of
> not getting a membership change because we are waiting on an agent
> response.

Just to clarify, the issue that I mentioned earlier is this: if the ast()
code and the rebuild_agent_list() code ever executed at the same time (which
I don't believe they can), they would both be using the same data
structures, and could muck each other up.

> > Well, we have for sure gotten to the interesting part of this, how about
> > we continue in linux-cluster?
> >
> Sure. But I'm not sure if anyone else is interested in implementation
> details.
>
> > Regards,
> >
> > Daniel
>
> -Ben
>
> --
>
> Linux-cluster@xxxxxxxxxx
> http://www.redhat.com/mailman/listinfo/linux-cluster
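
For anyone following along, here is a rough C sketch of the "one poll loop" arrangement discussed above. It is only an illustration under stated assumptions, not the actual dlm or magma code: run_poll_once(), struct loop_stats, and the two stub handlers are hypothetical names, and the two descriptors merely stand in for the magma event fd and an agent connection. The point it tries to show is that, because both sources are serviced from one thread, the membership handler and the agent-reply handler run one after the other and never touch the shared list concurrently.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical bookkeeping; stands in for the real shared agent list. */
struct loop_stats {
    int membership_events;
    int agent_replies;
};

/* Hypothetical stand-in for the list rebuild done on a membership change. */
static void on_membership_change(struct loop_stats *s)
{
    s->membership_events++;
}

/* Hypothetical stand-in for processing one reply from another agent. */
static void on_agent_reply(struct loop_stats *s)
{
    s->agent_replies++;
}

/* Consume one byte so the descriptor stops reporting readable. */
static int drain_one(int fd)
{
    char c;
    return read(fd, &c, 1) == 1;
}

/*
 * Wait on both descriptors in a single poll() call and dispatch whatever
 * is ready. Returns the poll() result: the number of ready descriptors,
 * 0 on timeout, or -1 on error.
 */
int run_poll_once(int membership_fd, int agent_fd, int timeout_ms,
                  struct loop_stats *s)
{
    struct pollfd fds[2];
    int n;

    assert(s != NULL);

    fds[0].fd = membership_fd;
    fds[0].events = POLLIN;
    fds[0].revents = 0;
    fds[1].fd = agent_fd;
    fds[1].events = POLLIN;
    fds[1].revents = 0;

    n = poll(fds, 2, timeout_ms);
    if (n <= 0)
        return n;

    /* Membership changes are handled first, then agent replies; the two
     * handlers are serialized because they run on the same thread. */
    if ((fds[0].revents & POLLIN) && drain_one(membership_fd))
        on_membership_change(s);
    if ((fds[1].revents & POLLIN) && drain_one(agent_fd))
        on_agent_reply(s);

    return n;
}
```

Calling run_poll_once() repeatedly from a single thread gives the behaviour described in the thread: waiting on an agent reply cannot hide a membership change, since both arrive through the same poll() call. Handling membership first is a choice made for this sketch only.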