Re: Clustering and suspend-to-disk?

Lon Hohberger <lhh@xxxxxxxxxx> · Wed, 20 Dec 2006 10:15:48 -0500

On Wed, 2006-12-20 at 12:13 +1100, Nigel Cunningham wrote:
> Hi all.
> 
> A long while ago now, I spoke with someone (who I'll keep anonymous)
> about the possibility of suspending a cluster to disk. The person seemed
> to be reasonably excited about the idea, since it would potentially be
> quite useful in a power outage situation with limited UPS capability
> (particularly where the state of computations couldn't easily be
> serialised and restarted later).
> 
> I'm now in a situation where I don't have a lot of time to work on it,
> but am interested in starting to make modifications to Suspend2 to add
> such support. Before I do it, though, I wanted to ask whether you guys
> as a whole would be interested in such support, or whether you think I'd
> be wasting my time.

These are not questions looking for answers - they're things to think
about (and there will be more):

* What happens if the suspend fails for one or more nodes?  Is the
cluster state lost as a whole?

* What if the resume fails for one or more nodes?  How do you handle
getting the cluster back online automatically?

* No matter how well suspend-to-disk and resume work, things will change
while the cluster is offline, and you will need to handle those changes
during or after the resume phase (TCP connections and DHCP leases timing
out, for example).

> After that, I'd like to work toward
> supporting suspending to shared storage.

On suspending to shared storage:

* Do you intend to be able to use this to replace machines?

* How can one prevent a machine from resuming from the wrong memory
image (or two machines resuming from the same image)?

-- Lon
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster