On Wed, 2006-12-20 at 12:13 +1100, Nigel Cunningham wrote:
> Hi all.
>
> A long while ago now, I spoke with someone (who I'll keep anonymous)
> about the possibility of suspending a cluster to disk. The person seemed
> to be reasonably excited about the idea, since it would potentially be
> quite useful in a power outage situation with limited UPS capability
> (particularly where the state of computations couldn't easily be
> serialised and restarted later).
>
> I'm now in a situation where I don't have a lot of time to work on it,
> but am interested in starting to make modifications to Suspend2 to add
> such support. Before I do it, though, I wanted to ask whether you guys
> as a whole would be interested in such support, or whether you think I'd
> be wasting my time.

These are not questions looking for answers - they're things to think
about (and there will be more):

* What happens if the suspend fails for one or more nodes? Is the
  cluster state lost as a whole?

* What if the resume fails for one or more nodes? How do you handle
  getting the cluster back online automatically?

* No matter how well STD & resume work, there will be changes while the
  cluster is offline which you will need to be able to handle during /
  after the resume phase (TCP connections & DHCP leases time out for
  example).

> After that, I'd like to work toward
> supporting suspending to shared storage.

On suspending to shared storage:

* Do you intend to be able to use this to replace machines?

* How can one prevent a machine from resuming from the wrong memory
  image (or two machines resuming from the same image)?

-- Lon
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster