Hi,

On Tuesday 14 March 2006 00:11, Nigel Cunningham wrote:
> On Tuesday 14 March 2006 08:42, Rafael J. Wysocki wrote:
> > On Monday 13 March 2006 23:08, Pavel Machek wrote:
> > > > > > Yep, I call that suspend-to-both. It is planned, but not really
> > > > > > trivial, and I'm a little busy. If someone wants to help....
> > > > >
> > > > > I was thinking a few days ago. With your move of all this stuff to
> > > > > userspace, if it was done in multiple stages, we could implement
> > > > > a form of checkpointing this way.
> > > > >
> > > > > So instead of doing the 'suspend to disk/ram' after 'write out all
> > > > > pages', we just continue.
> > > > >
> > > > > Why is this useful ? We've seen bugs reported that only ever bite
> > > > > customers after they've run their workload for a month. Now, if they
> > > > > had a means of checkpointing, then when it crashes, they could
> > > > > capture the last image that landed somewhere, and set that up for
> > > > > more tests/monitoring with kprobes etc and reproduce those
> > > > > hard-to-reproduce bugs a lot faster.
> > > >
> > > > I've been asked about this from time to time too. Apart from the issues
> > > > Pavel has already mentioned, the big problem in my mind was figuring
> > > > out what to do about disk storage. As the algorithm stands at the
> > > > moment, the image includes information about the state of mounted
> > > > filesystems. We'd need to somehow get rid of or be able to ignore that.
> > > > Any suggestions?
> > >
> > > Well, copying all the filesystems would work, as would having no
> > > filesystems at all :-) [ramdisk case]. And perhaps practical
> > > equivalent of "copy all filesystems" can be done with device mapper.
> > >
> > > [Of course, you'd have to copy all the filesystems back before doing
> > > resume].
> >
> > If we had anything like fs suspend/resume, we could handle such things.
> > We could also handle the "USB device mounted before suspend" problem
> > (I think it's related).
>
> Well, we have bdev freezing, which I guess is what is used for fixing up raid
> mirrors (but don't know for certain). I use it in refrigerating to get XFS to
> really stop activity. I don't think it helps in this case though:

I don't think so either.

> We need to be able to rollback the state of the filesystem in memory and on
> disk to the point where the last checkpoint was made. Memory would be
> straight forward if we want to do it dumbly and slowly - just reload the
> whole check pointed image. If we want to be more efficient, we'd want to just
> load the pages that had changed (Mark on (first) write?). But filesystems
> seem to be a whole different story. Do any of the commonly used fses have
> support for checkpointing and rollback back at the moment?

I'm not sure if we need a rollback as such. What we need is to make sure the
filesystems' state will be consistent before as well as after we have "reloaded"
the snapshot.

Greetings,
Rafael