On 12/04/2012 02:43 AM, Andrew Morton wrote:
> On Fri, 30 Nov 2012 21:55:00 +0400
> Pavel Emelyanov <xemul@xxxxxxxxxxxxx> wrote:
>
>> This is an attempt to implement support for memory snapshot for the
>> checkpoint-restore project (http://criu.org).
>>
>> To create a dump of an application(s) we save all the information about it
>> to files. No surprise, the biggest part of such a dump is the contents of tasks'
>> memory. However, in some usage scenarios it's not required to get _all_ the
>> task memory while creating a dump. For example, when doing periodic dumps
>> it's required to take a full memory dump only at the first step and then
>> take incremental changes of memory. Another example is live migration. In the
>> simplest form it looks like -- create dump, copy it to the remote node, then
>> restore tasks from dump files. While all this dump-copy-restore thing goes on,
>> all the processes must be stopped. However, if we can monitor how tasks change
>> their memory, we can dump and copy it in smaller chunks, periodically updating
>> it and thus freezing tasks only at the very end for a very short time to pick
>> up the recent changes.
>>
>> That said, some help from the kernel to watch how processes modify the contents
>> of their memory is required. I'd like to propose one possible solution to this
>> task -- with the help of page-faults and trace events.
>>
>> Briefly the approach is -- remap some memory regions as read-only, get the #pf
>> on a task's attempt to modify the memory and issue a trace event for that. Since
>> we're only interested in parts of memory of some tasks, make it possible to mark
>> the vmas we're interested in and issue events for them only. Also, to be aware
>> of tasks unmapping the vma-s being watched, issue an event when the marked
>> vma is removed (and for symmetry -- an event when a vma is marked).
>>
>> What do you think about this approach? Is this way of supporting mem snapshot
>> OK for you, or should we invent some better one?
>
> The patches look pretty simple.
>
> Some performance numbers would be useful.
>
> Is it reliable? Under what circumstances will the trace system drop
> events?

AFAIS when the buffer for events overflows, but the buffer size can be tuned.
I will write some more descriptive text about it if the tracing approach
is considered to be the way to go.

> Please cc Steven Rostedt on tracing stuff - he is a diligent reviewer.

OK.

> The proposed interface might be useful to things other than c/r. But
> it hasn't actually been described. Please include a full description
> of the proposed kernel/userspace interface.

OK, will try to address that.

> Two alternatives come to mind:
>
> 1) Use /proc/pid/pagemap (Documentation/vm/pagemap.txt) in some
>    fashion to determine which pages have been touched.

I thought about this. Unfortunately there are no free bits left in the pagemap
entry. What can we do about it (other than introducing the pagemap2 file)?

> 2) At pagefault time, don't send an event: just mark the vma as
>    "touched". Then add a userspace interface to sweep the vma tree
>    testing, clearing and reporting the touched flags.

Per-vma granularity is not enough. In OpenVZ we've observed Oracle touching
several pages in a hundred-megs anon mapping. Marking _part_ of the vma with
the "note write-faults" bit would help, but there are currently no APIs that
modify a vma and report some info back at the same time. Can you propose what
it could look like?

> 2a) Avoid the full linear search by propagating the "touched" flag
>     up the rbtree and do the sweep in a fashion similar to
>     radix_tree_for_each_tagged().

Thanks,
Pavel
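
To illustrate the "no free bits left in the pagemap entry" point, here is a minimal sketch of decoding one 64-bit /proc/pid/pagemap entry. It assumes the bit layout described in Documentation/vm/pagemap.txt around the time of this thread (PFN or swap info in bits 0-54, page shift in bits 55-60, flags in bits 61-63); other kernel versions may lay the entry out differently. The helper names (decode_pagemap_entry, count_free_bits, read_pagemap_entry) are hypothetical and not part of any kernel or CRIU API.

```python
import os
import struct

# Assumed layout of one pagemap entry (see lead-in; not valid for all kernels):
#   bits 0-54   PFN if present, else swap type (0-4) and swap offset (5-54)
#   bits 55-60  page shift (page size = 1 << page shift)
#   bit  61     page is file-page or shared-anon
#   bit  62     page swapped
#   bit  63     page present
LOW_MASK = (1 << 55) - 1
PAGE_SHIFT_MASK = 0x3F << 55
FILE_OR_SHARED_BIT = 1 << 61
SWAPPED_BIT = 1 << 62
PRESENT_BIT = 1 << 63


def decode_pagemap_entry(entry: int) -> dict:
    """Split a raw 64-bit pagemap entry into its documented fields."""
    info = {
        "present": bool(entry & PRESENT_BIT),
        "swapped": bool(entry & SWAPPED_BIT),
        "file_or_shared_anon": bool(entry & FILE_OR_SHARED_BIT),
        "page_shift": (entry & PAGE_SHIFT_MASK) >> 55,
        "pfn": None,
        "swap_type": None,
        "swap_offset": None,
    }
    low = entry & LOW_MASK
    if info["present"]:
        info["pfn"] = low
    elif info["swapped"]:
        info["swap_type"] = low & 0x1F
        info["swap_offset"] = low >> 5
    return info


def count_free_bits() -> int:
    """Count bits of the 64-bit entry not claimed by the assumed layout."""
    used = (LOW_MASK | PAGE_SHIFT_MASK | FILE_OR_SHARED_BIT
            | SWAPPED_BIT | PRESENT_BIT)
    return 64 - bin(used).count("1")


def read_pagemap_entry(pid: int, vaddr: int) -> dict:
    """Read and decode the entry for one virtual address (Linux only)."""
    page_size = os.sysconf("SC_PAGE_SIZE")
    with open(f"/proc/{pid}/pagemap", "rb") as f:
        f.seek((vaddr // page_size) * 8)  # one 8-byte entry per virtual page
        raw = f.read(8)
    if len(raw) != 8:
        raise ValueError("address is outside the readable pagemap range")
    (entry,) = struct.unpack("=Q", raw)  # native byte order, 64-bit
    return decode_pagemap_entry(entry)
```

Under this assumed layout count_free_bits() returns 0: every bit already carries a meaning, so a per-page "touched" flag would have to either reuse an existing field or go into a separate file.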