On Monday 29 March 2010, Jiri Slaby wrote:
> On 03/26/2010 11:04 PM, Rafael J. Wysocki wrote:
> > I have some other comments to this patch, but I'd like to understand
> > what really happens here, since the changelog is not too verbose.
> >
> > Please explain the design here.
>
> Yeah, my bad.
>
> What I want to achieve is to revert the behaviour of the current
> implementation to the one used in toi. Currently, when writing a page
> to the swap, the hibernation code calls snapshot_read_page(page) and
> swap_write_page(page) in a loop. This is OK until one needs to do
> anything with the page.
>
> When compression or encryption (or both) need to be done, this scenario
> does not work well, because the page returned by snapshot_read_page(),
> after going through the crypto layers, is no longer PAGE_SIZE big.

Actually, s2disk has the same problem, because it only receives entire
pages from the kernel and it needs to compress/encrypt them before
writing them to the storage (which is also written in blocks that are
page-size-aligned). Still, s2disk uses snapshot_[read|write]_next()
through the functions in user.c, so I don't think we need to reorder
things in there.

> Probably the easiest solution is to revert it as noted above: a page is
> taken from the snapshot (with the patches I have here, the snapshot
> layer is only told to "store next page" without returning a page to the
> caller), fed through the crypto layers as needed and finally given to a
> chunk writer which assembles PAGE_SIZE blocks from the chunks. Then
> whole pages of compressed/encrypted data are given to the user or to
> in-kernel block I/O by hibernate_io_ops->write_page. The in-kernel
> .write_page simply calls swap_write_page() (which in turn calls
> hib_bio_write_page() while storing swap sector entries).

That's fine, if I understand correctly.
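For illustration, the chunk writer described above could look roughly
like the following userspace sketch. Everything here is hypothetical
(struct chunk_writer, cw_feed(), cw_flush() are made-up names), with the
write_page callback standing in for hibernate_io_ops->write_page:

```c
#include <string.h>
#include <stddef.h>

#define PAGE_SIZE 4096

/* Hypothetical chunk writer: accumulates variable-sized chunks
 * (e.g. compressed/encrypted output) and emits only full
 * PAGE_SIZE blocks via the write_page() callback. */
struct chunk_writer {
	unsigned char page[PAGE_SIZE];
	size_t used;				/* bytes buffered so far */
	int (*write_page)(const void *);	/* hibernate_io_ops->write_page stand-in */
	unsigned long pages_written;
};

static int cw_feed(struct chunk_writer *cw, const void *chunk, size_t len)
{
	const unsigned char *p = chunk;

	while (len) {
		size_t n = PAGE_SIZE - cw->used;

		if (n > len)
			n = len;
		memcpy(cw->page + cw->used, p, n);
		cw->used += n;
		p += n;
		len -= n;

		if (cw->used == PAGE_SIZE) {	/* a whole page is ready */
			int err = cw->write_page(cw->page);

			if (err)
				return err;
			cw->pages_written++;
			cw->used = 0;
		}
	}
	return 0;
}

/* Flush a final, partially filled page (zero-padded). */
static int cw_flush(struct chunk_writer *cw)
{
	int err;

	if (!cw->used)
		return 0;
	memset(cw->page + cw->used, 0, PAGE_SIZE - cw->used);
	err = cw->write_page(cw->page);
	if (err)
		return err;
	cw->pages_written++;
	cw->used = 0;
	return 0;
}
```

A trailing partial page has to be padded and flushed at the end, since
the storage is written in page-size-aligned blocks.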
> User .write_page, with buf as a parameter, does the following:
> * to_do_buf = buf
> * set WORK bit
> * wake up .read (below)
> * wait for WORK bit to clear
>
> The writer in this case is a user process performing fops->read. .read
> does the following:
> * wait for WORK bit
> * copy_to_user(to_do_buf)
> * clear WORK bit
> * wake up .write_page
>
> I need the barrier to ensure the "to_do_buf = buf" assignment is not
> reordered with the "set WORK bit" in .write_page. Otherwise .read will
> see the WORK bit set, but to_do_buf not updated. I certainly can lock
> the two operations together, but (a) I see no point in doing so; (b) it
> makes the code worse (I need to unlock and relock in wait_event and
> unlock on all fail paths).
>
> Something very similar happens for fops->write, ergo
> hibernate_io_ops->read_page.

You can't open /dev/snapshot for reading and writing at the same time,
because that wouldn't make any sense. /dev/snapshot represents a
hibernation image which is presented to user space as a "file", so that
user space is able to use "regular" read() and write() on it. However,
the situation during hibernation is much different from the one during
resume.

Namely, during hibernation the image is there in memory and it can't be
modified. It can only be read and written to the storage. So, at this
stage it only makes sense to open /dev/snapshot for reading.

Now, during resume the image is not present in memory at all. In fact,
we don't even use any contiguous region to store the pages read from the
on-disk image. Instead, the pages are put directly into their
destination page frames, if those are free. Otherwise, they are put into
free page frames and then copied back to the target ones right before
control is returned to the image kernel. So during resume there's
nothing that could be read from /dev/snapshot and you can only open it
for writing.
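As a quick userspace model of the handshake quoted above (not the kernel
code; C11 atomics and spinning instead of wait_event()/wake_up(), and
all names made up): the release store that sets the WORK flag provides
exactly the ordering the barrier is there for, so a reader that sees
WORK set also sees the updated to_do_buf.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <string.h>

static const char *to_do_buf;	/* shared, handed off under the flag */
static atomic_int work;		/* 1 = a buffer is pending */
static char consumed[64];

/* Producer side: model of the user .write_page. */
static void write_page(const char *buf)
{
	to_do_buf = buf;			/* 1. publish the buffer */
	atomic_store_explicit(&work, 1,
			      memory_order_release); /* 2. set WORK (the barrier) */
	while (atomic_load_explicit(&work, memory_order_acquire))
		;				/* 3. wait for WORK to clear */
}

/* Consumer side: model of the process doing fops->read. */
static void *reader(void *arg)
{
	(void)arg;
	while (!atomic_load_explicit(&work, memory_order_acquire))
		;				/* 1. wait for WORK */
	strcpy(consumed, to_do_buf);		/* 2. "copy_to_user" */
	atomic_store_explicit(&work, 0,
			      memory_order_release); /* 3. clear WORK, "wake up" writer */
	return NULL;
}
```

Without the release/acquire pairing (or an explicit barrier), the
compiler or CPU would be free to reorder the to_do_buf store past the
flag store, which is the exact failure mode described in the quote.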
The idea is that snapshot_read_next() will only be used during
hibernation for reading all of the image pages as a contiguous stream of
data (the real image is not contiguous in general), because that's what
the user space needs. Then, the observation is that the swap.c functions
can use the same mechanism, although that is not really necessary in
principle.

Analogously, snapshot_write_next() is only useful during resume for
writing all of the image pages as a contiguous stream of data, because
that's needed by the user space and _can_ be used by the in-kernel code
as well.

[Note that snapshot_[read|write]_next() belong to the memory management
part of the hibernation subsystem and that's exactly why they are in
snapshot.c.]

Now, compression can happen in two places: while the image is created,
or after it has been created (the current behavior). In the latter case,
the image pages need not be compressed in place; they may be compressed
after being returned by snapshot_read_next(), in a temporary buffer
(that's how s2disk does it). So you can arrange things like this:

create image
repeat:
- snapshot_read_next() -> buffer
- if buffer is full, compress it (possibly encrypt it) and write the
  result to the storage

This way you'd just avoid all of the complications and I fail to see any
drawbacks.

Now, an attractive thing would be to compress the data while creating
the image, and that may be done in the following way:

have a buffer ready
repeat:
- copy image pages to the buffer (instead of copying them directly into
  the image storage space)
- if the buffer is full, compress it and copy the result to the image
  storage space, page by page

Thanks,
Rafael
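The first arrangement above (fill a buffer from the snapshot, compress
when full, write out) can be sketched in userspace C. All names and the
batch size here are hypothetical; the stubs stand in for
snapshot_read_next(), a real compressor (LZO etc.), and the storage:

```c
#include <string.h>
#include <stddef.h>

#define PAGE_SIZE 4096
#define BUF_PAGES 8		/* compress in 8-page batches; arbitrary choice */

static int pages_left = 20;	/* fake 20-page image */

/* snapshot_read_next() analogue: hands out one image page at a time,
 * returns the number of bytes produced, 0 when the image is done. */
static int snapshot_read_next_stub(void *page)
{
	if (!pages_left)
		return 0;
	memset(page, 0xab, PAGE_SIZE);
	pages_left--;
	return PAGE_SIZE;
}

/* Identity "compression"; real code would run LZO or similar here. */
static size_t compress_stub(const void *in, size_t len, void *out)
{
	memcpy(out, in, len);
	return len;
}

static size_t storage_bytes;	/* byte counter standing in for the storage */

/* create image, then:
 * repeat: read_next -> buffer; if buffer full, compress and write. */
static void save_image_compressed(void)
{
	static unsigned char buf[BUF_PAGES * PAGE_SIZE];
	static unsigned char cbuf[BUF_PAGES * PAGE_SIZE];
	size_t used = 0;
	int n;

	while ((n = snapshot_read_next_stub(buf + used)) > 0) {
		used += n;
		if (used == sizeof(buf)) {	/* buffer is full */
			storage_bytes += compress_stub(buf, used, cbuf);
			used = 0;
		}
	}
	if (used)				/* final partial buffer */
		storage_bytes += compress_stub(buf, used, cbuf);
}
```

Note the snapshot layer is only read from; the compression happens
entirely in the temporary buffer, which is the point of the scheme.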