Re: [RFC 09/15] PM / Hibernate: user, implement user_ops writer

Hi.

On 30/03/10 09:09, Rafael J. Wysocki wrote:
> On Monday 29 March 2010, Jiri Slaby wrote:
>> On 03/26/2010 11:04 PM, Rafael J. Wysocki wrote:
>>> I have some other comments on this patch, but I'd like to understand what
>>> really happens here, since the changelog is not too verbose.
>>>
>>> Please explain the design here.
>>
>> Yeah, my bad.
>>
>> What I want to achieve is to revert the behaviour of the current
>> implementation to the one used in toi. Currently, when writing a page
>> to swap, hibernation calls snapshot_read_page(page) followed by
>> swap_write_page(page) in a loop. This is OK until one needs to do
>> anything with the page.
>>
>> When compression or encryption (or both) need to be done, this scenario
>> does not work well, because the page returned by snapshot_read_page,
>> after going through the crypto layers, is no longer PAGE_SIZE big.
>
> Actually, s2disk has the same problem, because it only receives entire pages
> from the kernel and it needs to compress/encrypt them before writing them to
> the storage (which is also written in blocks that are page size-aligned).
> Still, s2disk uses snapshot_[read|write]_next() through the functions in
> user.c, so I don't think we need to reorder things in there.
>
>> Probably the easiest solution is to revert it as noted above: a page is
>> taken from the snapshot (with the patches I have here, the snapshot layer
>> is only told to "store the next page", without returning a page to the
>> caller), fed through the crypto layers as needed, and finally given to a
>> chunk writer which assembles PAGE_SIZE blocks from the chunks. Whole
>> pages of compressed/encrypted data are then given to user-space or
>> in-kernel block I/O by hibernate_io_ops->write_page. The in-kernel
>> .write_page simply calls swap_write_page (which in turn calls
>> hib_bio_write_page while storing swap sector entries).
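
If I follow, the chunk writer amounts to roughly the following (the
struct and helper names here are mine for illustration, not the patch's
actual identifiers):

struct chunk_writer {
        char buf[PAGE_SIZE];
        size_t used;
        struct hibernate_io_ops *ops;
};

/* Accept an arbitrarily sized chunk; pass on whole pages only. */
static int chunk_write(struct chunk_writer *cw, const char *data, size_t len)
{
        while (len) {
                size_t n = min_t(size_t, len, PAGE_SIZE - cw->used);

                memcpy(cw->buf + cw->used, data, n);
                cw->used += n;
                data += n;
                len -= n;

                if (cw->used == PAGE_SIZE) {
                        int ret = cw->ops->write_page(cw->buf);

                        if (ret)
                                return ret;
                        cw->used = 0;
                }
        }
        return 0;
}

/* At the end of the stream, pad the last partial page and flush it. */
static int chunk_flush(struct chunk_writer *cw)
{
        if (!cw->used)
                return 0;
        memset(cw->buf + cw->used, 0, PAGE_SIZE - cw->used);
        cw->used = 0;
        return cw->ops->write_page(cw->buf);
}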
>
> That's fine if I understand correctly.
>
>> User .write_page, with buf as a parameter, does the following:
>> * to_do_buf = buf
>> * set WORK bit
>> * wake_up .read (below)
>> * wait_for WORK bit clear
>>
>> The writer in this case is a user process performing fops->read. .read does
>> the following:
>> * wait_for WORK bit
>> * copy_to_user to_do_buf
>> * clear WORK bit
>> * wake_up .write_page
>>
>> I need the barrier to ensure the "to_do_buf = buf" assignment is not
>> reordered with the "set WORK bit" in .write_page. Otherwise .read will
>> see the WORK bit set but have to_do_buf not updated. I certainly could
>> lock the two operations together, but (a) I see no point in doing so,
>> and (b) it makes the code worse (I would need to unlock and relock in
>> wait_event and unlock on all fail paths).
>>
>> Something very similar happens for fops->write, ergo hibernate_io_ops->read_page.

I'm still trying to find the time to wrap my head around the uswsusp 
interface and the locking bit (Easter is a busy time for me), so I'll not 
say much about that here.
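
That said, just to check that I'm reading the description correctly, the 
handshake seems to amount to roughly this (a minimal sketch; the wait 
queue and function names are my guesses, not the actual patch):

static void *to_do_buf;
static unsigned long snapshot_flags;    /* bit 0 == WORK */
static DECLARE_WAIT_QUEUE_HEAD(work_wait);
static DECLARE_WAIT_QUEUE_HEAD(ack_wait);

/* Kernel side: hibernate_io_ops->write_page() hands one page over. */
static int user_write_page(void *buf)
{
        to_do_buf = buf;
        smp_wmb();      /* publish to_do_buf before setting WORK */
        set_bit(0, &snapshot_flags);
        wake_up(&work_wait);
        return wait_event_interruptible(ack_wait,
                                        !test_bit(0, &snapshot_flags));
}

/* User side: fops->read() picks the page up and acks it. */
static ssize_t snapshot_read(struct file *filp, char __user *ubuf,
                             size_t count, loff_t *ppos)
{
        int ret;

        ret = wait_event_interruptible(work_wait,
                                       test_bit(0, &snapshot_flags));
        if (ret)
                return ret;
        smp_rmb();      /* read to_do_buf only after seeing WORK set */
        if (copy_to_user(ubuf, to_do_buf, PAGE_SIZE))
                return -EFAULT;
        clear_bit(0, &snapshot_flags);
        wake_up(&ack_wait);
        return PAGE_SIZE;
}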

> You can't open /dev/snapshot for reading and writing at the same time, because
> that wouldn't make any sense.
>
> /dev/snapshot represents a hibernation image which is presented to user space
> as a "file", because the user space is then able to use "regular" read() and
> write() on it.  However, the situation during hibernation is much different
> than during resume.  Namely, during hibernation the image is there in memory
> and it can't be modified.  It can only be read and written to the storage.  So,
> at this stage it only makes sense to open /dev/snapshot for reading.
>
> Now, during resume the image is not present in memory at all.  In fact, we
> don't even use any contiguous region to store the pages read from the on-disk
> image.  Instead, the pages are put directly into their destination page frames,
> if they are free.  Otherwise, they are put into free page frames and then
> copied back to the target ones right before returning control to the image
> kernel.  So during resume there's nothing that could be read from /dev/snapshot
> and you can only open it for writing.
>
> The idea is that snapshot_read_next() will only be used during hibernation
> for reading all of the image pages as a contiguous stream of data (the real image
> is not contiguous in general), because that's what the user space needs.  Then,
> the observation goes that the swap.c functions can use the same mechanism,
> although that is not really necessary in principle.
>
> Analogously, snapshot_write_next() is only useful during resume for writing
> all of the image pages as a contiguous stream of data, because that's needed
> by the user space and _can_ be used by the in-kernel code as well.

Okay. So the read or write terminology is from userspace's perspective, 
right? Perhaps that's part of what I've found confusing about the kernel 
side of things.
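
To spell that out for myself, the user-space view is roughly:

        /* Hibernation: the image is in memory, so read it out. */
        fd = open("/dev/snapshot", O_RDONLY);
        while (read(fd, page, PAGE_SIZE) == PAGE_SIZE)
                save_page(page);        /* compress/encrypt, write to swap */

        /* Resume: no image in memory yet, so feed the pages back in. */
        fd = open("/dev/snapshot", O_WRONLY);
        while (load_page(page) == PAGE_SIZE)
                write(fd, page, PAGE_SIZE);

where save_page()/load_page() stand in for s2disk's compress/encrypt and 
swap I/O, and the SNAPSHOT_* ioctl setup is omitted.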

> [Note that snapshot_[read|write]_next() belong to the memory management
> part of the hibernation subsystem and that's exactly why they are in
> snapshot.c.]
>
> Now, compression can happen in two places: while the image is created
> or after it has been created (current behavior).  In the latter case, the
> image pages need not be compressed in place; they may be compressed after
> being returned by snapshot_read_next(), in a temporary buffer (that's how
> s2disk does it).  So you can arrange things like this:
>
> create image
> repeat:
> - snapshot_read_next() ->  buffer
> - if buffer is full, compress it (possibly encrypt it) and write the result to
>    the storage
>
> This way you'd just avoid all of the complications and I fail to see any
> drawbacks.
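
If I'm reading that right, in code the first arrangement would be roughly 
the following (schematic only: the buffer management and 
compress_and_write() are placeholders, and the real snapshot_read_next() 
takes a struct snapshot_handle rather than a bare buffer):

        char *buf = compress_buffer;    /* preallocated, several pages */
        size_t fill = 0;
        int ret;

        while ((ret = snapshot_read_next_page(buf + fill)) > 0) {
                fill += PAGE_SIZE;
                if (fill == COMPRESS_CHUNK_SIZE) {
                        ret = compress_and_write(buf, fill);
                        if (ret)
                                break;
                        fill = 0;
                }
        }
        if (ret >= 0 && fill)
                ret = compress_and_write(buf, fill);    /* last, partial chunk */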
>
> Now, an attractive thing would be to compress data while creating the image
> and that may be done in the following way:
>
> have a buffer ready
> repeat:
> - copy image pages to the buffer (instead of copying them directly into the
>    image storage space)
> - if buffer is full, compress it and copy the result to the image storage
>    space, page by page.
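
And the second variant, equally schematically; for_each_copy_page() and 
store_compressed() are placeholder names for the snapshot code that 
currently memcpy()s each saveable page straight into the image storage 
space:

        size_t fill = 0;

        for_each_copy_page(src) {       /* each saveable page, in order */
                memcpy(buf + fill, src, PAGE_SIZE);
                fill += PAGE_SIZE;
                if (fill == COMPRESS_CHUNK_SIZE) {
                        /* compress into the image storage, page by page */
                        store_compressed(buf, fill);
                        fill = 0;
                }
        }
        if (fill)
                store_compressed(buf, fill);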

A few points that might be worth considering:

Wouldn't compressing the image while creating it, rather than while 
writing it, increase the overall time taken to hibernate (since the 
compression time could then no longer be overlapped with the time taken 
to write the image)?

Wouldn't it also increase the memory requirements? I realise that this 
may not be an issue when we're thinking of freeing memory anyway, but it 
would matter if we eventually do the TuxOnIce thing of trying not to free 
memory and writing a two-part image...

TuxOnIce has support for multithreaded compression, which is proving 
particularly useful as storage speeds up. At the moment, I have threads 
bound to secondary CPUs doing the compression, and the boot CPU's thread 
coalescing their output into pages that are then written to disk 
(reversed at resume time, with readahead to maximise throughput). Doing 
the compression at image creation time would rule out multithreaded 
compression and slow things down.
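
Schematically, the pipeline is something like this (a rough user-space 
analogue rather than actual TuxOnIce code; the queue helpers, chunk 
allocation, compress() and write_page() are placeholders, and the chunk 
ordering that the real code has to preserve is elided):

        struct chunk {
                size_t len;
                char data[2 * PAGE_SIZE];       /* worst-case compressed size */
        };

        /* One per secondary CPU: compress raw pages independently. */
        static void *compress_worker(void *unused)
        {
                struct chunk *in, *out;

                while ((in = queue_pop(&raw_queue))) {
                        out = alloc_chunk();
                        out->len = compress(in->data, in->len, out->data);
                        queue_push(&compressed_queue, out);
                        free_chunk(in);
                }
                return NULL;
        }

        /* Boot CPU: coalesce compressed chunks into whole pages. */
        static void *writer_thread(void *unused)
        {
                char page[PAGE_SIZE];
                size_t used = 0;
                struct chunk *c;

                while ((c = queue_pop(&compressed_queue))) {
                        size_t off = 0;

                        while (off < c->len) {
                                size_t n = c->len - off;

                                if (n > PAGE_SIZE - used)
                                        n = PAGE_SIZE - used;
                                memcpy(page + used, c->data + off, n);
                                used += n;
                                off += n;
                                if (used == PAGE_SIZE) {
                                        write_page(page);
                                        used = 0;
                                }
                        }
                        free_chunk(c);
                }
                return NULL;
        }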

Hope that helps.

Nigel
