Re: [LSF/MM TOPIC] Phasing out kernel thread freezing

"Rafael J. Wysocki" <rafael@xxxxxxxxxx> · Sun, 25 Feb 2018 10:45:26 +0100

On Sat, Feb 24, 2018 at 4:27 AM, Luis R. Rodriguez <mcgrof@xxxxxxxxxx> wrote:
> On Mon, Feb 05, 2018 at 09:28:37AM +0100, Rafael J. Wysocki wrote:
>> On Sun, Feb 4, 2018 at 11:41 PM, Bart Van Assche <Bart.VanAssche@xxxxxxx> wrote:
>> > On Wed, 2018-01-31 at 11:10 -0800, Darrick J. Wong wrote:
>> >> For a brief moment I pondered whether it would make sense to make
>> >> filesystems part of the device model so that the suspend code could work
>> >> out fs <-> bdev dependencies and know in which order to freeze
>> >> filesystems and quiesce devices, but every time I go digging into how
>> >> all those macros work I get confused and my eyes glaze over, so I don't
>> >> know if this is at all a good idea or just confused ramblings.
>> >
>> > If we have to go this way: shouldn't we introduce a new abstraction
>> > ("storage stack element" or similar) rather than making filesystems part of
>> > the device model?
>>
>> That would be my approach.
>>
>> Trying to "suspend" filesystems at the same time as I/O devices (and
>> all of that asynchronously) may be problematic for ordering reasons
>> and similar.
>
> Oh look, another ordering issue. And this is why I was not a fan of the
> device link API even though that is what we got merged. Moving on...
>
>> Moreover, during hibernation devices are suspended twice (and
>> resumed in between, of course) whereas filesystems only need to be
>> "suspended" once.
>
> From your point of view yes, but actually internally the VFS layer or
> filesystems themselves may end up re-using this mechanism later for
> other things like -- snapshotting. And if some folks have it the way
> they want it, we may need a dependency map between filesystems anyway
> for filesystem specific reasons.

That's orthogonal to what I said.

A dependency map between filesystems and other components of the block
layer (like md, dm, etc.) will be necessary going forward (at least if
suspending and resuming them is expected to be reliable), but that
doesn't change the hibernation-related requirements one whit.

Filesystems need to be suspended (or frozen or whatever terminology
ends up being used for that) *before* creating a hibernation image and
they *cannot* be resumed (unfrozen etc.) after that until the system is
off or the kernel decides that the hibernation has failed and rolls
back.  Changing any data or metadata in persistent storage after the
image has been created is potentially critically harmful, so (in the
hibernation case) all of the in-flight I/O that may end up being
written to persistent storage needs to be flushed before creating the
image.

However, *devices* are resumed after creating the image so that the
image itself can be written to persistent storage, and they are
suspended again after that, before putting the system to sleep (for
wakeup to work, among other things).

That's why suspend/resume of filesystems cannot be tied to
suspend/resume of devices.

Note that this isn't the case for system suspend/resume
(suspend-to-RAM or suspend-to-idle).
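
To make the hibernation ordering above concrete, here is a rough
userspace model in Python.  It is not kernel code: the phase names and
the check_sequence() helper are made up for this sketch, and the
failure/rollback path is not modelled.  It only shows that filesystems
are frozen once and stay frozen, while devices are suspended twice with
a resume in between.

```python
# Userspace model only; phase names are hypothetical, not kernel symbols.
HIBERNATE_PHASES = [
    "freeze_filesystems",  # flush in-flight I/O, then freeze
    "suspend_devices",     # first device suspend
    "create_image",
    "resume_devices",      # devices come back so the image can be written
    "write_image",
    "suspend_devices",     # second device suspend
    "power_off",
]


def check_sequence(phases):
    """Walk the phases, enforce the ordering rules, and count transitions."""
    fs_frozen = False
    devices_suspended = False
    image_created = False
    counts = {"fs_freeze": 0, "dev_suspend": 0}

    for phase in phases:
        if phase == "freeze_filesystems":
            fs_frozen = True
            counts["fs_freeze"] += 1
        elif phase == "thaw_filesystems":
            # Once the image exists, on-disk state must not change any more.
            if image_created:
                raise RuntimeError("filesystems thawed after image creation")
            fs_frozen = False
        elif phase == "suspend_devices":
            devices_suspended = True
            counts["dev_suspend"] += 1
        elif phase == "resume_devices":
            devices_suspended = False
        elif phase == "create_image":
            if not (fs_frozen and devices_suspended):
                raise RuntimeError("image created without quiescing first")
            image_created = True
        elif phase == "write_image":
            if devices_suspended:
                raise RuntimeError("cannot write the image to suspended devices")
        elif phase == "power_off":
            if not devices_suspended:
                raise RuntimeError("devices must be suspended before power-off")
        else:
            raise ValueError(f"unknown phase: {phase}")
    return counts


counts = check_sequence(HIBERNATE_PHASES)
```

If filesystem freezing were tied to device suspend, the
"resume_devices" step would also thaw the filesystems after the image
had been created, which is exactly what the model rejects.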

>> With that in mind, I would add a mechanism allowing filesystems (and
>> possibly other components of the storage stack) to register a set of
>> callbacks for suspend and resume and then invoking those callbacks in
>> a specific order.
>
> That's what I had done in my series, the issue here is order. Order in my
> series is simple but should work for starters, later however I suspect we'll
> need something more robust to help.

Quite likely.
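
For illustration only, one possible shape of "register callbacks, then
invoke them in a specific order" is sketched below as a userspace
Python model (Python 3.9+ for graphlib).  The StorageStack class, its
methods and the component names are all hypothetical and do not
correspond to any existing kernel interface or to the series discussed
here.  The ordering rule assumed in the sketch is that a component is
suspended before whatever it sits on top of and resumed after it.

```python
from graphlib import TopologicalSorter


class StorageStack:
    """Hypothetical registry of suspend/resume callbacks with dependencies."""

    def __init__(self):
        self._ops = {}   # name -> (suspend callback, resume callback)
        self._deps = {}  # name -> names of the components it sits on top of

    def register(self, name, suspend, resume, depends_on=()):
        self._ops[name] = (suspend, resume)
        self._deps[name] = set(depends_on)

    def _bottom_up(self):
        # static_order() yields every component after the ones it depends
        # on, so the lowest layer comes first.
        return list(TopologicalSorter(self._deps).static_order())

    def suspend_all(self):
        # Top of the stack (filesystems) first, underlying layers last.
        for name in reversed(self._bottom_up()):
            self._ops[name][0]()

    def resume_all(self):
        # Reverse of the suspend order: underlying layers first.
        for name in self._bottom_up():
            self._ops[name][1]()


# Example: a filesystem on dm on md on a disk (names are made up).
events = []
stack = StorageStack()
for name, deps in [("sda", []), ("md0", ["sda"]),
                   ("dm0", ["md0"]), ("xfs", ["dm0"])]:
    stack.register(
        name,
        suspend=lambda n=name: events.append(("suspend", n)),
        resume=lambda n=name: events.append(("resume", n)),
        depends_on=deps,
    )

stack.suspend_all()
stack.resume_all()
```

A simple fixed order can be viewed as the special case of this in which
the dependency map is a single chain.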

Thanks,
Rafael