Re: the need for a discoverable sub-volumes specification

On Fri, Nov 19, 2021 at 4:17 AM Lennart Poettering
<lennart@xxxxxxxxxxxxxx> wrote:
>
> On Do, 18.11.21 14:51, Chris Murphy (lists@xxxxxxxxxxxxxxxxx) wrote:
>
> > How to do swapfiles?
>
> Is this really a concept that deserves too much attention?

*shrug* Only insofar as I like order, and like the idea of agreeing on
where things belong if they're going to appear somewhere.

> I mean, I
> have the suspicion that half the benefit of swap space is that it can
> act as backing store for hibernation.

Yes, and that's a terrible conflation. The swapfile/device is for
anonymous pages. Hibernation images are not anon pages, and they even
have special rules, such as having to be contained in contiguous
physical device blocks. It may turn out that 'swsusp' (Swap Suspend) in
the kernel shouldn't be deprecated, and instead focus future effort on
'uswsusp'. But discussions around signed and authenticated hibernation
images for UEFI Secure Boot and kernel lockdown compatibility have
all been around the kernel implementation.

https://www.kernel.org/doc/Documentation/power/swsusp.rst
https://www.kernel.org/doc/Documentation/power/userland-swsusp.rst


> But swap files are icky for that
> since that means the resume code has to mount the fs first, but given
> the fs is dirty during the hibernation state this is highly problematic.

It's sufficiently complicated and non-fail-safe (it's fail danger)
that it's broken. On btrfs, it's more tedious but less broken because
you must use both

resume=UUID=$uuid resume_offset=$physicaloffsethibernationimage

In effect the kernel does not need to mount the btrfs file system ro
at all; it gets the hint for the physical location of the hibernation
image from the kernel boot parameters. Other file systems support
discovery of the physical offset once the file system is mounted ro.
On Btrfs you can see the swapfile as having a punch-through mechanism.
It's a reservation of blocks, and page-outs happen directly to that
reservation of blocks, not via the file system itself. This is why
there are all these limitations: balance doesn't touch block groups
containing any swapfile blocks, you can't do any kind of multiple
device stuff, you can't snapshot/reflink the swapfile, etc.
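
To make the two boot parameters above concrete, here is a minimal
sketch of how the resume_offset value relates to the physical location
of the image. It assumes you already know the physical byte offset of
the swapfile's first block on the device (obtaining that on Btrfs is
outside this sketch), and relies on the kernel documenting
resume_offset in PAGE_SIZE units. The function names are made up for
illustration.

```python
import os
from typing import Optional


def resume_offset_pages(physical_byte_offset: int, page_size: int) -> int:
    """Convert a physical byte offset on the device into PAGE_SIZE units."""
    if physical_byte_offset % page_size != 0:
        raise ValueError("physical offset is not page-aligned")
    return physical_byte_offset // page_size


def resume_cmdline(fs_uuid: str, physical_byte_offset: int,
                   page_size: Optional[int] = None) -> str:
    """Build the resume= and resume_offset= kernel parameters.

    page_size defaults to the running system's page size, which is an
    assumption: it must match the page size of the kernel that resumes.
    """
    if page_size is None:
        page_size = os.sysconf("SC_PAGE_SIZE")
    offset = resume_offset_pages(physical_byte_offset, page_size)
    return f"resume=UUID={fs_uuid} resume_offset={offset}"
```

For example, resume_cmdline("abcd", 8192, 4096) yields
"resume=UUID=abcd resume_offset=2".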

Which is why I'm in favor of just ceding this entire territory over to
systemd to manage correctly. But as a prerequisite, the hibernation
image should be separate from the swapfile, and it should have a
metadata format so we can pair file system state to hibernation image
state. That way we can be sure we aren't running into catastrophic
nonsense like this, right at the top of
https://www.kernel.org/doc/Documentation/power/swsusp.rst

   **BIG FAT WARNING**

   If you touch anything on disk between suspend and resume...
                                ...kiss your data goodbye.

   If you do resume from initrd after your filesystems are mounted...
                                ...bye bye root partition.

Horrible.

> Hence, I have the suspicion that if you do swap you should probably do
> swap partitions, not swap files, because it can cover all usecase:
> paging *and* hibernation.

I agree only insofar as it's the most reliable thing we have right
now. That doesn't make it an efficient or safe design: you can still
have problems if you rw mount a file system and then resume from a
hibernation image. The kernel has no concept of matching a file system
state to that of a hibernation image, so it has no way to invalidate a
stale hibernation image and thus avoid subsequent corruption.
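
As a purely hypothetical sketch of what such matching could look like
(nothing like this exists in the kernel today, and every name below is
invented for illustration): the image would carry a stamp per file
system that was mounted rw at hibernation time, and resume would be
refused if any stamp no longer matches. The "generation" field is an
assumption, standing in for any counter the file system bumps on every
committed change.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class FsStamp:
    """Hypothetical record of one file system's state at hibernation time."""
    fs_uuid: str
    generation: int  # assumed: changes whenever the fs is modified


def image_is_valid(stamps: Iterable[FsStamp],
                   current_generations: Dict[str, int]) -> bool:
    """Return True only if every stamped file system is unchanged.

    current_generations maps fs UUID -> generation as read at resume time.
    A missing or different generation means the image is stale.
    """
    return all(
        current_generations.get(stamp.fs_uuid) == stamp.generation
        for stamp in stamps
    )
```

The point is only that a stale image gets thrown away instead of being
resumed on top of a file system that has moved on.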

> > Currently I'm creating a "swap" subvolume in the top-level of the file
> > system and /etc/fstab looks like this
> >
> > UUID=$FSUUID    /var/swap               btrfs   noatime,subvol=swap 0 0
> > /var/swap/swapfile1 none swap defaults 0 0
> >
> > This seems to work reliably after hundreds of boots.
> >
> > a. Is this naming convention for the subvolume adequate? Seems like it
> > can just be "swap" because the GPT method is just a single partition
> > type GUID that's shared by multiboot Linux setups, i.e. not arch or
> > distro specific
>
> I'd still put it one level down, and mark it with some non-typical
> character so that it is less likely to clash with anything else.

I'm not sure I understand "one level down". The "swap" subvolume would
be in the top-level of the Btrfs file system, just like Fedora's
existing "root" and "home" subvolumes are in the top level.

>
> > b. Is the mount point, /var/swap, OK?
>
> I see no reason why not.

OK super.


>
> > c. What should the additional naming convention be for the swapfile
> > itself so swapon happens automatically?
>
> To me it appears these things should be distinct: if automatic
> activation of swap files is desirable, then there should probably be a
> systemd generator that finds all suitable files in /var/swap/ and
> generates .swap units for them. This would then work with any kind of
> setup, i.e. independently of the btrfs auto-discovery stuff. The other
> thing would be the btrfs auto-discovery to then actually mount
> something there automatically.

I think it's desirable only because even Android users don't have to
F around with swap management, let alone users on Windows and macOS. I
don't want to change the current interfaces, so users can keep doing
what they want, but I'd like to establish best practice and automate
it (i.e. take it completely away from the user having to think about
it, let alone set it up or clean up after it).
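
A rough sketch of the generator idea, to show how little is involved.
Assumptions: "suitable" here just means a regular file directly inside
the directory (the thread doesn't define it), and escape_path() is my
approximation of what `systemd-escape --path` does, not the real
implementation. A real generator would also have to write the units
into the directory systemd passes to it and hook them into a target,
which is left out.

```python
import os
import stat
from typing import Dict

_SAFE = set(
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789:_."
)


def escape_path(path: str) -> str:
    """Approximate systemd path escaping for unit names."""
    trimmed = os.path.normpath(path).strip("/")
    if not trimmed:
        return "-"  # the root directory escapes to a single dash
    out = []
    for i, byte in enumerate(trimmed.encode()):
        ch = chr(byte)
        if ch == "/":
            out.append("-")
        elif ch in _SAFE and not (i == 0 and ch == "."):
            out.append(ch)
        else:
            out.append(f"\\x{byte:02x}")  # e.g. "-" becomes \x2d
    return "".join(out)


def swap_units(directory: str) -> Dict[str, str]:
    """Map unit file name -> unit text for each regular file in directory."""
    units = {}
    for entry in sorted(os.listdir(directory)):
        path = os.path.join(directory, entry)
        if not stat.S_ISREG(os.lstat(path).st_mode):
            continue  # skip subdirectories, symlinks, devices, ...
        # Swap units must be named after the path they activate.
        units[escape_path(path) + ".swap"] = f"[Swap]\nWhat={path}\n"
    return units
```

With a single file /var/swap/swapfile1 this would produce one unit,
var-swap-swapfile1.swap, whose What= points at the file.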


>
> > Also, instead of /@auto/ I'm wondering if we could have
> > /x-systemd.auto/ ? This makes it more clearly systemd's namespace, and
> > while I'm a big fan of the @ symbol for typographic history reasons,
> > it's being used in the subvolume/snapshot regimes rather haphazardly
> > for different purposes which might be confusing? e.g. Timeshift
> > expects subvolumes it manages to be prefixed with @. Meanwhile SUSE
> > uses @ for its (visible) root subvolume in which everything else goes.
> > And still ZFS uses @ for their (read-only) snapshots.
>
> I try to keep the "systemd" name out of entirely generic specs, since
> there are some people who have an issue with that. i.e. this way we
> tricked even Devuan to adopt /etc/os-release and the /run/ hierarchy,
> since they probably aren't even aware that these are systemd things.
>
> Other chars could be used too: /+auto/ sounds OK to me too. or
> /_auto/, or /=auto/ or so.

OK fine.


-- 
Chris Murphy


