On Tue, Jul 27, 2021 at 5:21 AM David Hildenbrand <david@xxxxxxxxxx> wrote: > > On 27.07.21 11:48, David Hildenbrand wrote: > > On 27.07.21 02:12, Evan Green wrote: > >> Add a new SWAP_FLAG_HIBERNATE_ONLY that adds a swap region but refuses > >> to allow generic swapping to it. This region can still be wired up for > >> use in suspend-to-disk activities, but will never have regular pages > >> swapped to it. This flag will be passed in by utilities like swapon(8), > >> usage would probably look something like: swapon -o hibernate /dev/sda2. > >> > >> Currently it's not possible to enable hibernation without also enabling > >> generic swap for a given area. One semi-workaround for this is to delay > >> the call to swapon() until just before attempting to hibernate, and then > >> call swapoff() just after hibernate completes. This is somewhat kludgy, > >> and also doesn't really work to keep swap out of the hibernate region. > >> When hibernate begins, it starts by allocating a large chunk of memory > >> for itself. This often ends up forcing a lot of data out into swap. By > >> this time the hibernate region is eligible for generic swap, so swap > >> ends up leaking into the hibernate region even with the workaround. > >> > >> There are a few reasons why usermode might want to be able to > >> exclusively steer swap and hibernate. One reason relates to SSD wearing. > >> Hibernate's endurance and speed requirements are different from swap. > >> It may for instance be advantageous to keep hibernate in primary > >> storage, but put swap in an SLC namespace. These namespaces are faster > >> and have better endurance, but cost 3-4x in terms of capacity. > >> Exclusively steering hibernate and swap enables system designers to > >> accurately partition their storage without either wearing out their > >> primary storage, or overprovisioning their fast swap area. > >> > >> Another reason to allow exclusive steering has to do with security. > >> The requirements for designing systems with resilience against > >> offline attacks are different between swap and hibernate. Swap > >> effectively requires a dictionary of hashes, as pages can be added and > >> removed arbitrarily, whereas hibernate only needs a single hash for the > >> entire image. If you've set up block-level integrity for swap and > >> image-level integrity for hibernate, then allowing swap blocks to > >> possibly leak out to the hibernate region is problematic, since it > >> creates swap pages not protected by any integrity. > >> > >> Swap regions with SWAP_FLAG_HIBERNATE_ONLY set will not appear in > >> /proc/meminfo under SwapTotal and SwapFree, since they are not usable as > >> general swap. These regions do still appear in /proc/swaps. > > > > Right, and they also don't account towards the memory overcommit > > calculations. > > > > Thanks for extending the patch description! No problem, thanks for all the brainwaves directed at this. > > > > [...] > > > >> + if (swap_flags & SWAP_FLAG_HIBERNATE_ONLY) { > >> + if (IS_ENABLED(CONFIG_HIBERNATION)) { > >> + if (swap_flags & ~SWAP_HIBERNATE_ONLY_VALID_FLAGS) > >> + return -EINVAL; > >> + > >> + } else { > >> + return -EINVAL; > >> + } > >> + } > > > > We could do short > > > > if ((swap_flags & SWAP_FLAG_HIBERNATE_ONLY) && > > (!IS_ENABLED(CONFIG_HIBERNATION) || > > (swap_flags & ~SWAP_HIBERNATE_ONLY_VALID_FLAGS))) > > return -EINVAL; > > > > or > > > > if (swap_flags & SWAP_FLAG_HIBERNATE_ONLY)) > > if (!IS_ENABLED(CONFIG_HIBERNATION) || > > (swap_flags & ~SWAP_HIBERNATE_ONLY_VALID_FLAGS)) > > return -EINVAL; > > > >> + > >> if (!capable(CAP_SYS_ADMIN)) > >> return -EPERM; > >> > >> @@ -3335,16 +3366,20 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags) > >> if (swap_flags & SWAP_FLAG_PREFER) > >> prio = > >> (swap_flags & SWAP_FLAG_PRIO_MASK) >> SWAP_FLAG_PRIO_SHIFT; > >> + > >> + if (swap_flags & SWAP_FLAG_HIBERNATE_ONLY) > >> + p->flags |= SWP_HIBERNATE_ONLY; > >> enable_swap_info(p, prio, swap_map, cluster_info, frontswap_map); > >> > >> - pr_info("Adding %uk swap on %s. Priority:%d extents:%d across:%lluk %s%s%s%s%s\n", > >> + pr_info("Adding %uk swap on %s. Priority:%d extents:%d across:%lluk %s%s%s%s%s%s\n", > >> p->pages<<(PAGE_SHIFT-10), name->name, p->prio, > >> nr_extents, (unsigned long long)span<<(PAGE_SHIFT-10), > >> (p->flags & SWP_SOLIDSTATE) ? "SS" : "", > >> (p->flags & SWP_DISCARDABLE) ? "D" : "", > >> (p->flags & SWP_AREA_DISCARD) ? "s" : "", > >> (p->flags & SWP_PAGE_DISCARD) ? "c" : "", > >> - (frontswap_map) ? "FS" : ""); > >> + (frontswap_map) ? "FS" : "", > >> + (p->flags & SWP_HIBERNATE_ONLY) ? "H" : ""); > >> > >> mutex_unlock(&swapon_mutex); > >> atomic_inc(&proc_poll_event); > >> > > > > Looks like the cleanest alternative to me, as long as we don't want to > > invent new interfaces. > > > > Acked-by: David Hildenbrand <david@xxxxxxxxxx> > > > > Pavel just mentioned uswsusp, and I wonder if it would be a possible > alternative to this patch. I think you're right that it would be possible to isolate the hibernate image with uswsusp if you avoid using the SNAPSHOT_*SWAP* ioctls. But I'd expect performance to suffer noticeably, since now every page is making a round trip out to usermode and back. I'd still very much use the HIBERNATE_ONLY flag if it were accepted, I think there's value to it. -Evan