Below are my notes about this session at LSF/MM 2018.

kthread freezer woes
====================

Icebreakers:

  * When was the kthread primitive introduced and who introduced it?
    What was the original purpose of the kthreads API?
  * What was the first try_to_freeze() kernel user?
  * Do *you* run into issues with the kthread freezer?
  * We're addressing such woes with filesystems first

Motivation
----------

The kernel kthread freezer API is rather loose, and even after many years of
evolution in *how* to properly freeze kthreads, issues still creep up.
One goal suggested long ago by Jiri Kosina was to *not* have to call
try_to_freeze() on kthreads all over the kernel and instead replace it with
more appropriate infrastructure per subsystem.

Long term we want to address this throughout the kernel, however, we'll start
off focusing on filesystems first. Every other subsystem will have to address
this on its own, but perhaps they can get some ideas of what to do from the
filesystems work.

Example of a modern kthread freezer issue
-----------------------------------------

A regression was detected on XFS with suspend: a hibernation stress test
failed after 48 rounds of cycling.

> Reverting 18f1df4e00ce ("xfs: Make xfsaild freezeable again")
> would break the proper form of the kthread for it to be freezable.
> This "form" is not defined formally, and sadly its just a form
> learned throughout years over different kthreads in the kernel.

Dave Chinner later noted:

"Suspend on journalling filesystems has been broken for a long time (i.e since
I first realised the scope of the problem back in 2005)"

"IOWs, suspend of filesystems has been broken forever, and we've been slapping
bandaids on it in XFS forever."
https://lkml.kernel.org/r/20171114212538.GC4094@dastard

Components to understand the issue
----------------------------------

  * refrigerator()
  * try_to_freeze()
  * kthread_freezable_should_stop()
  * kthread_run()

The core of the issue: if a freezable kernel thread fails to call
try_to_freeze() after the freezer has initiated a freezing operation, the
freezing of tasks will fail and the entire hibernation or suspend operation
will be cancelled.

refrigerator()
--------------

Commit 542f96a52 ("[PATCH] suspend-to-{RAM,disk}") by Pavel Machek
<pavel@xxxxxx> on v2.5.18 added suspend-to-RAM/disk support, and as part of
it, it added the refrigerator(). It carried heavy warnings for a good reason:

/Documentation/swsusp.txt

	BIG FAT WARNING

	If you have unsupported (*) devices using DMA...
				...say goodbye to your data.

	If you touch anything on disk between suspend and resume...
				...kiss your data goodbye.

	If your disk driver does not support suspend... (IDE does)
				...you'd better find out how to get along
				   without your data.

	(*) pm interface support is needed to make it safe.

# refrigerator() in a nutshell

	void refrigerator(unsigned long flag)
	{
		...
		while (current->flags & PF_FROZEN)
			schedule();
		...
	}

kthread_run()
-------------

When was the kthread primitive introduced and who introduced it?

Rusty Russell <rusty@xxxxxxxxxxxxxxx> via linux-history commit 933ba10234f68
("[PATCH] kthread primitive") on the v2.6.4 release.

The original motivation was to enable CPU hotplug. Managing tasks properly in
light of CPU hotplug is hard, and the kthread primitive helps with this.

Uses kernel_thread() behind the scenes -- the kernel equivalent to a fork()

Don't freeze kthreads on try_to_freeze_tasks()
----------------------------------------------

We don't want kernel threads to be frozen in unexpected places, so we allow
them to block freeze_processes(), or to set PF_NOFREEZE if needed.
KTW_FREEZABLE exists to enable kthread work freezing but no users exist.
	static void create_kthread(struct kthread_create_info *create)
	{
		...
		/* We want our own signal handler (we take no signals by default). */
		pid = kernel_thread(kthread, create, CLONE_FS | CLONE_FILES | SIGCHLD);
		...
	}

Also:

	bool freeze_task(struct task_struct *p)
	{
		...
		if (!(p->flags & PF_KTHREAD))
			fake_signal_wake_up(p);
		else
			wake_up_state(p, TASK_INTERRUPTIBLE);
		...
	}

Since kthreads want their own signal handler we won't wake them up with the
above signal, and we also have that extra PF_KTHREAD check -- just in case.
kthreads need to have control over how they are frozen.

freezer_do_not_count() - don't freeze current
---------------------------------------------

Keep in mind it's not just kthreads: they are not the only tasks which avoid
the general freeze_processes(). freezer_do_not_count() sets PF_FREEZER_SKIP
and these processes will also skip the general freeze. Currently set on:

  * do_fork() on wait_for_vfork_done()
  * do_coredump() on coredump_wait()
  * binder_thread_read() on binder_wait_for_work()

First try_to_freeze()
---------------------

Commit 54820fb26 ("[PATCH] swsusp: try_to_freeze() to make freezing hooks
nicer") by Pavel Machek <pavel@xxxxxx>, merged as of v2.6.11, added the
try_to_freeze() API for the kernel scheduler.

# First try_to_freeze() kernel user

What was the first try_to_freeze() user? It was on the x86 Intel IO-APIC
kernel IRQ balancer:

	static int __init balanced_irq_init(void)
	{
		...
		printk(KERN_INFO "Starting balanced_irq\n");
		if (kernel_thread(balanced_irq, NULL, CLONE_KERNEL) >= 0)
			return 0;
		else
			printk(KERN_ERR "balanced_irq_init: failed to spawn balanced_irq");
		...
	}

Modified later to use the kthread API, so after commit f26d6a2bbcf38
("[PATCH] i386: convert to the kthread API") merged on v2.6.22 this looked
like:

	static int __init balanced_irq_init(void)
	{
		...
		printk(KERN_INFO "Starting balanced_irq\n");
		if (!IS_ERR(kthread_run(balanced_irq, NULL, "kirqd")))
			return 0;
		printk(KERN_ERR "balanced_irq_init: failed to spawn balanced_irq");
		...
	}

Anyway, this IRQ balancer was later removed via commit 8b8e8c1bf7275e
("x86: remove irqbalance in kernel for 32 bit") merged on v2.6.28, since the
userspace irqbalance daemon made it obsolete.

Early try_to_freeze() users
---------------------------

Code all around the kernel that open-coded similar semantics was simplified
to use try_to_freeze() via commit f9adcf4ea1599 ("[PATCH] swsusp: refrigerator
cleanups") added on v2.6.11 by Pavel Machek <pavel@xxxxxx>.

What are the try_to_freeze() early users? Determined by using linux-history
with:

	git checkout -b linux-2.6.11 v2.6.11
	git grep try_to_freeze

  * architecture do_signal() calls -- for example on x86:

diff --git a/arch/x86_64/kernel/signal.c b/arch/x86_64/kernel/signal.c
index ad3b240cdd9c..1cb237ad1fcc 100644
--- a/arch/x86_64/kernel/signal.c
+++ b/arch/x86_64/kernel/signal.c
@@ -24,7 +24,6 @@
 #include <linux/stddef.h>
 #include <linux/personality.h>
 #include <linux/compiler.h>
-#include <linux/suspend.h>
 #include <asm/ucontext.h>
 #include <asm/uaccess.h>
 #include <asm/i387.h>
@@ -423,10 +422,8 @@ int do_signal(struct pt_regs *regs, sigset_t *oldset)
 		return 1;
 	}
 
-	if (current->flags & PF_FREEZE) {
-		refrigerator(0);
+	if (try_to_freeze(0))
 		goto no_signal;
-	}
 
 	if (!oldset)
 		oldset = &current->blocked;

    Commit fc558a7496bf ("[PATCH] swsusp: finally solve mysqld problem") by
    Rafael J. Wysocki <rjw@xxxxxxx> on v2.6.17 moved try_to_freeze() to
    kernel/signal.c on get_signal_to_deliver(), now get_signal().

  * pcmcia on the pccardd kthread -- still present today
  * USB core hub_thread()
  * filesystems
  * mm pdflush() context -- the original writeback daemon, "dirty page" flush

    pdflush was added on v2.5.8. Prior to this we had bdflush(), and that was
    kicked off if try_to_free_buffers() was called on a page and it was
    determined not all the buffers of a page could be freed.
    Simple daemon to provide a dynamic response to dirty buffers. Its job was
    to write back a limited number of buffers to disk and go back to sleep
    again.
  * sunrpc svc_recv()

kthread_freezable_should_stop()
-------------------------------

Commit 8a32c441c1609 ("freezer: implement and use
kthread_freezable_should_stop()") was added in v3.3 by Tejun Heo. This commit
claims you should *not* use try_to_freeze(); instead you *should* use
kthread_freezable_should_stop() if your kthread is freezable....

But yet... only 4 kernel users! The only filesystem using it is NFS on
nfs4_callback_svc(). What gives?

Getting the kthread freezer right
---------------------------------

At the 2015 Kernel Summit in South Korea, Jiri Kosina suggested we should
phase it out long term. The semantics are loose and experience shows that it
is difficult to get right. Best to phase it out if possible.

What to do - address filesystems first
--------------------------------------

Phasing out the kthread freezer is hard so let's divide and conquer. Let's
first address this on filesystems properly.

Simple: freeze_fs() for filesystems that implement the callback

Patches are ongoing, so expect a new series soon for filesystems. But what's
left after this?

Filesystems with freeze_fs() and those using try_to_freeze()
------------------------------------------------------------

The current patches being worked on address filesystems which implement
freeze_fs(). Let's review these.

The filesystems which implement freeze_fs():

  * xfs
  * reiserfs
  * nilfs2
  * jfs
  * f2fs
  * ext4
  * ext2
  * btrfs

Of these, the following have freezer helpers, which can then be removed after
the kernel automatically calls freeze_fs() for us on suspend:

  * xfs
  * nilfs2
  * jfs
  * f2fs
  * ext4

What about other filesystems?

Order considerations
--------------------

Current simple solution:

	iterate_supers_reverse_excl() on freeze
	iterate_supers() for thaw

We acknowledged at LSF/MM that this order is not sufficient and definitely not
perfect.
We would need a proper infrastructure to have a Directed Acyclic Graph (DAG)
we can rely on. We do not have that today. We discussed possibly adding
infrastructure for this. We determined that this work is long term -- we're
fine to move on with life with the above simple implementation for now and
acknowledge the imperfect solution should work in most cases.

Al Viro gave an example where the order is not respected today. You can use a
loopback device to mount a file image on a filesystem A, and at a later point
in time dynamically change the backing device to another file present on
filesystem B using the loopback ioctl LOOP_CHANGE_FD -- loop_change_fd().

	/*
	 * loop_change_fd switched the backing store of a loopback device to
	 * a new file. This is useful for operating system installers to free up
	 * the original file and in High Availability environments to switch to
	 * an alternative location for the content in case of server meltdown.
	 * This can only work if the loop device is used read-only, and if the
	 * new backing store is the same size and type as the old backing store.
	 */
	static int loop_change_fd(struct loop_device *lo, struct block_device *bdev,
				  unsigned int arg)
	{
		...
	}

The ioctl therefore currently allows the effective order of a superblock to
change dynamically, which breaks the assumption that the above order is
correct. It was mentioned that some distribution installers (Fedora?) are
believed to use this to complete installation and hand control to the
installed system without rebooting. If true, then suspend order *could* break
in this case.

Should a flag be added to disable use of freeze_fs() if such ioctl is used?

Al didn't seem too happy with the idea of working around this by moving the
superblock around to the right order after the ioctl work.

One of the issues with violations of this ordering is that it's not possible
to detect them: if you could detect them, you might also be able to address
the ordering. It's a catch-22 situation.
However, if there *are* known cases where such violations do happen, one
option is to skip the suspend framework for filesystems from then on.

Long term we want proper infrastructure to address ordering, which can address
this example corner case.

Another possible use case for a DAG on the superblocks is emergency remount.

Possible issues
---------------

David Howells noted future possible races with suspend/freeze and automount.
Should automount be skipped during the general suspend/resume cycle?

NFS
---

For NFS, Jeff Layton has suggested having freeze_fs() make the RPC engine
"park" newly issued RPCs for that fs' client onto a rpc_wait_queue. For any
RPC that has already been sent, however, we need to wait for a reply.
unfreeze_fs() can then just have the engine stop parking RPCs and wake up the
waitq.

He however points out that if we're interested in making the cgroup freezer
also work, then we may need to do a bit more work to ensure that we don't end
up with frozen tasks squatting on VFS locks.

Is it true that the cgroup freezer wants tasks not to hold VFS locks prior to
freezing?

Dave Chinner notes that freezing a filesystem pretty much guarantees the
opposite - that tasks *will freeze when holding VFS locks* - the cgroup
freezer is broken by design *if* it requires tasks to be frozen without
holding any VFS/filesystem lock context, and as such we *should* be able to
ignore it.

At LSF/MM it was acknowledged that the cgroup freezer may be broken, that the
folks who would care are aware, and that a long term fix is needed.

Why the device model cannot be used for filesystems
---------------------------------------------------

Darrick suggested that since the filesystem ordering is a layering problem we
should consider applying the device model to filesystems / block layer. It
turns out suspend of filesystems cannot be tied to suspend of devices.

# Do not use device model - breaks hibernation

Hibernation is one example of an issue. Filesystem ordering is different than
device ordering on a system.
During hibernation devices are suspended/resumed twice. When you hibernate you
need to freeze the filesystem, create the image, resume all devices to then be
able to store the image, and then suspend devices again.

Device ordering is not representative of the ordering in which you mount
filesystems. That has its own order, and suspending by bus order can be
different than the mount order. Suspending filesystems in the incorrect mount
order may leave one filesystem unable to properly flush all pending IO.

The ordering is not something which is only needed for suspend and hibernation
though. We'll need proper ordering for snapshotting, and if we later grow the
idea of snapshotting on files for the notion of subvolumes, ordering may play
an important role here as well?

The current order strategy proposed is simply to iterate in reverse on all
superblocks; this should *in most cases* reflect the inverse of mount order.
The superblock on top will be the last mounted filesystem. Filesystems on
loopback freeze before the lower filesystem that hosts the loopback image.

What about kernel filesystems which do not implement freeze_fs()?
-----------------------------------------------------------------

Issuing a generic sync for filesystems which implement freeze_super() may
suffice.

cgroup notes
------------

Recall from Documentation/cgroup-v1/freezer-subsystem.txt:

	$ echo $$
	16644
	$ bash
	$ echo $$
	16690

	From a second, unrelated bash shell:
	$ kill -SIGSTOP 16690
	$ kill -SIGCONT 16690

	<at this point 16690 exits and causes 16644 to exit too>

This happens because bash can observe both signals and choose how it responds
to them.

FUSE drivers
------------

The implementation only makes use of filesystems which implement the
freeze_fs() callback; it doesn't touch other filesystems. It should therefore
technically not regress FUSE filesystems. But there are still problems which
need to be addressed on FUSE.
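
As a rough illustration of the two points above (freeze in reverse mount
order, and leave alone any filesystem without a freeze_fs() callback), here
is a small userspace model. It is only a sketch: struct model_sb,
model_freeze_all() and model_thaw_all() are hypothetical names made up for
this example, they are not kernel APIs, and a plain array stands in for the
kernel's list of superblocks.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for a superblock; not the kernel type. */
struct model_sb {
	const char *name;
	int has_freeze_fs;	/* models "the filesystem implements freeze_fs()" */
	int frozen;
};

/*
 * Freeze pass: walk from the most recently mounted entry back to the
 * oldest one, skipping entries without a freeze callback. The names of
 * the frozen entries are recorded in 'order'; the count is returned.
 */
size_t model_freeze_all(struct model_sb *sbs, size_t n,
			const char **order, size_t order_len)
{
	size_t count = 0;

	for (size_t i = n; i-- > 0; ) {
		if (!sbs[i].has_freeze_fs)
			continue;	/* e.g. a FUSE mount: left untouched */
		assert(count < order_len);
		sbs[i].frozen = 1;
		order[count++] = sbs[i].name;
	}
	return count;
}

/* Thaw pass: walk in mount order and undo what the freeze pass did. */
size_t model_thaw_all(struct model_sb *sbs, size_t n)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++) {
		if (!sbs[i].frozen)
			continue;
		sbs[i].frozen = 0;
		count++;
	}
	return count;
}
```

With an array listing, in mount order, an XFS root, a FUSE mount and an ext4
image on loopback, the freeze pass in this model records the ext4 entry
first and the XFS entry second, and never touches the FUSE entry. The model
says nothing about dirty data still held by the FUSE daemon, which is the
open problem discussed next.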

# New mount API with userspace freeze handler

One idea is that the kernel can notify userspace about the freeze and let it
deal with what it has to. There is a new mount call API being evolved; one
idea is to provide a userspace freeze handler call with this new mount API.

It was suggested that perhaps some of these things are ideas which can be
worked on under the Project Springfield umbrella [0].

[0] https://springfield-project.github.io/

# FUSE drivers and dirty data

https://lists.linux-foundation.org/pipermail/linux-pm/2007-May/012309.html

Paul Mackerras:

"What happens if there is an encfs filesystem mounted, and we freeze all
userspace processes and then do a sys_sync()? Does the system wait forever for
the encfs filesystem to write out its dirty data?"

Yes. Technically sync() and the freezer should in theory address most
considerations?

# An example broken approach

One could iterate over XFS or FUSE filesystems and do something like the
script below, but *this* script is fragile and should be considered broken.
Consider the cgroup notes above; it is also horribly slow. It uses the
filesystem freeze ioctl but also sends SIGSTOP/SIGCONT to processes. For FUSE
it could use whatever freeze op FUSE drivers come up with. But again, it is
horribly broken, and that is why this needs proper infrastructure.

#!/bin/bash

set -e

XFS_FREEZE="/usr/sbin/xfs_freeze"
PROC_MOUNTS="/proc/mounts"
SUSPEND_SIGNAL="SIGSTOP"

error_quit()
{
	echo "$1" >&2
	exit 1
}

check-system()
{
	[ -r "${PROC_MOUNTS}" ] || error_quit "ERROR: cannot find or read ${PROC_MOUNTS}"
	[ -x "${XFS_FREEZE}" ] || error_quit "ERROR: cannot find or execute ${XFS_FREEZE}"
}

run-fs-freeze()
{
	local i FSTYPE MNT ROOTDEV ARGS
	while read ROOTDEV MNT FSTYPE ARGS; do
		[ "$ROOTDEV" = "rootfs" ] && continue
		[ "$MNT" = "/" ] && continue
		case $FSTYPE in
		xfs)
			echo " Trying to freeze userspace processes on '$FSTYPE' mounted on ${MNT}... 
"
			for i in $(lsof +D $MNT -t 2>/dev/null); do
				kill -$SUSPEND_SIGNAL $i
			done
			echo -n " Trying to freeze fstype '$FSTYPE' mounted on ${MNT}... "
			$XFS_FREEZE -f $MNT
			echo "OK!"
			;;
		*)
			;;
		esac
	done < $PROC_MOUNTS
}

run-fs-unfreeze()
{
	local i FSTYPE MNT ROOTDEV ARGS
	while read ROOTDEV MNT FSTYPE ARGS; do
		[ "$ROOTDEV" = "rootfs" ] && continue
		[ "$MNT" = "/" ] && continue
		case $FSTYPE in
		xfs)
			echo " Trying to unfreeze userspace processes on '$FSTYPE' mounted on ${MNT}... "
			for i in $(lsof +D $MNT -t 2>/dev/null); do
				kill -SIGCONT $i
			done
			echo -n " Trying to unfreeze fstype '$FSTYPE' mounted on ${MNT}... "
			$XFS_FREEZE -u $MNT
			echo "OK!"
			;;
		*)
			;;
		esac
	done < $PROC_MOUNTS
}

check-system

# $1 is "pre" or "post"; $2 names the operation (for example "suspend").
echo "INFO: running $0 for $2"

if [ "$1" = pre ] ; then
	run-fs-freeze
fi

if [ "$1" = post ] ; then
	run-fs-unfreeze
fi

Future ideas
------------

# Flushing

Flushing is *not* fully needed as per Rafael, and he suggests this actually
creates high latencies for suspend. According to Jan Kara, addressing this is
complex.

Jan Kara also notes that switching the fs freeze implementation to avoid
sync(2) if asked to is quite independent from implementing system suspend to
use fs freezing.

Can patches to make 'background buffered writeback not suck' help with future
suspend sync latencies? Probably not? Jens Axboe's patches:

https://lwn.net/Articles/681763/
https://lkml.org/lkml/2016/3/23/310

What about dynamically throttling the amount of writeback allowed through a
grace period, moments prior to suspend?

--
Do not panic