[LSF/MM TOPIC] Phasing out kernel thread freezing

"Luis R. Rodriguez" <mcgrof@xxxxxxxxxx> · Fri, 26 Jan 2018 10:09:23 +0100

At the 2015 Kernel Summit in South Korea we agreed that we should phase out
the kernel thread freezer. Kernel thread freezing was originally added to the
kernel to aid in going to suspend, to ensure no unwanted IO activity would
cause filesystem corruption, and we can instead replace it with the already
implemented filesystem freeze/thaw calls.

Filesystems are no longer the only users of the freezer API, though. Although
most uses outside of filesystems might be bogus, we're prone to hit many
regressions with a wide sweep removal. Actually phasing out kernel thread
freezing turns out to be trickier than expected even in filesystems alone,
so the current approach is to phase this out slowly, one subsystem and driver
type at a time. Clearly the first subsystem we should tackle is filesystems.

We now seem to have reached consensus on how to do this, but only for the
filesystems which implement freeze_fs(). The outstanding work I have is to
evaluate whether we can share the same freeze semantics as freeze_bdev(),
which is initiated by dm, and to find a proper way to address reference
counting in a generic form for sb freezing. The only filesystems which
implement freeze_fs() are:

  o xfs
  o reiserfs
  o nilfs2
  o jfs
  o f2fs
  o ext4
  o ext2
  o btrfs

Of these, the following have freezer helpers, which can be removed once the
kernel automatically calls freeze_fs() for us on suspend:

  o xfs
  o nilfs2
  o jfs
  o f2fs
  o ext4
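
To make the reference counting concern for sb freezing a bit more concrete,
here is a minimal userspace sketch, not kernel code. It assumes nested freeze
requests (say one from dm and one from suspend) should only freeze the
filesystem on the first request and only thaw it on the last. The names
sb_model, model_freeze() and model_thaw() are hypothetical and only stand in
for whatever generic form we end up with.

```c
#include <assert.h>

/* Hypothetical userspace model of a superblock; not the kernel's struct. */
struct sb_model {
	int freeze_count;	/* outstanding freeze requests */
	int fs_frozen;		/* 1 while the filesystem is actually frozen */
	int freeze_fs_calls;	/* times the fs freeze hook would have run */
	int unfreeze_fs_calls;	/* times the fs thaw hook would have run */
};

/* Take a freeze reference; only the first one freezes the filesystem. */
static int model_freeze(struct sb_model *sb)
{
	if (sb->freeze_count++ > 0)
		return 0;	/* already frozen by an earlier request */

	sb->fs_frozen = 1;
	sb->freeze_fs_calls++;	/* stands in for calling freeze_fs() */
	return 0;
}

/* Drop a freeze reference; only the last one thaws the filesystem. */
static int model_thaw(struct sb_model *sb)
{
	if (sb->freeze_count == 0)
		return -1;	/* unbalanced thaw, nothing to undo */

	if (--sb->freeze_count > 0)
		return 0;	/* someone else still wants it frozen */

	sb->fs_frozen = 0;
	sb->unfreeze_fs_calls++;	/* stands in for calling unfreeze_fs() */
	return 0;
}
```

With this model two freezes followed by one thaw leave the filesystem frozen,
and the filesystem hooks run once each over the whole sequence. Whether an
unbalanced thaw should be an error is one of the things still to be evaluated.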

Long term we need to decide what to do with filesystems which do not implement
freeze_fs(), or for instance filesystems which implement freeze_super(). Jan
Kara made a few suggestions in this regard which I'll be evaluating soon, but
there are other special filesystems with their own considerations. As an
example, for NFS Jeff Layton has suggested having freeze_fs() make the RPC
engine "park" newly issued RPCs for that fs' client onto a rpc_wait_queue. For
any RPC that has already been sent, however, we need to wait for a reply. Once
everything is quiesced we can return and call it frozen. unfreeze_fs() can then
just have the engine stop parking RPCs and wake up the waitq. He points out,
however, that if we're interested in making the cgroup freezer also work, then
we may need to do a bit more work to ensure that we don't end up with frozen
tasks squatting on VFS locks. Dave Chinner notes that the cgroup freezer is
broken by design *if* it requires tasks to be frozen without holding any
VFS/filesystem lock context, and as such we *should* be able to ignore it.
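
The NFS parking idea described above can be sketched as a small userspace
model. This is not sunrpc code; rpc_model and the model_*() helpers are
hypothetical names, the counters stand in for the real rpc_wait_queue, and a
real freeze_fs() would presumably sleep until quiesced rather than report a
status to be polled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of one NFS client's RPC engine. */
struct rpc_model {
	bool parking;	/* true between freeze and unfreeze */
	int in_flight;	/* RPCs already sent, reply not yet received */
	int parked;	/* newly issued RPCs held back while parking */
};

/* Issue a new RPC: park it while freezing, otherwise send it. */
static void model_rpc_submit(struct rpc_model *m)
{
	if (m->parking)
		m->parked++;
	else
		m->in_flight++;
}

/* A reply arrived for one RPC that had already been sent. */
static void model_rpc_reply(struct rpc_model *m)
{
	if (m->in_flight > 0)
		m->in_flight--;
}

/* Frozen means: parking new RPCs and nothing left awaiting a reply. */
static bool model_is_frozen(const struct rpc_model *m)
{
	return m->parking && m->in_flight == 0;
}

/* Start parking new RPCs; returns true if already quiesced. */
static bool model_freeze_fs(struct rpc_model *m)
{
	m->parking = true;
	return model_is_frozen(m);
}

/* Stop parking and release everything parked; returns how many woke up. */
static int model_unfreeze_fs(struct rpc_model *m)
{
	int woken = m->parked;

	m->parking = false;
	m->in_flight += woken;	/* parked RPCs are now sent */
	m->parked = 0;
	return woken;
}
```

The point of the model is only the ordering: new RPCs stop going out the
moment we start freezing, but we cannot call the fs frozen until every RPC
already on the wire has been answered.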

We also need to decide what to do with complex layered situations. For example,
Bart Van Assche suggested considering the case of a filesystem that exists on
top of an md device, where the md device uses one or more files as backing
store, with the loop driver between the md device and the files. Chinner has
suggested allowing block devices to freeze superblocks on the block device,
however some *may* prefer to have a call to allow a superblock to quiesce the
underlying block device, which would allow md/dm to suspend whatever on-going
maintenance operations it has in progress until the filesystem indicates it
needs to thaw. The pros / cons of both approaches should probably be discussed
unless it's already crystal clear what path to take.

Finally, we should evaluate any other uses of the kernel freezer API which
have now grown dependent on it, even though it was only designed to help avoid
filesystem corruption on our way to suspend. If none have really become
dependent on it, then great, we can just remove those uses one subsystem at a
time to avoid regressions.

  Luis