Re: [LSF/MM TOPIC] Phasing out kernel thread freezing

"Darrick J. Wong" <darrick.wong@xxxxxxxxxx> · Wed, 31 Jan 2018 11:10:16 -0800

On Fri, Jan 26, 2018 at 10:09:23AM +0100, Luis R. Rodriguez wrote:
> Since the 2015 Kernel summit in South Korea we agreed that we should phase out
> the kernel thread freezer. This was due to the fact that filesystem freezing
> was originally added into the kernel to aid in going to suspend to ensure no
> unwanted IO activity would cause filesystem corruption, and we could instead
> replace this by using the already implemented filesystem suspend/thaw calls.
> 
> Filesystems are not the only users of the freezer API now though. Although
> most uses outside of filesystems might be bogus, we're prone to hit many
> regressions with a wide sweep removal. Actually phasing out kernel thread
> freezing turns out to be trickier than expected even just in filesystems alone,
> so the current approach is to slowly phase this out one step at a time: one
> subsystem and driver type at a time. Clearly the first subsystem we should
> tackle is filesystems.
> 
> We now seem to have reached consensus on how to do this for a few
> filesystems which implement freeze_fs() only. The outstanding work I have is
> just to evaluate the prospect of sharing the same freeze semantics as
> freeze_bdev(), initiated by dm, and a proper way to address reference
> counting in a generic form for sb freezing. The only filesystems which
> implement freeze_fs() are:
> 
>   o xfs
>   o reiserfs
>   o nilfs2
>   o jfs
>   o f2fs
>   o ext4
>   o ext2
>   o btrfs
> 
> Of these, the following have freezer helpers, which can then be removed after
> the kernel automatically calls freeze_fs for us on suspend:
> 
>   o xfs
>   o nilfs2
>   o jfs
>   o f2fs
>   o ext4
> 
> Long term we need to decide what to do with filesystems which do not implement
> freeze_fs(), or for instance filesystems which implement freeze_super(). Jan
> Kara made a few suggestions I'll be evaluating soon in this regard; however,
> there are other special filesystems with other considerations.  As an
> example, for NFS Jeff Layton has suggested to have freeze_fs() make the RPC
> engine "park" newly issued RPCs for that fs' client onto a rpc_wait_queue.  For
> any RPC that has already been sent, however, we need to wait for a reply. Once
> everything is quiesced we can return and call it frozen.  unfreeze_fs can then
> just have the engine stop parking RPCs and wake up the waitq. He however points
> out that if we're interested in making the cgroup freezer also work, then we
> may need to do a bit more work to ensure that we don't end up with frozen tasks
> squatting on VFS locks. Dave Chinner however notes that cgroup is broken by
> design *if* it requires tasks to be frozen without holding any VFS/filesystem
> lock context, and as such we *should* be able to ignore it.
> 
> We also need to decide what to do with complex layered situations, for example
> Bart Van Assche suggested considering the case of a filesystem that exists on
> top of an md device where the md device uses one or more files as backing store
> and with the loop driver between the md device and the files. Chinner has
> suggested to allow block devices to freeze superblocks on the block device,
> however some *may* prefer to have a call to allow a superblock to quiesce the
> underlying block device which would allow md/dm to suspend whatever on-going
> maintenance operations it has in progress until the filesystem suggests it
> needs to thaw. The pros / cons of both approaches should probably be discussed
> unless it's already crystal clear what path to take.

For a brief moment I pondered whether it would make sense to make
filesystems part of the device model so that the suspend code could work
out fs <-> bdev dependencies and know in which order to freeze
filesystems and quiesce devices, but every time I go digging into how
all those macros work I get confused and my eyes glaze over, so I don't
know if this is at all a good idea or just confused ramblings.

Maybe it would suffice to start freezing in reverse order of mount and
have some way to tell the underlying bdev that it should
flush/quiesce/whatever itself?
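
To make the ordering idea concrete, here is a rough userspace sketch,
not kernel code. Every name in it (sim_bdev, sim_mount, sim_freeze_all,
sim_thaw_all) is made up for illustration, and it assumes each
filesystem sits on exactly one bdev and that mount order is a good
enough stand-in for the dependency order. It freezes filesystems newest
mount first and asks each underlying bdev to quiesce right after its
filesystem is frozen, then undoes that in mount order on thaw.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a block device that can be told to quiesce. */
struct sim_bdev {
	const char *name;
	bool quiesced;
};

/* Hypothetical stand-in for a mounted filesystem. */
struct sim_mount {
	const char *fsname;
	struct sim_bdev *bdev;	/* device this fs sits on, may be NULL */
	bool frozen;
	int freeze_seq;		/* position in the freeze order, for inspection */
};

/*
 * mounts[] is in mount order: index 0 was mounted first.
 * Freeze in reverse mount order so that a filesystem stacked on top of
 * another one (say, via a loop device) is frozen before the filesystem
 * holding its backing file.  Returns the number of filesystems frozen.
 */
int sim_freeze_all(struct sim_mount *mounts, size_t n)
{
	int seq = 0;

	assert(mounts != NULL || n == 0);

	for (size_t i = n; i-- > 0; ) {
		struct sim_mount *m = &mounts[i];

		m->frozen = true;
		m->freeze_seq = seq++;

		/* Tell the underlying bdev to flush/quiesce itself. */
		if (m->bdev)
			m->bdev->quiesced = true;
	}
	return seq;
}

/* Thaw in mount order: resume the bdev first, then the filesystem. */
void sim_thaw_all(struct sim_mount *mounts, size_t n)
{
	assert(mounts != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		struct sim_mount *m = &mounts[i];

		if (m->bdev)
			m->bdev->quiesced = false;
		m->frozen = false;
		m->freeze_seq = -1;
	}
}
```

With a root fs mounted first and a second fs mounted later on a loop
device backed by a file on the root fs, this freezes the loop-backed fs
before the root fs.  It says nothing about md/dm maintenance work or
about filesystems that share a bdev, which is where mount order alone
probably stops being enough.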

--D

> Finally, we should evaluate any other potential uses of the kernel freezer API
> which have now grown dependent on it, even though it was only designed to
> help avoid filesystem corruption on our way to suspend. If none have really
> become dependent on it, then great, we can just remove them one subsystem at
> a time to avoid regressions.
> 
>   Luis