Re: FIFREEZE on loop device does not return EBUSY

Dave Chinner <david@xxxxxxxxxxxxx> · Wed, 2 Oct 2024 10:10:08 +1000

On Tue, Oct 01, 2024 at 03:20:43PM +0200, Jean-Louis Dupond wrote:
> Hi All,
> 
> I've been investigating a hang/freeze of some of our VM's when running a
> snapshot on them.
> The cause seems to be that the fsFreeze call before the snapshot gets locked
> forever due to the use of a loop device.
> 
> There are multiple reports about this, like [1] or [2].
> Also there is a quite easy way to simulate it, see [3].
> 
> Now this seems to happen when you do a ioctl FIFREEZE on a mount point that
> is backed by a loop device.
> For ex:
> /dev/loop0      3.9G  508K  3.7G   1% /tmp
> 
> And loop0 is:
> /dev/loop0: [2052]:25308501 (/usr/tmpDSK)
> 
> Now if you lock the disk/partition that has /usr, and then you want to lock
> /tmp, it will hang forever (or until you thaw the /usr disk).

Yup, this is the behaviour freezing the backing filesystem of a loop
device has had since freezing filesystems was implemented 20 years
ago.

So, really, this is expected behaviour...

> You would expect that the call returns -EBUSY like the others, but that is
> not the case.

I certainly don't expect this to return -EBUSY - the /tmp filesystem
has not been frozen, and so it should not return -EBUSY to an
attempt to freeze it. It should be frozen, and because that requires
IO to be done, the behaviour of the operation is dependent on how
the lower layers process the IO that the filesystem issues.
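
For reference, a minimal userspace sketch of the call in question.
It assumes the generic Linux ioctl number encoding (some
architectures, e.g. MIPS and PowerPC, encode the direction bits
differently), and the helper names freeze() and thaw() are
illustrative only. EBUSY is reported only when the filesystem being
frozen is itself already frozen; if the lower layers cannot complete
the IO, the ioctl simply blocks.

```python
import errno
import fcntl
import os


def _iowr(type_char, nr, size):
    """Encode an _IOWR() ioctl number using the generic Linux layout."""
    return (3 << 30) | (size << 16) | (ord(type_char) << 8) | nr


# From <linux/fs.h>: _IOWR('X', 119, int) and _IOWR('X', 120, int).
FIFREEZE = _iowr("X", 119, 4)
FITHAW = _iowr("X", 120, 4)


def freeze(mountpoint):
    """Freeze the filesystem mounted at mountpoint (needs CAP_SYS_ADMIN).

    Returns True on success and False if that filesystem was already
    frozen (EBUSY). The call can block indefinitely if the filesystem
    sits on a loop device whose backing filesystem is frozen.
    """
    fd = os.open(mountpoint, os.O_RDONLY)
    try:
        fcntl.ioctl(fd, FIFREEZE, 0)
        return True
    except OSError as e:
        if e.errno == errno.EBUSY:
            return False
        raise
    finally:
        os.close(fd)


def thaw(mountpoint):
    """Thaw the filesystem mounted at mountpoint; raises OSError on failure."""
    fd = os.open(mountpoint, os.O_RDONLY)
    try:
        fcntl.ioctl(fd, FITHAW, 0)
    finally:
        os.close(fd)
```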

> Is this something we want to solve? Or does somebody have better idea's on
> how to resolve this?

The actual process of freezing complex data caching hierarchies is
beyond the awareness of the kernel. To create consistent snapshot
images of a filesystem, userspace applications need to be flushing
cached data before the filesystem is frozen (i.e. so they suspend to
a coherent state for device snapshots and backups). Then the
filesystem can suspend, and once all write IO is done, the block
device can be suspended. This is a top-down process - it has to be
done in this order.

However, loop devices mean that suspend events first need to
propagate upwards to the top of the hierarchy before any suspend
operation is started so that the entire suspend can be done from the
top down. I think FUSE also introduces complex hierarchies which
could possibly include loops, and I suspect overlay can introduce
them, too.
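
A rough sketch of the ordering problem, assuming userspace already
knows which filesystem holds the backing file of each loop-mounted
filesystem. The freeze_order() name and the shape of the backing map
are hypothetical. It only computes an order in which upper layers
come before the filesystems they sit on, and refuses to continue if
the layering contains a cycle:

```python
def freeze_order(backing):
    """Return mount points ordered top-down for freezing.

    backing maps each mount point to the mount point of the filesystem
    that holds its backing file, or None if it sits directly on a real
    block device. This input format is an assumption for illustration.
    """
    # Reverse the edges: for each lower filesystem, list what sits on it.
    uppers = {mount: [] for mount in backing}
    for upper, lower in backing.items():
        if lower is not None:
            uppers.setdefault(lower, []).append(upper)

    order = []
    state = {}

    def visit(mount):
        if state.get(mount) == "done":
            return
        if state.get(mount) == "visiting":
            raise ValueError("cycle in filesystem layering at %s" % mount)
        state[mount] = "visiting"
        # Everything stacked on top of this filesystem must be frozen first.
        for upper in sorted(uppers.get(mount, [])):
            visit(upper)
        state[mount] = "done"
        order.append(mount)

    for mount in sorted(uppers):
        visit(mount)
    return order
```

With the layout from the report, freeze_order({"/tmp": "/usr",
"/usr": None}) gives ["/tmp", "/usr"], i.e. /tmp has to be frozen
before the filesystem holding /usr/tmpDSK.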

Hence walking to the top of the hierarchy before we can start a
suspend operation on a filesystem involves interacting with
userspace policy, configuration and applications. This really can
only be done reliably from userspace.
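
As one small piece of that, userspace can at least discover which
files back the configured loop devices. A minimal sketch, assuming
the loop driver's sysfs attribute /sys/block/loopN/loop/backing_file
is available; the function name and the sysfs_block parameter are
illustrative only, and mapping each backing file to the filesystem
it lives on is left to the caller:

```python
import os


def loop_backing_files(sysfs_block="/sys/block"):
    """Map loop device nodes to the path of their backing file."""
    result = {}
    for name in sorted(os.listdir(sysfs_block)):
        if not name.startswith("loop"):
            continue
        attr = os.path.join(sysfs_block, name, "loop", "backing_file")
        try:
            with open(attr) as f:
                result["/dev/" + name] = f.read().rstrip("\n")
        except FileNotFoundError:
            # The loop device exists but has no backing file attached.
            continue
    return result
```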

This is especially true when we consider thawing that hierarchy.
What is the kernel supposed to do if it gets a thaw request in the
middle of such a nested suspend? Is it supposed to cascade up and
down the hierarchy? Maybe just up? Or maybe it should be ignored
because it's not at the bottom of the freeze graph? Or maybe we
should allow overrides in case of emergencies if userspace loses
track of what it's frozen?

This rapidly gets complex if we try to handle all these potential
policy considerations from a completely context-free environment
inside the kernel. We cannot make the right choices for all cases
where nested freezes might or might not be required by userspace.
It's easier to say "don't do that" than it is to try to solve such a
complex problem with code.

Ultimately, I suspect that FIFREEZE needs to issue a blocking
fanotify event so that userspace can capture such requests and apply
whatever policy the user wants for nested block device hierarchies
and the applications on top of them before the filesystem itself
gets frozen....

> The Qemu issue is already a long standing issue, which I want to get
> resolved :)

It's a long-standing issue because it's a complex issue that can't
really be solved by adding code to the kernel alone. Loop device
management is done by userspace, not the kernel, and there is no
one policy for freezing that applies to every situation....

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx