On Tue, Oct 01, 2024 at 03:20:43PM +0200, Jean-Louis Dupond wrote:
> Hi All,
>
> I've been investigating a hang/freeze of some of our VM's when running a
> snapshot on them.
> The cause seems to be that the fsFreeze call before the snapshot gets locked
> forever due to the use of a loop device.
>
> There are multiple reports about this, like [1] or [2].
> Also there is a quite easy way to simulate it, see [3].
>
> Now this seems to happen when you do a ioctl FIFREEZE on a mount point that
> is backed by a loop device.
> For ex:
> /dev/loop0 3.9G 508K 3.7G 1% /tmp
>
> And loop0 is:
> /dev/loop0: [2052]:25308501 (/usr/tmpDSK)
>
> Now if you lock the disk/partition that has /usr, and then you want to lock
> /tmp, it will hang forever (or until you thaw the /usr disk).

Yup, this is the behaviour freezing the backing filesystem of a loop
device has had since freezing filesystems was implemented 20 years
ago. So, really, this is expected behaviour...

> You would expect that the call returns -EBUSY like the others, but that is
> not the case.

I certainly don't expect this to return -EBUSY - the /tmp filesystem
has not been frozen, and so it should not return -EBUSY to an attempt
to freeze it. It should be frozen, and because that requires IO to be
done, the behaviour of the operation is dependent on how the lower
layers process the IO that the filesystem issues.

> Is this something we want to solve? Or does somebody have better idea's on
> how to resolve this?

The actual process of freezing complex data caching hierarchies is
beyond the awareness of the kernel. To create consistent snapshot
images of a filesystem, userspace applications need to be flushing
cached data before the filesystem is frozen (i.e. so they suspend to
a coherent state for device snapshots and backups). Then the
filesystem can suspend, and once all write IO is done, the block
device can be suspended. This is a top-down process - it has to be
done in this order.

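The top-down ordering described above can be sketched from userspace. What follows is only a minimal illustration, not anything taken from the kernel or from QEMU. It assumes the caller already knows which mount point holds the backing file of each loop-mounted filesystem; the `backing` mapping and the `freeze_order()` and `freeze()` names are hypothetical. The ioctl numbers assume the generic `_IOWR('X', nr, int)` encoding used on x86 and arm64, and some other architectures encode them differently. Thaw ordering is left out on purpose, since that is a policy question.

```python
import fcntl
import os
from graphlib import TopologicalSorter


def _iowr(type_char, nr, size):
    # Generic Linux _IOWR() encoding (asm-generic/ioctl.h). Assumption: not
    # valid on architectures with a different layout (e.g. powerpc, mips).
    return (3 << 30) | (size << 16) | (ord(type_char) << 8) | nr


FIFREEZE = _iowr("X", 119, 4)  # linux/fs.h: _IOWR('X', 119, int)
FITHAW = _iowr("X", 120, 4)    # linux/fs.h: _IOWR('X', 120, int)


def freeze_order(backing):
    """Return mount points ordered so that upper filesystems come first.

    `backing` (hypothetical input) maps each mount point to the mount point
    that holds its loop backing file, or None for a filesystem on a real disk.
    """
    ts = TopologicalSorter()
    for mnt, lower in backing.items():
        ts.add(mnt)
        if lower is not None:
            ts.add(lower, mnt)  # mnt must be frozen before its backing fs
    return list(ts.static_order())  # raises graphlib.CycleError on a loop


def freeze(mountpoint):
    """Issue FIFREEZE on one mount point (needs CAP_SYS_ADMIN; may block)."""
    fd = os.open(mountpoint, os.O_RDONLY)
    try:
        fcntl.ioctl(fd, FIFREEZE, 0)
    finally:
        os.close(fd)


if __name__ == "__main__":
    # The layout from the report: /tmp lives on a loop device backed by /usr.
    print(freeze_order({"/usr": None, "/tmp": "/usr"}))
```

For the reported layout this puts /tmp ahead of /usr. Freezing /usr first is exactly the case where the later FIFREEZE on /tmp blocks until /usr is thawed.
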
However, loop devices mean that suspend events first need to
propagate upwards to the top of the hierarchy before any suspend
operation is started so that the entire suspend can be done from the
top down. I think FUSE also introduces complex hierarchies which
could possibly include loops, and I suspect overlay can introduce
them, too.

Hence walking to the top of the hierarchy before we can start a
suspend operation on a filesystem involves interacting with
userspace policy, configuration and applications. This really can
only be done reliably from userspace.

This is especially true when we consider thawing that hierarchy.
What is the kernel supposed to do if it gets a thaw request in the
middle of such a nested suspend? Is it supposed to cascade up and
down the hierarchy? Maybe just up? Or maybe it should be ignored
because it's not at the bottom of the freeze graph? Or maybe we
should allow overrides in case of emergencies if userspace loses
track of what it's frozen?

This rapidly gets complex if we try to handle all these potential
policy considerations from a completely context-free environment
inside the kernel. We cannot make the right choices for all cases
where nested freezes might or might not be required by userspace.
It's easier to say "don't do that" than it is to try to solve such a
complex problem with code.

Ultimately, I suspect that FIFREEZE needs to issue a blocking
fanotify event so that userspace can capture such requests and apply
whatever policy the user wants for nested block device hierarchies
and the applications on top of them before the filesystem itself
gets frozen....

> The Qemu issue is already a long standing issue, which I want to get
> resolved :)

It's a long standing issue because it's a complex issue that can't
really be solved by adding code to the kernel alone. Loop device
management is done by userspace, not the kernel, and there is no one
policy for freezing that applies to every situation....

-Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx