On Thu, 14 Apr 2022, Ming Lei wrote:
> On Wed, Apr 13, 2022 at 03:49:07PM -0700, Eric Wheeler wrote:
> > On Tue, 22 Mar 2022, Eric Wheeler wrote:
> > > On Mon, 21 Mar 2022, Ming Lei wrote:
> > > > On Sat, Mar 19, 2022 at 10:14:29AM -0700, Eric Wheeler wrote:
> > > > > Hello all,
> > > > >
> > > > > In loop.c do_req_filebacked() for REQ_OP_FLUSH, lo_req_flush() is
> > > > > called: it does not appear that lo_req_flush() does anything to make
> > > > > sure ki_complete has been called for pending work, it just calls
> > > > > vfs_fsync().
> > > > >
> > > > > Is this a consistency problem?
> > > >
> > > > No. What FLUSH command provides is just flushing cache in device side
> > > > to storage medium, so it is nothing to do with pending request.
> > >
> > > If a flush follows a series of writes, would it be best if the flush
> > > happened _after_ those writes complete? Then the storage medium will
> > > be sure to flush what was intended to be written.
> > >
> > > It seems that this series of events could lead to inconsistent data:
> > >
> > >   loop          ->  filesystem
> > >   write a
> > >   write b
> > >   flush
> > >                     write a
> > >                     flush
> > >                     write b
> > >                     crash, b is lost
> > >
> > > If write+flush ordering is _not_ important, then can you help me
> > > understand why?
> >
> > Hi Ming, just checking in: did you see the message above?
> >
> > Do you really mean to say that reordering writes around a flush is safe
> > in the presence of a crash?
>
> Sorry, replied too quick.

That's ok, thanks for considering the issue!

> BTW, what is the actual crash? Any dmesg log? From the above description,
> b is just not flushed to storage when running flush, and sooner or later
> it will land, so what is the real issue?

In this case "crash" is actually a filesystem snapshot using most any
snapshot technology such as dm-thin, btrfs, bcachefs, or zfs. From the
point of view of the filesystem inside the loop file, a snapshot is the
same as a hardware crash.

Some background: we have already seen journal commit re-ordering caused by
dm-crypt in dm-thin snapshots:

	dm-thin -> dm-crypt -> [kvm with a disk using ext4]

This is the original email about dm-crypt ordering:
	https://listman.redhat.com/archives/dm-devel/2016-September/msg00035.html

We "fixed" the dm-crypt issue by disabling parallel dm-crypt threads with
the dm-crypt flags same_cpu_crypt+submit_from_crypt_cpus and haven't seen
the issue since. (It's noticeably slower, but I'll take consistency over
performance any day! Not sure if that old dm-crypt ordering has been
fixed.)

So back to considering loop devs: having seen the dm-crypt issue I would
like to verify that loop isn't susceptible to the same issue in the
presence of lower-level snapshots---but it looks like it could be, since
flushes don't enforce ordering. Here is why:

In ext4/super.c:ext4_sync_fs(), the ext4 code calls
blkdev_issue_flush(sb->s_bdev) when barriers are enabled (which is the
default). blkdev_issue_flush() sets REQ_PREFLUSH, and according to
blk_types.h this is a "request for cache flush"; you could think of
in-flight IOs on the way through loop.ko and into the hypervisor
filesystem where the loop's backing file lives as being in a "cache" of
sorts---especially for non-DIO loopdevs hitting the pagecache.

Thus, ext4 critically expects that all IOs preceding a flush will hit
persistent storage before any IOs issued after the flush. Failing that,
journal replay cannot return the filesystem to a consistent state without
a `fsck`. (Note that ext4 is just an example; really, any journaled
filesystem is at risk.)
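For reference, here is the flush path as I read it in
drivers/block/loop.c, paraphrased and trimmed (exact code varies by kernel
version, so treat this as a sketch rather than a quote). The point is that
nothing on this path waits for previously submitted aio writes to reach
->ki_complete() before the backing file is fsync'd:

  /* Paraphrased from drivers/block/loop.c -- not an exact quote of any
   * particular kernel version, just the shape of the flush path. */
  static int lo_req_flush(struct loop_device *lo, struct request *rq)
  {
          /* Flushes the backing file's dirty pages and metadata, but does
           * not wait for async writes submitted via lo_rw_aio() that have
           * not yet reached ->ki_complete(). */
          int ret = vfs_fsync(lo->lo_backing_file, 0);

          if (unlikely(ret && ret != -EINVAL))
                  ret = -EIO;
          return ret;
  }

  static int do_req_filebacked(struct loop_device *lo, struct request *rq)
  {
          switch (req_op(rq)) {
          case REQ_OP_FLUSH:
                  /* No ordering against in-flight writes happens here. */
                  return lo_req_flush(lo, rq);
          /* ... REQ_OP_READ/REQ_OP_WRITE go through lo_rw_aio() for
           * aio/DIO, or the "simple" synchronous paths otherwise ... */
          default:
                  return -EIO;
          }
  }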
Let's say a virtual machine uses a loopback file for a disk and the VM
issues the following to delete some file called "/b":

	unlink("/b"):
		write: journal commit: unlink /b
		flush: blkdev_issue_flush()
		write: update fs datastructures (remove /b from a dentry)
		<hypervisor snapshot>

If the flush happens out of order then an operation like unlink("/b")
could look like this, where the guest VM's filesystem is on the left and
the hypervisor's loop filesystem operations are on the right:

  VM ext4 filesystem               | Hypervisor loop dev ordering
  ---------------------------------+---------------------------------------
  write: journal commit: unlink /b |
  flush: blkdev_issue_flush()      |
  write: update fs dentries        |
                                   | queued to loop: [journal commit: unlink /b]
                                   | queued to loop: [update fs dentries]
                                   | flush: vfs_fsync() - out of order
                                   | queued to ext4: [journal commit: unlink /b]
                                   | queued to ext4: [update fs dentries]
                                   | write lands: [update fs dentries]
                                   | <snapshot!>
                                   | write lands: [journal commit: unlink /b]

Notice that the last two "write lands" are out of order because the
vfs_fsync() does not separate them as expected by the VM's ext4
implementation.

Presently, loop.c will never re-order actual WRITEs: they will be
submitted to the loopdev's `file*` handle in submit order because the
workqueue keeps them ordered. This is good[*].

But REQ_FLUSH is not a write: the vfs_fsync() in lo_req_flush() is _not_
ordered by the write queue, and there is no mechanism in loop.c to enforce
completion of IOs submitted to the loopdev's `file*` handle before the
vfs_fsync() completes, nor are subsequent IOs required to complete after
the flush. Thus, the hypervisor's snapshot-capable filesystem can re-order
the last two writes because the flush happened early.

In the re-ordered case on the hypervisor side: if a snapshot happens after
the dentry removal but _before_ the journal commit, then a journal replay
of the resulting snapshot will be inconsistent.

Flush re-ordering creates an inconsistency in two possible cases:

  a. A snapshot taken after the dentry removal but before the journal commit.
  b. A crash after the dentry removal but before the journal commit.

Really, a snapshot looks like a crash to the filesystem, so (a) and (b)
are equivalent, but (a) is easier to reason about. In either case,
mounting the out-of-order filesystem (the snapshot in (a), the origin in
(b)) will present kernel errors in the VM when the dentry is read:

  kernel: EXT4-fs error (device dm-2): ext4_lookup:1441: inode #1196093:
    comm rsync: deleted inode referenced: 1188710

  [ https://listman.redhat.com/archives/dm-devel/2016-September/028121.html ]

Fixing flush ordering provides for two possible improvements
([*] from above about write ordering):

  1. Consistency, as above.

  2. Performance. Right now loopdev IOs are serialized by a single write
     queue per loopdev. Parallel workqueues could be used to submit IOs in
     parallel to the filesystem serving the loopdev's `file*` handle,
     since VMs may submit IOs from different CPU cores. Special care would
     need to be taken for the following cases:

       a. Ordering of dependent IOs (ie, reads or writes of preceding writes).
       b. Flushes need to wait for all workqueues to quiesce (see the
          sketch below for one possible barrier).

W.r.t. choosing the number of WQs: certainly not a 1:1 CPU-to-workqueue
mapping, because of the old WQ issue of running out of PIDs with lots of
CPUs, but here are some possibilities:

  a. 1:1 per socket
  b. User configurable as a module parameter
  c. A dedicated pool of workqueues for all loop devs that dispatches
     queued IOs to the correct `file*` handle. RCU might be useful.
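To make the fix for #1 concrete, below is a rough sketch of what a flush
barrier in loop.c could look like. The names inflight_writes and
inflight_wq are hypothetical fields invented for illustration (they do not
exist in struct loop_device today), and this is untested sketch-level
code, not a proposal-quality patch; the idea is simply to let every write
submitted before the flush reach ->ki_complete() before calling
vfs_fsync():

  /* Hypothetical fields added to struct loop_device for illustration:
   *
   *	atomic_t		inflight_writes;
   *	wait_queue_head_t	inflight_wq;
   */

  /* Called when an async write is submitted (e.g. from lo_rw_aio()). */
  static inline void lo_write_begin(struct loop_device *lo)
  {
          atomic_inc(&lo->inflight_writes);
  }

  /* Called from the write's ->ki_complete() handler. */
  static inline void lo_write_end(struct loop_device *lo)
  {
          if (atomic_dec_and_test(&lo->inflight_writes))
                  wake_up(&lo->inflight_wq);
  }

  static int lo_req_flush(struct loop_device *lo, struct request *rq)
  {
          int ret;

          /* Barrier: wait for all writes submitted ahead of this flush to
           * complete before asking the backing filesystem to flush.
           * Sleeping here is fine since this runs from the loop worker. */
          wait_event(lo->inflight_wq,
                     atomic_read(&lo->inflight_writes) == 0);

          ret = vfs_fsync(lo->lo_backing_file, 0);
          if (unlikely(ret && ret != -EINVAL))
                  ret = -EIO;
          return ret;
  }

Because the single per-device workqueue submits requests in order, only
writes dequeued ahead of the flush have been started by the time
lo_req_flush() runs, so this barrier waits for exactly the preceding
writes. With the parallel-workqueue idea in #2, the same
counter-and-waitqueue (or a flush_workqueue() on each queue) could serve
as the quiesce point in (2.b).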
What next?

Ok, so assuming consistency is an issue, #1 is the priority and #2 is
nice-to-have. This might be the right time to consider these, since there
is so much discussion about loop.c on the list right now.

According to my understanding of the research above this appears to be an
issue, but there are other kernel developers who know this code better
than I do. I want to know if this is correct: should others be CC'ed on
this topic? If so, who?

--
Eric Wheeler


>
>
> Thanks,
> Ming
>
>