On Mon, Feb 26, 2018 at 9:19 AM, Gionatan Danti <g.danti@xxxxxxxxxx> wrote:
> Full disclaimer: maybe my point of view is influenced by thinking in the
> context of Qemu/KVM + software RAID (where much work was done to be sure
> about proper barrier passing) or BBU/NV hardware RAID.
>
> On 26-02-2018 01:25 Dave Chinner wrote:
>>
>> Acknowledged sync writes are not guaranteed to be stable. They may
>> still be sitting in volatile caches below the backing file, and so
>> until there is a cache flush pushed down through all layers of the
>> storage stack (e.g. fsync on the backing file) those acknowledged
>> sync writes are not stable. That's one of the things quiescing the
>> filesystem guarantees, but running reflink to clone the file does
>> not.
>
>
> Sure, but not-passed-down fsync/write barriers will thwart even "normal"
> (ie: not CoW/snapshotted/reflinked) sync writes, and will inevitably cause
> problems (ie: a power loss becomes a big problem). How is it different for
> a reflinked copy?
>
>> IOWs, "properly written" is easy to say but very hard to guarantee.
>> We cannot make such assumptions about random user configs, nor can
>> we base recommendations on such assumptions. If you choose not to
>> quiesce the filesystems before snapshotting them, then it's your
>> responsibility to guarantee your storage stack will work correctly.
>
>
> Absolutely, and I *really* appreciate your advice.
>
>> You still have to quiesce the filesystem when it's on top of a LVM
>> snapshot volume.
>
>
> When the LVM volume is passed to a guest VM, the host can not quiesce the
> filesystem. Host/guest communication can be achieved by means of a guest
> agent and a private control channel, but this has its own problems. I
> thoroughly tested live, LVM-backed snapshotted VMs and every time I run them,
> the guest filesystem replays its log without problem. I always double-check
> that the entire I/O stack (from guest down to the physical disks) honors
> write barriers, though.
>
> Back to the original question: if a reflinked copy is an *atomic* operation
> on all the data extents comprising a file, and in the context of properly
> passed barriers/fsync, I would think that an unquiesced snapshot will work
> for the (reduced) consistency model of a crash-consistent snapshot.
>
> If the reflink copy is not atomic (ie: the different extents are CoWed at
> different times, making it only a "faster copy" rather than a snapshot) this
> will *not* work and I will end up with binary garbage (ie: writes can be
> reordered from the snapshot's view).
>
> I think all can be reduced to a single question: putting aside quiescing
> problems, is a reflinked copy a true *atomic* snapshot or is it "only" a
> faster copy?
>

Gionatan,

First of all, the answer to your question is "just" a faster copy.
Reflinking a file is much faster than copying it, but it is not O(1).
I believe cp --reflink can result in cloning only part of the file
if the system crashes mid-operation, so in any case, the operation
is not *atomic* in that sense.

But your question about quiescing the filesystem and your question
about the *atomic* nature of the clone operation are two very
different questions.

What you seem to *think* xfs reflink does, it does not actually do.
xfs reflink does NOT reflink the file's in-memory data.
xfs reflink "only" reflinks the file's on-disk data.

Right now, if you write a large file without fsync and clone it,
you may well get a clone of an unallocated or partly fallocated
file with zero or stale data.

Going forward, I think there is an intention to "clone" the file's
in-memory data as well by sharing the READONLY cache pages between
cloned files, but I don't think dirty pages are going to be shared
between clones anyway, so you are back to square one: you need to
get the data on disk before cloning the file.

Cheers,
Amir.
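
P.S. To make the "get the data on disk before cloning" point concrete, here is a minimal sketch in Python of fsync-then-clone. It assumes Linux and a filesystem that supports the FICLONE ioctl; the function name clone_after_fsync and the file names in the usage note are hypothetical, and the FICLONE value is the one I believe <linux/fs.h> defines as _IOW(0x94, 9, int). It is an illustration of the ordering only, not what cp --reflink does internally.

```python
import fcntl
import os

# Assumed value of FICLONE from <linux/fs.h>: _IOW(0x94, 9, int). Linux-only.
FICLONE = 0x40049409


def clone_after_fsync(src_path, dst_path):
    """Flush src_path to stable storage, then reflink it to dst_path.

    Returns True if the clone ioctl succeeded, False if it failed
    (for example because the filesystem has no reflink support).
    Hypothetical helper, for illustration only.
    """
    src_fd = os.open(src_path, os.O_RDONLY)
    try:
        # Push dirty page-cache data for the source down to the disk first;
        # the clone only shares on-disk extents.
        os.fsync(src_fd)
        dst_fd = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
        try:
            # ioctl(dst_fd, FICLONE, src_fd): share all of src's extents.
            fcntl.ioctl(dst_fd, FICLONE, src_fd)
            return True
        except OSError:
            # This sketch treats any failure as "reflink unavailable"
            # and removes the empty destination file.
            os.unlink(dst_path)
            return False
        finally:
            os.close(dst_fd)
    finally:
        os.close(src_fd)
```

Usage would be something like clone_after_fsync("guest.img", "guest-clone.img"). Note that this only orders the flush before the clone for a single file. It does not make the clone atomic, and it does not replace quiescing the filesystem inside a guest.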