Rusty Russell wrote:
> > Seems over-zealous.
> > If the recovery_header held a strong checksum of the recovery_data you
> > would not need the first fsync, and as long as you have two places to
> > write recovery data, you don't need the 3rd and 4th syncs.
> > Just:
> >     write_internally_checksummed_recovery_data_and_header_to_unused_log_space()
> >     fsync / msync
> >     overwrite_with_new_data()
> >
> > To recover, you choose the most recent log_space and replay the content.
> > That may be a redundant operation, but that is no loss.
>
> I think you missed a checksum for the new data?  Otherwise we can't tell
> if the new data is completely written.

The data checksum can go in the recovery-data block.  If there's enough
slack in the log, then by the time that recovery-data block is itself
overwritten you can be sure an fsync has already been done for that data
(by a later commit).

> But yes, I will steal this scheme for TDB2, thanks!

Take a look at the filesystems.  I think ext4 did some optimisations in
this area, and that checksums had to be added anyway because of a subtle
replay-corruption problem that happens when the log is partially
corrupted but followed by non-corrupt blocks.

Also, you can remove even more fsyncs by adding a bit of slack to the
data space and writing into unused/fresh areas some of the time - i.e. a
bit like btrfs/zfs or anything log-structured, but you don't have to go
all the way with that.

> In practice, it's the first sync which is glacial, the rest are pretty
> cheap.

The 3rd and 4th fsyncs imply a disk seek each, simply because the
preceding writes are to different areas of the disk.  Seeks are quite
slow - but not as slow as ext3 fsyncs :-)  What do you mean by cheap?
That it's only a couple of seeks, or that you don't see even that?

> > Also cannot see the point of msync if you have already performed an
> > fsync, and if there is a point, I would expect you to call msync
> > before fsync...  Maybe there is some subtlety there that I am not
> > aware of.
>
> I assume it's this from the msync man page:
>
>     msync() flushes changes made to the in-core copy of a file that was
>     mapped into memory using mmap(2) back to disk.  Without use of this
>     call there is no guarantee that changes are written back before
>     munmap(2) is called.

Historically, that means msync() ensures dirty mapping data is written
to the file as if with write(), and that mapping pages are removed or
refreshed to get the effect of read() (possibly a lazy one).  It was
more obvious in early mmap implementations, where mappings didn't share
pages with the filesystem cache, so msync() had explicit work to do.

Like with write(), after calling msync() you would then call fsync() to
ensure the data is flushed to disk.  If you've been calling fsync then
msync, I guess that's another fine example of how these functions are so
hard to test that they usually aren't.

Historically on Linux, msync has been iffy on some architectures, and
I'm still not sure it has the same semantics as on other unixes.  fsync,
as we know, has also been iffy, and even now that fsync is tidier it
does not always issue a hardware-level cache commit.  But then,
historically, writable mmap has been iffy on a boatload of unixes.
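
For what it's worth, here's a minimal sketch of that ordering - nothing
TDB-specific, and with error handling reduced to abort():

    /* Minimal sketch: dirty an mmap'd file, then push the mapping to the
     * file with msync() and the file towards the disk with fsync(), in
     * that order. */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("example.dat", O_RDWR | O_CREAT, 0600);
        if (fd < 0 || ftruncate(fd, 4096) != 0)
            abort();

        char *map = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        if (map == MAP_FAILED)
            abort();

        memcpy(map, "hello", 5);              /* dirty the mapping */

        if (msync(map, 4096, MS_SYNC) != 0)   /* mapping -> file, like write() */
            abort();
        if (fsync(fd) != 0)                   /* file -> disk, we hope */
            abort();

        munmap(map, 4096);
        close(fd);
        return 0;
    }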
> > > It's an implementation detail; barrier has less flexibility because
> > > it has less information about what is required.  I'm saying I want
> > > to give you as much information as I can, even if you don't use it
> > > yet.
> >
> > Only we know that approach doesn't work.
> > People will learn that they don't need to give the extra information
> > to still achieve the same result - just like they did with ext3 and
> > fsync.
> > Then when we improve the implementation to only provide the guarantees
> > that you asked for, people will complain that they are getting empty
> > files that they didn't expect.
>
> I think that's an oversimplification: IIUC that occurred to people *not*
> using fsync().  They weren't using it because it was too slow.  Providing
> a primitive which is as fast or faster and more specific doesn't have the
> same magnitude of social issues.

I agree with Rusty.  Let's make it perform well so there is no reason to
deliberately avoid using it, and let's make it say what apps actually
want to request without being far too strong.

And please, if anyone has ideas on how we could make correct use of
these functions *testable* by app authors, I'm all ears.  Right now it
is quite difficult - pulling power on hard disks mid-transaction is not
a convenient method :)

> > The abstraction I would like to see is a simple 'barrier' that
> > contains no data and has a filesystem-wide effect.
>
> I think you lack ambition ;)
>
> Thinking about the single-file use case (eg. kvm guest or tdb), isn't
> that suboptimal for md?  Since you have to hand your barrier to every
> device whereas a file-wide primitive may theoretically only go to some.

Yes.

Note that database-like programs still need fsync-like behaviour
*sometimes*: the "D" in ACID depends on it, and the "C" in ACID also
depends on it where multiple files are involved which must contain data
consistent with each other after crash/recovery.  (Perhaps Samba depends
on this?)  Single-file sync is valuable just like single-file barrier,
and so is the combination.

Since you mentioned ambition, think about multi-file updates.  They're
analogous in userspace to md's barrier/sync requirements in kernelspace.

One API that supports multi-file update barriers is "long aio-fsync":
something which returns when the data in earlier writes (to one file) is
committed, but does not force the commit to happen more quickly than
normal.  Both single-file barriers (like you want for TDB) and
multi-file barriers can be implemented on top of that (a rough sketch of
the layering is appended below), but it's much more difficult to use
than an fbarrier() syscall, which is only suitable for the single-file
case.  But I wonder if there would be many users of fbarrier() who
aren't perfectly capable of using something else if needed.

-- 
Jamie
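
The layering sketch referred to above.  Everything in it is
hypothetical: ordered_commit_wait() stands in for the "long aio-fsync"
primitive (no such syscall exists today), and the fallback simply uses
plain fsync(), which forces the commit rather than merely waiting for
it, but still gives the ordering:

    /* Hypothetical sketch only.  ordered_commit_wait() is meant to
     * return once the data from earlier writes to the fd is committed,
     * without forcing the commit to happen any sooner than it otherwise
     * would.  Plain fsync() is a stronger stand-in. */
    #include <unistd.h>

    static int ordered_commit_wait(int fd)
    {
        return fsync(fd);        /* conservative stand-in */
    }

    /* Multi-file update barrier: returns only when earlier writes to all
     * of the given fds are durable, so anything written afterwards
     * cannot reach the disk ahead of them. */
    static int multi_file_barrier(const int *fds, int nfds)
    {
        for (int i = 0; i < nfds; i++)
            if (ordered_commit_wait(fds[i]) != 0)
                return -1;
        return 0;
    }

A single-file barrier is just the nfds == 1 case; a TDB-style user would
call it between writing the checksummed log record and overwriting the
data in place.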