Re: Notes on block I/O data integrity

Hello Christopher,

thanks a lot for this overview, it answers a lot of my questions!
May I suggest you put it somewhere on the wiki so it doesn't get
forgotten on the mailing list?
It also raises a few new questions, though. We have experienced PostgreSQL
database corruption twice lately. The first time, I blamed a server crash,
but then a freshly created database got corrupted as well, and there had
been no crashes since its initialisation. The server hardware is surely OK.
I haven't had much time to look into this yet, but your mail prompted me to
return to the subject. The situation is a bit more complex, as there are two
additional layers of storage: we're using SATA/SAS drives, network-mirrored
by DRBD, clustered LVM on top of those, and finally qemu-kvm using virtio on
top of the resulting logical volumes. So there are plenty of possible
culprits, but your mention that virtio is unsafe with cache=writethrough
(which is the default for drive types other than qcow) makes me suspect that
this might be the cause of the problem. Databases are sensitive to request
reordering, so I guess using virtio for postgres storage was quite stupid of
me :(
So my question is, could you please advise me a bit on the storage
configuration? virtio performed much better than SCSI, but of course
data integrity is crucial, so would you suggest using SCSI instead?
DRBD doesn't have a problem with barriers, and clustered LVM SHOULD not
have problems with them either, as we're using just striped volumes, but
I'll check to be sure. So is it safe for me to keep cache=writethrough
for the database volume?

thanks a lot in advance for any hints!

with best regards

nik

On Tue, Aug 25, 2009 at 08:11:20PM +0200, Christoph Hellwig wrote:
> As various people wanted to know how the various data integrity patches
> I've sent out recently play together, here's a small writeup on what
> issues we have in QEMU and how to fix them:
> 
> There are two major aspects of data integrity we need to care about in
> the QEMU block I/O code:
> 
>  (1) stable data storage - we must be able to force data out of caches
>      onto the stable media, and we must get completion notification for it.
>  (2) request ordering - we must be able to make sure some I/O requests
>      do not get reordered with other in-flight requests before or after
>      them.
> 
> Linux uses two related abstractions to implement this (other operating
> systems are probably similar):
> 
>  (1) a cache flush request that flushes the whole volatile write cache to
>      stable storage
>  (2) a barrier request, which
>       (a) is guaranteed to actually go all the way to stable storage
>       (b) is not reordered versus any requests before or after it
> 
> For disks not using volatile write caches the cache flush is a no-op,
> and barrier requests are implemented by draining the queue of
> outstanding requests before the barrier request, and only allowing new
> requests to proceed after it has finished.  Instead of the queue drain,
> tag ordering could be used, but at this point that is not the case in
> Linux.
> 
> For disks using volatile write caches, the cache flush is implemented by
> a protocol-specific request, and barrier requests are implemented by
> performing cache flushes before and after the barrier request, in
> addition to the draining mentioned above.  The second cache flush can be
> replaced by setting the "Force Unit Access" bit on the barrier request
> on modern disks.
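
The barrier sequence described above (drain, pre-flush, barrier write, then
post-flush or FUA) can be illustrated with a toy model. This is only a
minimal Python sketch under simplifying assumptions: ModelDisk and
submit_barrier are hypothetical names invented for illustration, and nothing
here is QEMU or Linux kernel code.

```python
from dataclasses import dataclass, field


@dataclass
class ModelDisk:
    """Toy disk with an optional volatile write cache (hypothetical model)."""
    has_write_cache: bool = True
    cache: dict = field(default_factory=dict)   # volatile, lost on power failure
    media: dict = field(default_factory=dict)   # stable storage
    log: list = field(default_factory=list)     # operations in the order issued

    def write(self, sector, data, fua=False):
        self.log.append(("write", sector, fua))
        if self.has_write_cache and not fua:
            self.cache[sector] = data
        else:
            # FUA (or no cache at all): the data goes straight to stable media.
            self.cache.pop(sector, None)
            self.media[sector] = data

    def flush(self):
        # Cache flush: move everything from the volatile cache to the media.
        self.log.append(("flush",))
        self.media.update(self.cache)
        self.cache.clear()


def submit_barrier(disk, pending, barrier, use_fua=False):
    """Issue `barrier` (sector, data) ordered against the `pending` writes."""
    # 1. Drain: complete every request queued before the barrier.
    while pending:
        disk.write(*pending.pop(0))
    if not disk.has_write_cache:
        # Without a volatile cache the flush is a no-op; draining is enough.
        disk.write(*barrier)
        return
    # 2. Pre-flush: earlier writes must be stable before the barrier write.
    disk.flush()
    # 3. The barrier write itself, then a post-flush unless FUA replaces it.
    disk.write(*barrier, fua=use_fua)
    if not use_fua:
        disk.flush()
```

With a write cache the logged sequence is write(s), flush, write, flush; with
use_fua=True the trailing flush disappears, and with has_write_cache=False
only the drained writes and the barrier write remain.
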
> 
> 
> The above is supported by the QEMU emulated disks in the following way:
> 
>   - The IDE disk emulation implements the ATA WIN_FLUSH_CACHE/
>     WIN_FLUSH_CACHE_EXT commands to flush the drive cache, but does not
>     indicate a volatile write cache in the ATA IDENTIFY command.  Because
>     of that guests do not actually send down cache flush requests.  Linux
>     guests do however drain the I/O queues to guarantee ordering in absence
>     of volatile write caches.
>   - The SCSI disk emulation implements the SCSI SYNCHRONIZE_CACHE command,
>     and also advertises the write cache enabled bit.  This means Linux
>     sends down cache flush requests to implement barriers, and provides
>     sufficient queue draining.
>   - The virtio-blk driver does not implement any cache flush command.
>     And while there is a virtio-blk feature bit for barrier support,
>     it is not supported by virtio-blk.  Due to the lack of a cache flush
>     command it also is insufficient to implement the required data
>     integrity semantics.  Currently the Linux virtio-blk driver does not
>     advertise any form of barrier support, and we don't even get the
>     queue draining required for proper operation in a cache-less
>     environment.
> 
> The I/O from these front end drivers maps to different host kernel I/O
> patterns depending on the cache= option of the drive command line.  There
> are three choices for it:
> 
>  (a) cache=writethrough
>  (b) cache=writeback
>  (c) cache=none
> 
> (a) means all writes are synchronous (O_DSYNC), which means the host
>     kernel guarantees us that data is on stable storage once the I/O
>     request has completed.
>     In cache=writethrough mode the IDE and SCSI drivers are safe because
>     the queue is properly drained to guarantee I/O ordering.  Virtio-blk
>     is not safe due to the lack of queue draining.
> (b) means we use regular buffered writes and need a fsync/fdatasync to
>     actually guarantee that data is stable on disk.
>     In cache=writeback mode only the SCSI emulation is safe, as all others
>     miss the cache flush requests.
> (c) means we use direct I/O (O_DIRECT) to bypass the host cache and
>     perform direct dma to/from the I/O buffer in QEMU.  While direct I/O
>     bypasses the host cache it does not guarantee flushing of volatile
>     write caches in disks, nor completion of metadata operations in
>     filesystems (e.g. block allocations).
>     In cache=none mode only the SCSI emulation is entirely safe right now
>     due to the lack of cache flushes in the other drivers.
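
The three cache= modes and the safety verdicts above can be summarised as a
small lookup table. This is a minimal Python sketch, not QEMU code: the names
CACHE_FLAGS, SAFE, open_flags and is_safe are made up for illustration, and
the SAFE table only restates what the mail says about the code at the time
it was written.

```python
import os

# O_DSYNC and O_DIRECT are not available on every platform; fall back to 0
# so that the sketch still imports elsewhere (assumption for portability).
O_DSYNC = getattr(os, "O_DSYNC", 0)
O_DIRECT = getattr(os, "O_DIRECT", 0)

# Extra open(2) flag implied by each cache= mode, as described in the mail.
CACHE_FLAGS = {
    "writethrough": O_DSYNC,   # synchronous writes
    "writeback": 0,            # plain buffered writes, needs fsync/fdatasync
    "none": O_DIRECT,          # bypass the host page cache
}

# Safety per emulated front end, restating the mail (state at that time).
SAFE = {
    "writethrough": {"ide": True, "scsi": True, "virtio": False},
    "writeback": {"ide": False, "scsi": True, "virtio": False},
    "none": {"ide": False, "scsi": True, "virtio": False},
}


def open_flags(cache_mode):
    """Return the open(2) flags a read-write image open would use."""
    return os.O_RDWR | CACHE_FLAGS[cache_mode]


def is_safe(cache_mode, frontend):
    """Look up whether the mail considers this combination safe."""
    return SAFE[cache_mode][frontend]
```

For example, is_safe("writethrough", "virtio") is False, which is the
combination asked about in the reply above.
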
> 
> 
> Action plan for the guest drivers:
> 
>  - virtio-blk needs to advertise an ordered queue by default.
>    This makes cache=writethrough safe on virtio.
> 
> Action plan for QEMU:
> 
>  - IDE needs to set the write cache enabled bit
>  - virtio needs to implement a cache flush command and advertise it
>    (also needs a small change to the host driver)
>  - we need to implement an aio_fsync to not stall the vcpu on cache
>    flushes
>  - investigate only advertising a write cache when we really have one
>    to avoid the cache flush requests for cache=writethrough
> 
> Notes on disk cache flushes on Linux hosts:
> 
>  - barrier requests and cache flushes are supported by all local
>    disk filesystems in popular use (btrfs, ext3, ext4, reiserfs, XFS).
>    However unlike the other filesystems ext3 does _NOT_ enable barriers
>    and cache flush requests by default.
>  - currently O_SYNC writes or fsync on block device nodes do not
>    flush the disk cache.
>  - currently none of the filesystems nor the direct access to the block
>    device nodes implements flushes of the disk caches when using
>    O_DIRECT|O_DSYNC or using fsync/fdatasync after an O_DIRECT request.
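
For reference, the buffered-write-then-sync pattern that cache=writeback
relies on looks roughly like the following from user space. This is a
minimal Python sketch on a regular file; durable_write and demo_roundtrip
are hypothetical helper names. As the notes above point out, whether the
sync actually reaches the disk's volatile cache depends on the filesystem
and its barrier settings, so the sketch shows the call sequence only and
does not prove durability.

```python
import os
import tempfile


def durable_write(path, data):
    """Write `data` with buffered I/O, then ask the kernel to sync it."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than requested; loop until done.
            written = os.write(fd, view)
            view = view[written:]
        # Prefer fdatasync where the platform offers it, else use fsync.
        sync = getattr(os, "fdatasync", os.fsync)
        sync(fd)
    finally:
        os.close(fd)


def demo_roundtrip():
    """Write a record into a temporary directory and read it back."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "record.bin")
        durable_write(path, b"commit record")
        with open(path, "rb") as handle:
            return handle.read()
```
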
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

-- 
-------------------------------------
Nikola CIPRICH
LinuxBox.cz, s.r.o.
28. rijna 168, 709 01 Ostrava

tel.:   +420 596 603 142
fax:    +420 596 621 273
mobil:  +420 777 093 799

www.linuxbox.cz

mobil servis: +420 737 238 656
email servis: servis@xxxxxxxxxxx
-------------------------------------