Re: Urgent Help Needed (regarding rbd cache)

Hi all,

On 01.08.19 at 08:45, Janne Johansson wrote:
On Thu, 1 Aug 2019 at 07:31, Muhammad Junaid <junaid.fsd.pk@xxxxxxxxx <mailto:junaid.fsd.pk@xxxxxxxxx>> wrote:

    Your email has cleared up many things for me. Let me repeat my understanding: every critical write (e.g. from Oracle or any other DB) will be done with the sync/fsync flags, meaning it is only confirmed to the DB/app after it has actually been written to the hard drives/OSDs. Any other application can do this as well.
All other writes, like OS logs, will be confirmed immediately to the app/user but written later, passing through the kernel, the RBD cache and the physical drive cache (if any) before reaching the disks. These are susceptible to loss on power failure, but overall things are recoverable/non-critical.

That last part is probably simplified a bit. Between a program in a guest sending its data to the virtualised device, running in a KVM guest on top of an OS that has remote storage over the network, to a storage server with its own OS and drive controller chip, and finally the physical drive(s) that store the write, I suspect there will be something like ~10 layers of write caching possible, of which the RBD cache you were asking about is just one.

It is just located very conveniently before the I/O has to leave the KVM host and go back and forth over the network, so it is the last place where you can see huge gains in the guests' I/O response time. At the same time it can be shared between lots of guests on the KVM host, which should have tons of RAM available compared to any single guest, so it is a nice way to get a large cache for outgoing writes.
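To make this concrete, the librbd client-side cache is configured on the KVM host. A minimal sketch of the relevant ceph.conf options follows; the option names are the standard librbd ones, but the sizes are illustrative values only, not recommendations:

```ini
# Illustrative librbd writeback cache settings (ceph.conf on the KVM host).
# Values are examples only; tune them to your workload and available RAM.
[client]
rbd cache = true
rbd cache size = 67108864                   ; 64 MiB cache per image
rbd cache max dirty = 50331648              ; throttle writeback above this
rbd cache writethrough until flush = true   ; stay writethrough until the guest issues its first flush
```

The last option is the safety net mentioned below: the cache behaves as writethrough until the guest proves it sends flushes, so an old guest kernel that never flushes cannot silently lose acknowledged writes.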

Also, to answer your first part: yes, all critical software that depends heavily on write ordering and integrity is hopefully already doing write operations that way, using sync(), fsync(), fdatasync() and similar calls, but I can't produce a list of all programs that do. Since there are already many layers of delayed, cached writes even without virtualisation and/or Ceph, applications that matter have mostly learned their lessons by now, so chances are very high that all your important databases and similar programs are doing the right thing.
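The distinction between the two kinds of writes discussed above can be sketched in a few lines of Python (file name and contents are made up for illustration): a plain write() is acknowledged once the data reaches the kernel page cache, while an fsync() blocks until the data has been pushed through the caching layers to stable storage.

```python
import os
import tempfile

# Hypothetical file standing in for application data.
path = os.path.join(tempfile.mkdtemp(), "journal.log")

# Buffered write: write() returns as soon as the data is in the kernel
# page cache. On power failure it may never reach the disk.
with open(path, "w") as f:
    f.write("critical transaction record\n")

# Durable write: flush the userspace buffer, then ask the kernel to
# flush the data (and metadata) all the way down to the device before
# acknowledging -- this is what databases do for critical writes.
with open(path, "a") as f:
    f.write("second record\n")
    f.flush()             # drain Python's own buffer into the kernel
    os.fsync(f.fileno())  # block until the data is on stable storage

print(os.path.getsize(path))  # both records are now durable
```

In a VM the fsync() is what forces the write out through the RBD cache and down to the OSDs; buffered writes can sit in any of the intermediate caches until something flushes them.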

Just to add to this: one such piece of software, for which people cared a lot, is of course a file system itself. BTRFS is notably very sensitive to broken flush / FUA ( https://en.wikipedia.org/wiki/Disk_buffer#Force_Unit_Access_(FUA) ) implementations at any layer of the I/O path, due to its rather complicated metadata structure.
While for in-kernel and other open source software (such as librbd) there are usually a lot of people checking the code for a correct implementation and testing things, there is also broken hardware (or rather, firmware) in the wild.

But there are even software issues around if you think more generally and strive for data correctness (since corruption can happen at any layer):
I was hit by an in-kernel issue in the past (a network driver writing network statistics via DMA to the wrong memory location - "sometimes"),
which corrupted two BTRFS partitions of mine and caused random crashes in browsers and mail clients. BTRFS was only hardened in kernel 5.2 to check the metadata tree before flushing it to disk.

If you are curious about known hardware issues, check out this lengthy but very insightful mail on the linux-btrfs list:
https://lore.kernel.org/linux-btrfs/20190623204523.GC11831@xxxxxxxxxxxxxx/
As you can learn there, many drive and firmware combinations out there do not implement flush / FUA correctly, and your BTRFS may be corrupted after a power failure. The very same thing can happen to Ceph,
but replication across several OSDs and the lower probability of having broken disks in all hosts make this issue less likely.

For what it is worth, we also use writeback caching for our virtualization cluster and are very happy with it. We also tried pulling power plugs on hypervisors, MONs and OSDs at random times during writes, and ext4 could always recover easily with an fsck
making use of the journal.

Cheers and HTH,
	Oliver


But if the guest is instead running a mail filter that does antivirus checks, spam checks and so on, operating on files that live on the machine for something like one second and then either get dropped or sent on to the destination mailbox somewhere else, then having aggressive write caches would be very useful: the effect of a crash would still mostly be "the emails that were in the queue were lost, were never acked by the final mail server, and will probably be resent by the previous SMTP server". For such a guest VM, forcing sync writes would only be a net loss; it would gain much from having large RAM write caches.

--
May the most significant bit of your life be positive.

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



