Re: Synchronous writes - tuning and some thoughts about them?

Thanks for a very helpful answer.
So if I understand it correctly, then what I want (crash consistency with RPO > 0) isn't currently possible in any way.
And if there is no ordering in the RBD cache, then ignoring barriers sounds like a very bad idea as well.

Any thoughts on ext4 with journal_async_commit? That should be safe in any circumstance, but it’s pretty hard to test that assumption…

Is anyone running big database (OLTP) workloads on Ceph? What did you do to make them perform? Out of the box we are all limited to the same ~100 transactions/s (with 5 ms write latency)…
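
For anyone wondering where that ceiling comes from: below is a back-of-the-envelope sketch (the /mnt/rbd path is hypothetical; point it at a file on the RBD-backed filesystem) that measures per-commit latency and the single-stream ceiling it implies. At ~5 ms per fdatasync that is about 200 serialized commits/s, or roughly 100 transactions/s if each transaction needs two synchronous writes. Running it with and without journal_async_commit would at least show the latency side of that question.

import os, time

# Back-of-the-envelope probe; /mnt/rbd/fsync-probe is a hypothetical path.
N = 200
fd = os.open("/mnt/rbd/fsync-probe", os.O_RDWR | os.O_CREAT, 0o644)

start = time.time()
for _ in range(N):
    os.pwrite(fd, b"x" * 4096, 0)   # one 4 KiB record...
    os.fdatasync(fd)                # ...committed synchronously
elapsed = time.time() - start
os.close(fd)

lat = elapsed / N
print("avg sync-write latency: %.2f ms" % (lat * 1e3))
print("serialized ceiling:     %.0f commits/s" % (1.0 / lat))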

Jan

> On 03 Jun 2015, at 02:08, Josh Durgin <jdurgin@xxxxxxxxxx> wrote:
> 
> On 06/01/2015 03:41 AM, Jan Schermer wrote:
>> Thanks, that’s it exactly.
>> But I think that's really too much work for now, which is why I'd like to see a quick win by using the local RBD cache for now - that would suffice for most workloads (not many people run big databases on Ceph yet, and those who do must be aware of this).
>> 
>> The issue is - and I have not yet seen an answer to this - would it be safe as it is now if flushes were ignored (rbd cache = unsafe), or would it completely b0rk the filesystem when it isn't flushed properly?
> 
> Generally the latter. Right now flushes are the only thing enforcing
> ordering for rbd. As a block device it doesn't guarantee that e.g. the
> extent at offset 0 is written before the extent at offset 4096 unless
> it sees a flush between the writes.
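
(To make the ordering point above concrete: a minimal sketch, assuming a file on an RBD-backed filesystem mounted with barriers enabled; the path is hypothetical. The fdatasync() between the two writes is what reaches rbd as a flush and forces the first extent to be persisted before the second; drop it and either write may hit the backing store first.)

import os

# Minimal sketch (hypothetical path): the fdatasync() between the two writes
# is the flush that enforces their ordering on the backing store.
fd = os.open("/mnt/rbd/ordering-demo", os.O_RDWR | os.O_CREAT, 0o644)

os.pwrite(fd, b"J" * 4096, 0)      # e.g. a journal record at offset 0
os.fdatasync(fd)                   # flush: the journal record must land first
os.pwrite(fd, b"C" * 4096, 4096)   # the commit block at offset 4096
os.fdatasync(fd)                   # make the commit durable as well

os.close(fd)
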
> 
> As suggested earlier in this thread, maintaining order during writeback
> would make not sending flushes (via mount -o nobarrier in the guest or
> cache=unsafe for qemu) safer from a crash-consistency point of view.
> 
> An fs or database on top of rbd would still have to replay their
> internal journal, and could lose some writes, but should be able to
> end up in a consistent state that way. This would make larger caches
> more useful, and would be a simple way to use a large local cache
> device as an rbd cache backend. Live migration should still work in
> such a system because qemu will still tell rbd to flush data at that
> point.
> 
> A distributed local cache like [1] might be better long term, but
> much more complicated to implement.
> 
> Josh
> 
> [1] https://www.usenix.org/conference/fast15/technical-sessions/presentation/bhagwat
> 

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




