Synchronous writes - tuning and some thoughts about them?

Hi Jan,

I share your frustrations with slow sync writes. I'm exporting RBDs via iSCSI to ESX, which seems to do most operations as 64k sync IOs. You can do a fio run and impress yourself with the numbers you can get out of the cluster, but that doesn't translate into what you can achieve when a client is doing sync writes.
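
For anyone who wants to see the gap for themselves, something along these lines (untested as written here - the device path and run lengths are just placeholders) contrasts a deep-queue fio run with the single-outstanding sync writes an ESX-style client actually issues:

    # the "impressive" number: 64k random writes, 32 in flight
    fio --name=deepq --filename=/dev/rbd0 --rw=randwrite --bs=64k \
        --ioengine=libaio --direct=1 --iodepth=32 --runtime=60 --time_based

    # what the iSCSI client really does: one synchronous 64k write at a time
    fio --name=syncw --filename=/dev/rbd0 --rw=write --bs=64k \
        --ioengine=psync --sync=1 --direct=1 --iodepth=1 --runtime=60 --time_based

The second run is bounded by per-write round-trip latency rather than cluster throughput, which is exactly the discrepancy described above.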

I too have been experimenting with flashcache/enhanceio, with the goal of using dual-port SAS SSDs to allow for HA iSCSI gateways. Currently I'm just testing with a single iSCSI server and see a massive improvement. I'm interested in the corruption you have been experiencing on host crashes - are you implying that you think flashcache is buffering writes before submitting them to the SSD? Watching its behaviour with iostat, it looks like it submits everything to the SSD in 4k IOs, which suggests to me that it is not buffering.
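
For reference, this is roughly how I watched it (the device name is a placeholder; on my version of sysstat the average request size shows up as avgrq-sz in 512-byte sectors, so pure 4k IOs come out at around 8):

    # 1-second samples of the caching SSD; avgrq-sz around 8 sectors = 4k writes
    iostat -x 1 /dev/sdX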

I did raise a topic a few months back asking about the possibility of librbd supporting persistent caching to SSDs, which would allow writeback caching regardless of whether the client requests a flush. Although there was some interest in the idea, I didn't get the feeling it was at the top of anyone's list of priorities.

Nick

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces at lists.ceph.com] On Behalf Of
> Jan Schermer
> Sent: 25 May 2015 09:59
> To: ceph-users at lists.ceph.com
> Subject: Synchronous writes - tuning and some thoughts about
> them?
> 
> Hi,
> I have a full-SSD cluster on my hands, currently running Dumpling, with plans
> to upgrade soon, and OpenStack with RBD on top of that. While I am overall
> quite happy with the performance (it scales well across clients), there is one
> area where it really fails badly - big database workloads.
> 
> Typically, what a well-behaved database does is commit every transaction to
> disk before confirming it, so on a "typical" cluster with a write latency
> of 5ms (with SSD journal) the maximum number of transactions per second
> for a single client is 200 (more likely around 100, depending on the filesystem).
> Now, that's not _too_ bad when running hundreds of small databases, but
> it's nowhere near the performance required to substitute for an existing SAN or
> even just a simple RAID array with writeback cache.
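>
> (Back-of-envelope: at 5 ms per synchronous commit,
>     1 s / 5 ms = 200 commits/s for a single serialised client,
> and if the filesystem effectively doubles the sync writes per commit -
> journal plus data - that drops to roughly 100.)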
> 
> First hope was that enabling the RBD cache would help - but it really doesn't,
> because all the flushes (O_DIRECT writes) end up on the drives and not in the
> cache. Disabling barriers in the client helps, but that leaves the filesystem not
> crash-consistent (unless one uses ext4 with journal_checksum etc.; I am going
> to test that soon).
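>
> For completeness, the client-side cache settings I have in mind look roughly
> like this (I'm writing the option names from memory, so treat it as a sketch) -
> and none of it helps here, because every flush still has to reach the OSDs:
>
>     [client]
>         rbd cache = true
>         rbd cache size = 67108864                  # 64 MB
>         rbd cache max dirty = 50331648
>         rbd cache writethrough until flush = true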
> 
> Are there any plans to change this behaviour - i.e. make the cache a real
> writeback cache?
> 
> I know there are good reasons not to do this, and I commend the developers
> for designing the cache this way, but real world workloads demand shortcuts
> from time to time - for example MySQL with its InnoDB engine has an option
> to only commit to disk every Nth transaction - and this is exactly the kind of
> thing I'm looking for. Not having every confirmed transaction/write on the
> disk is not a huge problem; having a b0rked filesystem is, so this should be
> safe as long as I/O order is preserved. Sadly, my database is not InnoDB,
> where I could tune something, but an enterprise behemoth that traditionally
> runs on FC arrays; it has no parallelism (that I could find) and always uses
> O_DIRECT for its txlog.
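>
> (On MySQL the knob I mean is innodb_flush_log_at_trx_commit; roughly this in
> my.cnf trades up to a second of committed transactions for not fsyncing on
> every commit - nothing comparable exists for my database, sadly.)
>
>     [mysqld]
>     # 2 = write the redo log at each commit but fsync it only about once a second
>     # (0 defers the write too); 1 is the fully durable default
>     innodb_flush_log_at_trx_commit = 2
>     sync_binlog = 0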
> 
> (For the record - while the array is able to swallow 30K IOPS for a minute,
> once the cache is full it slows to ~3 IOPS, while Ceph happily gives the same
> 200 IOPS forever. Bottom line: you always need more disks or more cache,
> and your workload should always be able to run without the cache anyway -
> even enterprise arrays fail, and write cache is not always available, contrary
> to popular belief.)
> 
> Is there some option that we could use right now to turn on a true writeback
> caching? Losing a few transactions is fine as long as ordering is preserved.
> I was thinking 'cache=unsafe' (see the sketch further down) but I have no idea
> whether I/O order is preserved with that.
> I already mentioned turning off barriers, which could be safe in some setups
> but needs testing.
> Upgrading from Dumpling will probably help with scaling, but will it help write
> latency? I would need to get from 5ms/write to <1ms/write.
> I investigated guest-side caching (enhanceio/flashcache) but that fails really
> badly when the guest or host crashes - lots of corruption. EnhanceIO in
> particular looked very nice and claims to respect barriers... not in my
> experience, though.
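>
> To be concrete about the cache=unsafe idea, this is the sort of libvirt disk
> definition I mean (the pool/image name is a placeholder and the monitor/auth
> elements are omitted; as far as I know 'unsafe' simply drops the guest's
> flushes, while 'writeback' still honours them):
>
>     <disk type='network' device='disk'>
>       <!-- cache='writeback' respects flushes; cache='unsafe' ignores them -->
>       <driver name='qemu' type='raw' cache='unsafe'/>
>       <source protocol='rbd' name='volumes/myvolume'/>
>       <target dev='vda' bus='virtio'/>
>     </disk>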
> 
> It might seem that what I want is evil, and it really is if you're running a
> banking database, but for most people this is exactly what is missing to make
> their workloads run without having some sort of 80s SAN system in their
> datacentre. I think everyone here would appreciate that :-)
> 
> Thanks
> 
> Jan
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





