Re: sync writes - expected performance?

Even with 10G ethernet, the bottleneck is neither the network nor the drives (assuming they are datacenter-class). The bottleneck is the software.
The only way to improve that is to either increase CPU speed (more GHz per core) or to simplify the datapath the IO has to take before it is considered durable.
Stuff like RDMA will help only if there is zero-copy between the (RBD) client and the drive, or if the write is acknowledged once it sits in the remote buffers of the replicas (but it still has to come from the client directly, or RDMA becomes a bit pointless, IMHO).
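As a rough back-of-the-envelope (every number below is invented, just to illustrate the point): at queue depth 1 you get the reciprocal of the end-to-end latency, so the software portion of the budget is where the IOPS go:

    def qd1_iops(total_latency_ms):
        # with a single outstanding sync write, throughput is simply the
        # reciprocal of the end-to-end per-IO latency
        return 1000.0 / total_latency_ms

    network_ms  = 0.2   # client <-> primary plus the replication hop
    drive_ms    = 0.06  # O_DSYNC write on a datacenter SSD
    software_ms = 1.3   # messengers, PG locking, journaling, context switches

    print(round(qd1_iops(network_ms + drive_ms + software_ms)))  # ~640 IOPS
    print(round(qd1_iops(network_ms + drive_ms + 0.5)))          # ~1300 IOPS with
                                                                 # a leaner software path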

Databases do sync writes for a reason: O_DIRECT on its own doesn't actually make strong guarantees about ordering or buffering, though in practice the race window is negligible.
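For clarity, a minimal Linux-flavoured sketch of what a "sync write" means here (the path and sizes are invented; fio's --sync=1 asks for the same O_SYNC behaviour on most ioengines, IIRC):

    import os

    path = "/var/tmp/durability-demo"   # illustrative test file

    # buffered write: returns as soon as the data is in the page cache
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    os.write(fd, b"a" * 4096)           # fast, but not yet on stable storage
    os.close(fd)

    # sync write: O_DSYNC makes every write() wait for stable storage,
    # which is the cost databases knowingly pay for durability
    fd = os.open(path, os.O_WRONLY | os.O_DSYNC)
    os.write(fd, b"b" * 4096)           # returns only after the write is durable
    os.close(fd)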

Your 600 IOPS are pretty good actually.

Jan


> On 14 Dec 2015, at 22:58, Warren Wang - ISD <Warren.Wang@xxxxxxxxxxx> wrote:
> 
> Whoops, I misread Nikola's original email, sorry!
> 
> If all your SSDs are performing at that level for sync IO, then I
> agree that it's down to other things, like network latency and PG locking.
> Sequential 4K writes with 1 thread and a queue depth of 1 are probably the
> worst performance you'll see. Is there a router between your VM and the Ceph
> cluster, or one between Ceph nodes for the cluster network?
> 
> Are you using dsync at the VM level to simulate what a database or other
> app would do? If you can switch to direct IO, you'll likely get far better
> performance. 
> 
> Warren Wang
> 
> 
> 
> 
> On 12/14/15, 12:03 PM, "Mark Nelson" <mnelson@xxxxxxxxxx> wrote:
> 
>> 
>> 
>> On 12/14/2015 04:49 AM, Nikola Ciprich wrote:
>>> Hello,
>>> 
>>> I'm doing some measurements on a test (3-node) cluster and I see a strange
>>> performance
>>> drop for sync writes..
>>> 
>>> I'm using SSD for both journalling and OSD. It should be suitable for
>>> the journal, giving about 16.1K IOPS (67MB/s) for sync IO.
>>> 
>>> (measured using fio --filename=/dev/xxx --direct=1 --sync=1 --rw=write
>>> --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based
>>> --group_reporting --name=journal-test)
>>> 
>>> On top of this cluster, I have a KVM guest running (using the qemu librbd
>>> backend).
>>> Overall performance seems to be quite good, but the problem is when I
>>> try
>>> to measure sync IO performance inside the guest.. I'm getting only
>>> about 600 IOPS,
>>> which I think is quite poor.
>>> 
>>> The problem is, I don't see any bottleneck: OSD daemons don't seem to
>>> be blocked on
>>> IO or hogging CPU, and the qemu process is also not particularly
>>> loaded..
>>> 
>>> I'm using Hammer 0.94.5 on top of CentOS 6 (4.1 kernel), all debugging
>>> disabled.
>>> 
>>> my question is, what results can I expect for synchronous writes? I
>>> understand
>>> there will always be some performance drop, but 600 IOPS on top of
>>> storage which
>>> can give as much as 16K IOPS seems too little..
>> 
>> So basically what this comes down to is latency.  Since you get 16K IOPS
>> for O_DSYNC writes on the SSD, there's a good chance that it has a
>> super-capacitor on board and can basically acknowledge a write as
>> complete as soon as it hits the on-board cache rather than when it's
>> written to flash.  Figure that 16K O_DSYNC IOPS means that each IO
>> is completing in around 0.06ms on average.  That's very fast!  At 600
>> IOPS for O_DSYNC writes on your guest, you're looking at about 1.6ms per
>> IO on average.
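>> To restate that arithmetic in a couple of lines of Python (nothing new
>> here, it is just the reciprocal of the IOPS figures above):
>> 
>>     def latency_ms(iops):
>>         # average per-IO latency implied by a qd=1 IOPS figure
>>         return 1000.0 / iops
>> 
>>     print(f"{latency_ms(16000):.2f} ms")  # ~0.06 ms per raw SSD O_DSYNC write
>>     print(f"{latency_ms(600):.2f} ms")    # ~1.67 ms per O_DSYNC write in the guest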
>> 
>> So how do we account for the difference?  Let's start out by looking at
>> a quick example of network latency (This is between two random machines
>> in one of our labs at Red Hat):
>> 
>>> 64 bytes from gqas008: icmp_seq=1 ttl=64 time=0.583 ms
>>> 64 bytes from gqas008: icmp_seq=2 ttl=64 time=0.219 ms
>>> 64 bytes from gqas008: icmp_seq=3 ttl=64 time=0.224 ms
>>> 64 bytes from gqas008: icmp_seq=4 ttl=64 time=0.200 ms
>>> 64 bytes from gqas008: icmp_seq=5 ttl=64 time=0.196 ms
>> 
>> Now consider that when you do a write in Ceph, you write to the primary
>> OSD which then writes out to the replica OSDs.  Every replica IO has to
>> complete before the primary will send the acknowledgment to the client
>> (i.e. you have to add the latency of the worst of the replica writes!).
>> In your case, the network latency alone is likely dramatically
>> increasing IO latency vs raw SSD O_DSYNC writes.  Now add in the time to
>> process crush mappings, look up directory and inode metadata on the
>> filesystem where objects are stored (assuming it's not cached), and
>> other processing time, and the 1.6ms latency for the guest writes starts
>> to make sense.
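>> As a toy latency budget (every number below is invented, it just shows
>> the shape of the problem), the ack can only go back to the client once
>> the slowest of the journal writes has completed:
>> 
>>     client_net_ms  = 0.2               # client <-> primary round trip
>>     primary_cpu_ms = 0.4               # crush mapping, PG lock, queuing
>>     journal_ms     = [0.3, 0.8, 0.9]   # local write on the primary, then the
>>                                        # two replicas (incl. their network hop)
>> 
>>     # the ack can only be sent once the slowest of these writes has finished
>>     total_ms = client_net_ms + primary_cpu_ms + max(journal_ms)
>>     print(f"{total_ms:.1f} ms per IO -> ~{1000.0 / total_ms:.0f} IOPS at qd=1")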
>> 
>> Can we improve things?  Likely yes.  There are various areas in the code
>> where we can trim latency away, implement alternate OSD backends, and
>> potentially use alternate network technology like RDMA to reduce network
>> latency.  The thing to remember is that when you are talking about
>> O_DSYNC writes, even very small increases in latency can have dramatic
>> effects on performance.  Every fraction of a millisecond has huge
>> ramifications.
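>> For a feel of how sensitive qd=1 throughput is to small additions (the
>> baselines below are just illustrative):
>> 
>>     for base_ms in (0.06, 0.5, 1.6):
>>         before = 1000.0 / base_ms
>>         after = 1000.0 / (base_ms + 0.1)   # add just 0.1 ms of extra latency
>>         print(f"{base_ms} ms -> {base_ms + 0.1:.2f} ms: {before:.0f} -> {after:.0f} IOPS")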
>> 
>>> 
>>> Has anyone done similar measuring?
>>> 
>>> thanks a lot in advance!
>>> 
>>> BR
>>> 
>>> nik
>>> 
>>> 
>>> 
>>> 
> 

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



