Re: sync writes - expected performance?

Mark Nelson <mnelson@xxxxxxxxxx> · Mon, 14 Dec 2015 11:03:16 -0600

On 12/14/2015 04:49 AM, Nikola Ciprich wrote:
Hello,

I'm doing some measurements on a test (3-node) cluster and see a strange performance
drop for sync writes.

I'm using SSD for both journalling and OSD. It should be suitable for
journal, giving about 16.1KIOPS (67MB/s) for sync IO.

(measured using fio --filename=/dev/xxx --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test)

On top of this cluster, I have a KVM guest running (using the qemu librbd backend).
Overall performance seems to be quite good, but the problem is when I try
to measure sync IO performance inside the guest. I'm getting only about 600 IOPS,
which I think is quite poor.

The problem is, I don't see any bottleneck: the OSD daemons don't seem to be hanging on
IO or hogging CPU, and the qemu process is not heavily loaded either.

I'm using Hammer 0.94.5 on top of CentOS 6 (4.1 kernel), with all debugging disabled.

My question is: what results can I expect for synchronous writes? I understand
there will always be some performance drop, but 600 IOPS on top of storage which
can give as much as 16K IOPS seems too little.

So basically what this comes down to is latency.  Since you get 16K IOPS 
for O_DSYNC writes on the SSD, there's a good chance that it has a 
super-capacitor on board and can basically acknowledge a write as 
complete as soon as it hits the on-board cache rather than when it's 
written to flash.  Figure that 16K O_DSYNC IOPS means that each IO 
is completing in around 0.06ms on average.  That's very fast!  At 600 
IOPS for O_DSYNC writes on your guest, you're looking at about 1.6ms per 
IO on average.
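
The conversion above can be sketched in a few lines of Python. This assumes a queue depth of 1 (as in the fio run with --iodepth=1 --numjobs=1), so average latency is simply the inverse of IOPS; the function name is only illustrative.

```python
def latency_ms(iops: float, iodepth: int = 1) -> float:
    """Average per-IO latency in milliseconds at a fixed queue depth.

    With iodepth IOs in flight, latency = iodepth / IOPS (Little's law).
    """
    return iodepth / iops * 1000.0


ssd = latency_ms(16_100)  # raw SSD figure from the fio journal test
guest = latency_ms(600)   # figure observed inside the KVM guest

print(f"raw SSD: {ssd:.3f} ms per IO")
print(f"guest:   {guest:.3f} ms per IO")
print(f"guest IO takes {guest / ssd:.1f}x longer than a raw SSD IO")
```

The guest figure works out to roughly 1.67ms, in line with the "about 1.6ms" estimate.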

So how do we account for the difference?  Let's start out by looking at 
a quick example of network latency (This is between two random machines 
in one of our labs at Red Hat):

64 bytes from gqas008: icmp_seq=1 ttl=64 time=0.583 ms
64 bytes from gqas008: icmp_seq=2 ttl=64 time=0.219 ms
64 bytes from gqas008: icmp_seq=3 ttl=64 time=0.224 ms
64 bytes from gqas008: icmp_seq=4 ttl=64 time=0.200 ms
64 bytes from gqas008: icmp_seq=5 ttl=64 time=0.196 ms

Now consider that when you do a write in Ceph, you write to the primary 
OSD, which then writes out to the replica OSDs.  Every replica IO has to 
complete before the primary will send the acknowledgment to the client 
(i.e., you have to add the latency of the worst of the replica writes!). 
In your case, the network latency alone is likely dramatically 
increasing IO latency vs raw SSD O_DSYNC writes.  Now add in the time to 
process CRUSH mappings, look up directory and inode metadata on the 
filesystem where objects are stored (assuming it's not cached), and 
other processing time, and the 1.6ms latency for the guest writes starts 
to make sense.
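
A rough latency budget makes this concrete. The sketch below is a deliberately simplified model, not Ceph's actual code path: it assumes one client-to-primary round trip, replica writes issued in parallel with the primary's own journal write, and an acknowledgment only after the slowest of them finishes. The 0.2ms round-trip time is taken from the ping sample above and the 0.06ms journal write from the fio result; the two-replica layout and the function name are assumptions for illustration.

```python
def write_latency_ms(client_rtt, replica_rtts, journal_ms, overhead_ms=0.0):
    """Simplified model of one replicated sync write (all values in ms).

    Assumption: the primary journals locally while replicas are written
    in parallel, and the ack waits for the slowest path.
    """
    replica_paths = [rtt + journal_ms for rtt in replica_rtts]
    slowest = max([journal_ms] + replica_paths)
    return client_rtt + slowest + overhead_ms


# Network and journal only, with two replicas (assumed 3x replication).
floor_ms = write_latency_ms(0.2, [0.2, 0.2], 0.06)
observed_ms = 1000.0 / 600  # about 1.67 ms per IO seen in the guest

print(f"network + journal floor: {floor_ms:.2f} ms "
      f"(at most {1000.0 / floor_ms:.0f} IOPS at queue depth 1)")
print(f"left for CRUSH, filesystem and other processing: "
      f"{observed_ms - floor_ms:.2f} ms")
```

Even in this optimistic model, network round trips alone cap a single-threaded sync writer at a small fraction of the raw SSD's IOPS; the remainder of the observed latency would be software overhead.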

Can we improve things?  Likely yes.  There are various areas in the code 
where we can trim latency, and we can also implement alternate OSD 
backends and potentially use alternate network technology like RDMA to 
reduce network latency.  The thing to remember is that when you are talking about 
O_DSYNC writes, even very small increases in latency can have dramatic 
effects on performance.  Every fraction of a millisecond has huge 
ramifications.
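
To illustrate that sensitivity, here is a small sketch that starts from the raw SSD latency (1/16.1K IOPS) and adds fixed amounts of extra latency per IO. The added values are arbitrary illustrative steps, not measurements, and queue depth 1 is assumed.

```python
def iops_at(latency_ms: float) -> float:
    """IOPS achievable at queue depth 1 for a given per-IO latency in ms."""
    return 1000.0 / latency_ms


base_ms = 1000.0 / 16_100  # raw SSD O_DSYNC latency from the fio test

# Arbitrary illustrative increments of added per-IO latency.
for added_ms in (0.0, 0.05, 0.1, 0.2, 0.5, 1.0):
    total_ms = base_ms + added_ms
    print(f"+{added_ms:.2f} ms -> {iops_at(total_ms):8.0f} IOPS")
```

Adding a single 0.2ms network round trip to each IO is already enough to push the result below 4K IOPS.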

Has anyone done similar measurements?

thanks a lot in advance!

BR

nik

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
