Re: sync writes - expected performance?

On 12/14/2015 04:49 AM, Nikola Ciprich wrote:
Hello,

I'm doing some measurements on a test cluster (3 nodes) and see a strange
performance drop for sync writes.

I'm using SSDs for both the journals and the OSD data. They should be suitable
for journal use, giving about 16.1K IOPS (67MB/s) for sync IO.

(measured using fio --filename=/dev/xxx --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test)

On top of this cluster, I have a KVM guest running (using the qemu librbd backend).
Overall performance seems to be quite good, but the problem appears when I try
to measure sync IO performance inside the guest. I'm getting only about 600 IOPS,
which I think is quite poor.

The problem is, I don't see any bottleneck. The OSD daemons don't seem to be waiting
on IO or hogging CPU, and the qemu process is not heavily loaded either.

I'm using Hammer 0.94.5 on top of CentOS 6 (4.1 kernel), with all debugging disabled.

My question is, what results can I expect for synchronous writes? I understand
there will always be some performance drop, but 600 IOPS on top of storage which
can give as much as 16K IOPS seems too little.

So basically what this comes down to is latency. Since you get 16K IOPS for O_DSYNC writes on the SSD, there's a good chance that it has a super-capacitor on board and can acknowledge a write as complete as soon as it hits the on-board cache rather than when it's written to flash. Your fio test ran a single job at queue depth 1, so 16K O_DSYNC IOPS means that each IO is completing in around 0.06ms on average. That's very fast! At 600 IOPS for O_DSYNC writes in your guest, you're looking at about 1.6ms per IO on average.
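To make that arithmetic explicit, here is a minimal Python sketch. The function name `latency_ms` is made up for illustration, and it assumes IOs are issued strictly one at a time unless a larger queue depth is passed in:

```python
def latency_ms(iops: float, iodepth: int = 1) -> float:
    """Average per-IO latency in milliseconds implied by an IOPS figure.

    With `iodepth` IOs in flight, each IO takes iodepth / iops seconds
    on average (Little's law); at iodepth=1 this is simply 1 / iops.
    """
    return 1000.0 * iodepth / iops


# Figures from this thread: raw SSD journal test vs. writes inside the guest.
print(f"SSD   @ 16000 IOPS: {latency_ms(16000):.4f} ms per IO")
print(f"guest @   600 IOPS: {latency_ms(600):.4f} ms per IO")
```

The same relation read the other way says that at queue depth 1 the IOPS you can get is capped at 1000 divided by the per-IO latency in milliseconds.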

So how do we account for the difference? Let's start out by looking at a quick example of network latency (This is between two random machines in one of our labs at Red Hat):

64 bytes from gqas008: icmp_seq=1 ttl=64 time=0.583 ms
64 bytes from gqas008: icmp_seq=2 ttl=64 time=0.219 ms
64 bytes from gqas008: icmp_seq=3 ttl=64 time=0.224 ms
64 bytes from gqas008: icmp_seq=4 ttl=64 time=0.200 ms
64 bytes from gqas008: icmp_seq=5 ttl=64 time=0.196 ms

Now consider that when you do a write in ceph, you write to the primary OSD, which then writes out to the replica OSDs. Every replica IO has to complete before the primary will send the acknowledgment to the client (i.e. you have to add the latency of the worst of the replica writes!). In your case, the network latency alone is likely dramatically increasing IO latency vs raw SSD O_DSYNC writes. Now add in the time to process crush mappings, look up directory and inode metadata on the filesystem where objects are stored (assuming it's not cached), and other processing time, and the 1.6ms latency for the guest writes starts to make sense.
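A rough back-of-envelope model of that write path, as a Python sketch. This is a simplification and not how ceph accounts for time internally: the function name, the assumption of two replicas, the ~0.2ms round trip (taken loosely from the ping sample above, which uses 64-byte packets rather than 4k payloads), and the assumption that the primary journals its own copy in parallel with the replica writes are all illustrative.

```python
def replicated_write_latency_ms(client_rtt_ms, replica_rtts_ms,
                                journal_ms, overhead_ms=0.0):
    """Estimate the latency of one replicated sync write at queue depth 1.

    client_rtt_ms   : client <-> primary OSD network round trip
    replica_rtts_ms : primary <-> replica round trips, one per replica
    journal_ms      : time for one O_DSYNC journal write on the SSD
    overhead_ms     : everything else (crush, metadata lookups, processing)
    """
    # Replicas are written in parallel, so the ack waits for the slowest one.
    slowest_replica = max((rtt + journal_ms for rtt in replica_rtts_ms),
                          default=0.0)
    # Assumed: the primary's own journal write overlaps the replica writes.
    return client_rtt_ms + max(journal_ms, slowest_replica) + overhead_ms


# Assumed inputs: 0.2 ms round trips, two replicas, 0.06 ms journal writes.
base = replicated_write_latency_ms(0.2, [0.2, 0.2], 0.06)
print(f"network + journal only: {base:.2f} ms, "
      f"IOPS ceiling at queue depth 1: {1000.0 / base:.0f}")
```

Under these assumed inputs, network and journal time alone already cost several times the raw SSD latency, before any of the software overhead is counted.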

Can we improve things? Likely yes. There are various areas in the code where we can trim latency, we can implement alternate OSD backends, and we can potentially use alternate network technology like RDMA to reduce network latency. The thing to remember is that when you are talking about O_DSYNC writes, even very small increases in latency can have dramatic effects on performance. Every fraction of a millisecond has huge ramifications.
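To illustrate how sensitive this is, a small Python sketch (the helper name is hypothetical, and it assumes queue depth 1) showing what happens to a 16K IOPS device when a fixed amount of latency is added to every IO:

```python
def iops_with_added_latency(base_iops: float, added_ms: float) -> float:
    """IOPS at queue depth 1 after adding `added_ms` to every IO."""
    base_latency_ms = 1000.0 / base_iops
    return 1000.0 / (base_latency_ms + added_ms)


# Start from the raw SSD figure in this thread and add small delays.
for added in (0.0, 0.05, 0.1, 0.5, 1.0):
    print(f"+{added:.2f} ms per IO: "
          f"{iops_with_added_latency(16000, added):8.0f} IOPS")
```

Because the starting latency is so small, even a tenth of a millisecond of extra delay per IO removes well over half of the achievable IOPS.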


Has anyone done similar measuring?

thanks a lot in advance!

BR

nik




_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
