Re: Very low 4k randread performance ~1000iops

Stephen Mercier <stephen.mercier@xxxxxxxxxxxx> · Tue, 30 Jun 2015 12:15:57 -0700

I currently have about 250 VMs, ranging from 16GB to 2TB in size. After about a week of testing, sniffing, and observing, I found that the larger read-ahead buffer causes the VM to send its reads to Ceph in larger chunks, which lets them align better with the 4MB block (object) size that Ceph uses. When I dropped the read-ahead buffer below 16MB, performance degraded almost linearly, all the way down to the 16KB standard size. When I increased it above 16MB, there were some intermittent gains, but overall nothing to write home about.
For reference, our Ceph cluster is 88TB, spread across 88 1TB SSDs. Each storage node has 100GbE connectivity, and each cloud host (Proxmox) has 40GbE. I'm able to sustain 3400 iops regularly, and have seen spikes as high as 5200+ iops in our Calamari logs. In addition, thanks to some clever use of LACP and MLAG, I'm able to sustain 3000+ iops per cloud host simultaneously. The VM workload at this time consists of MSSQL servers, MySQL servers, and BI servers (Pentaho). We also run our ELK stack and our collectd/Graphite/Grafana stack in this specific cloud.

In the end, based on my testing and investigation, the root cause of the issue is the mismatch between the block sizes used by the VMs (4KB, buffered to the 16KB default) and by Ceph (4MB blocks).
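As a rough illustration of the mismatch described above, the following Python sketch counts how many requests a 1 GiB sequential read turns into for a given read-ahead window, and how many of those requests land on each 4 MiB object. It is a deliberately simplified model, not how the guest block layer, QEMU, or librbd actually split I/O: it assumes one window-aligned request per read-ahead window and a 4 MiB RBD object size, and the function and constant names are hypothetical.

```python
# Simplified model (assumption): one window-aligned request per read-ahead
# window, 4 MiB RBD object size. Illustrative only; names are hypothetical.
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB
OBJECT_SIZE = 4 * MIB  # assumed RBD object size (the "4MB blocks" above)


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division."""
    return -(-a // b)


def model_sequential_read(total_bytes: int, window: int,
                          object_size: int = OBJECT_SIZE) -> dict:
    """Count requests, and requests per object, for an aligned sequential read."""
    requests = ceil_div(total_bytes, window)
    objects = ceil_div(total_bytes, object_size)
    return {
        "requests": requests,
        "objects": objects,
        "requests_per_object": requests / objects,
    }


if __name__ == "__main__":
    for label, window in [("16 KiB", 16 * KIB), ("16 MiB", 16 * MIB)]:
        r = model_sequential_read(1 * GIB, window)
        print(f"{label:>7}: {r['requests']:6d} requests, "
              f"{r['requests_per_object']:g} per 4 MiB object")
```

Under this model a 16 KiB window turns 1 GiB into 65536 requests (256 per object), while a 16 MiB window needs only 64 requests, each spanning four objects: 1024 times fewer round trips to the cluster for the same data.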

-- 
Stephen Mercier
Senior Systems Architect
Attainia, Inc.
Phone: 866-288-2464 ext. 727
Email: stephen.mercier@xxxxxxxxxxxx
Web: www.attainia.com

Capital equipment lifecycle planning & budgeting solutions for healthcare

On Jun 30, 2015, at 10:49 AM, Tuomas Juntunen wrote:

Hi

I was thinking along the same lines, but it doesn’t solve the underlying problem.

Could you share your setup and how many VMs you are running? That would give us a starting point for sizing our own setup.

Thanks

Br,
Tuomas

From: Stephen Mercier [mailto:stephen.mercier@xxxxxxxxxxxx] 
Sent: 30 June 2015 20:32
To: Tuomas Juntunen
Cc: 'Somnath Roy'; 'ceph-users'
Subject: Re: Very low 4k randread performance ~1000iops

I ran into the same problem. What we did, and have been using since, was to increase the read-ahead buffer in the VMs to 16MB (the sweet spot we settled on after testing). This isn't a solution for all scenarios, but for our use cases it was enough to bring performance in line with expectations.

In Ubuntu, we added the following udev config to facilitate this:

root@ubuntu:/lib/udev/rules.d# vi /etc/udev/rules.d/99-virtio.rules 

SUBSYSTEM=="block", ATTR{queue/rotational}=="1", ACTION=="add|change", KERNEL=="vd[a-z]", ATTR{bdi/read_ahead_kb}="16384", ATTR{queue/read_ahead_kb}="16384", ATTR{queue/scheduler}="deadline"
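To confirm that the rule actually took effect inside a guest, the current value can be read back from sysfs. The Python sketch below is a hypothetical helper, not part of the original setup: it assumes virtio disks appear as `/sys/block/vd[a-z]` (matching the rule's `KERNEL=="vd[a-z]"`), and it takes the sysfs directory as a parameter so it can also be pointed at a test tree.

```python
from pathlib import Path

EXPECTED_READ_AHEAD_KB = 16384  # value the udev rule above is meant to set


def read_ahead_report(sysfs_block: Path = Path("/sys/block")) -> dict:
    """Map each virtio disk (vd[a-z]) to its current queue/read_ahead_kb value."""
    report = {}
    for dev in sorted(sysfs_block.glob("vd[a-z]")):
        ra_file = dev / "queue" / "read_ahead_kb"
        if ra_file.is_file():
            report[dev.name] = int(ra_file.read_text().strip())
    return report


def devices_not_tuned(report: dict, expected: int = EXPECTED_READ_AHEAD_KB) -> list:
    """Return the disks whose read-ahead differs from the expected value."""
    return [name for name, value in report.items() if value != expected]


if __name__ == "__main__":
    report = read_ahead_report()
    for name, value in report.items():
        print(f"{name}: read_ahead_kb={value}")
    missing = devices_not_tuned(report)
    if missing:
        print(f"not at {EXPECTED_READ_AHEAD_KB} KB:", ", ".join(missing))
```

After `udevadm trigger` or a reboot, every `vd*` disk should report 16384. A disk listed as not tuned did not match the rule, for example because it reports `queue/rotational` as 0.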

Cheers,
-- 
Stephen Mercier
Senior Systems Architect
Attainia, Inc.
Phone: 866-288-2464 ext. 727
Email: stephen.mercier@xxxxxxxxxxxx
Web: www.attainia.com

Capital equipment lifecycle planning & budgeting solutions for healthcare

On Jun 30, 2015, at 10:18 AM, Tuomas Juntunen wrote:

Hi

It’s probably not hitting the disks, but that really doesn’t matter. The point is that our VMs are very responsive while writing, and that is what users will see.
The iops we get with sequential reads are good, but the random read iops are far too low.

Is using SSDs as OSDs the only way to raise it, or is there some tunable that would improve it? I would assume Linux caches reads in memory and serves them from there, but at least for now we don’t see that happening.

Br,
Tuomas

From: Somnath Roy [mailto:Somnath.Roy@xxxxxxxxxxx] 
Sent: 30 June 2015 19:24
To: Tuomas Juntunen; 'ceph-users'
Subject: RE: Very low 4k randread performance ~1000iops

Break it down: try fio-rbd to see what performance you are getting.
But I am really surprised you are getting > 100k iops for writes. Did you check that it is actually hitting the disks?
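One way to follow the "break it down" suggestion is fio's `rbd` ioengine, which talks to the cluster through librbd and so takes the guest, its page cache, and QEMU out of the picture. The Python sketch below only generates a job file; the pool name `rbd`, image name `fio_test`, client name `admin`, queue depth, and runtime are placeholder assumptions, fio must have been built with rbd support, and the image must already exist.

```python
def rbd_randread_job(pool: str = "rbd", image: str = "fio_test",
                     client: str = "admin", block_size: str = "4k",
                     iodepth: int = 32, runtime_s: int = 60) -> str:
    """Return a fio job file for random reads against an RBD image via librbd.

    All default values are placeholders; adjust them to the cluster under test.
    """
    lines = [
        "[global]",
        "ioengine=rbd",            # fio's librbd engine (needs fio built with rbd)
        f"clientname={client}",    # cephx user, without the 'client.' prefix
        f"pool={pool}",
        f"rbdname={image}",        # image must already exist in the pool
        "rw=randread",
        f"bs={block_size}",
        "time_based=1",            # run for a fixed time instead of a fixed size
        f"runtime={runtime_s}",
        "",
        f"[rbd_randread_qd{iodepth}]",
        f"iodepth={iodepth}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print the job so it can be redirected to a file, e.g. rbd-randread.fio
    print(rbd_randread_job(), end="")
```

Running `fio` on the saved job file and comparing the result with the in-VM numbers shows whether the ~1k iops limit comes from the cluster itself or from the layers above it.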

Thanks & Regards
Somnath

From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Tuomas Juntunen
Sent: Tuesday, June 30, 2015 8:33 AM
To: 'ceph-users'
Subject: Very low 4k randread performance ~1000iops

Hi

I have been trying to figure out why our 4k random reads in VMs are so bad. I am using fio to test this.

Sequential write: 170k iops
Random write: 109k iops
Sequential read: 64k iops
Random read: 1k iops
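The gap between these numbers is easier to reason about once they are converted into throughput and per-request latency. The Python sketch below does that arithmetic, assuming the 4k block size from the subject line. Latency comes from Little's law (outstanding I/Os = iops × mean latency); because the fio iodepth used here isn't stated, the queue depths in the example are illustrative assumptions.

```python
def throughput_bytes_per_s(iops: float, block_size_bytes: int) -> float:
    """Data rate implied by an iops figure at a fixed block size."""
    return iops * block_size_bytes


def mean_latency_ms(iops: float, queue_depth: int) -> float:
    """Little's law: queue_depth = iops * latency, so latency = queue_depth / iops."""
    return queue_depth * 1000.0 / iops


if __name__ == "__main__":
    rand_read_iops = 1_000
    seq_read_iops = 64_000
    print("random read:", throughput_bytes_per_s(rand_read_iops, 4096) / 1e6, "MB/s")
    print("sequential/random read ratio:", seq_read_iops / rand_read_iops)
    for qd in (1, 32):  # assumed fio iodepth values; not stated in the post
        print(f"QD={qd}: ~{mean_latency_ms(rand_read_iops, qd):.1f} ms per read")
```

At 1k iops, 4 KiB random reads move only about 4 MB/s. At a queue depth of 1 that means roughly 1 ms per read; at a queue depth of 32 it would be about 32 ms per read, which would point at queuing somewhere in the stack rather than raw device latency.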

Our setup is:
3 nodes with 36 OSDs and 18 SSDs (one SSD for every two OSDs); each node has 64GB of memory and 2x 6-core CPUs
4 monitors running on other servers
40Gbit InfiniBand with IPoIB
OpenStack: QEMU-KVM for virtual machines

Any help would be appreciated.

Thank you in advance.

Br,
Tuomas

PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
