Re: under performing osd, where to look ?

Mark Nelson <mark.nelson@xxxxxxxxxxx> · Mon, 08 Apr 2013 07:55:41 -0500

On 04/08/2013 01:09 AM, Matthieu Patou wrote:
On 04/01/2013 11:26 PM, Matthieu Patou wrote:
On 04/01/2013 05:35 PM, Mark Nelson wrote:
On 03/31/2013 06:37 PM, Matthieu Patou wrote:
Hi,

I was doing some testing with iozone and found that the performance of an
exported rbd volume was about 1/3 of the performance of the hard drives.
I was expecting a performance penalty, but not such a large one.
I suspect something is not correct in the configuration but I can't
tell what exactly.

A couple of things:

1) Were you testing writes?  If so, were your journals on the same
disks as the OSDs?  Ceph writes a full copy of the data to the
journal currently which means you are doing 2 writes for every 1.
This is why some people use fast SSDs for 3-4 journals each.  On the
other hand, that means that losing a journal causes more data
replication and you may lose read performance and capacity if the SSD
is taking up a spot that an OSD would have otherwise occupied.
Oh right, no I don't, I'll try to do a quick test with a ramdrive
(this is just test data, so I can afford to lose it).
I guess that one SSD can stand the load of 4 HDDs, maybe more, but you're
right: if you try to go too far (i.e. maybe 20+ OSDs) then the SSD might
become your new bottleneck.
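
A back-of-the-envelope way to reason about the journal placement discussed in point 1 is sketched below. It is only a rough model: the helper name and the bandwidth figures are made up for illustration, and it assumes every byte is written once to the journal and once to the data store while ignoring seeks, so a co-located journal on a spinning disk will usually cost more than the factor of two shown here.

```python
def osd_write_ceiling(hdd_mb_s, journal="colocated", ssd_mb_s=None, osds_per_ssd=1):
    """Rough upper bound on sustained write throughput per OSD, in MB/s.

    Hypothetical helper for illustration; not part of any Ceph tooling.
    """
    if journal == "colocated":
        # Journal and data share one spindle: two writes per client write.
        return hdd_mb_s / 2.0
    if journal == "ssd":
        # The SSD's bandwidth is shared by all the journals it hosts.
        return min(float(hdd_mb_s), ssd_mb_s / osds_per_ssd)
    raise ValueError("journal must be 'colocated' or 'ssd'")


# Illustrative numbers only, not measurements from this thread.
print(osd_write_ceiling(100, "colocated"))     # 50.0
print(osd_write_ceiling(100, "ssd", 400, 4))   # 100.0
print(osd_write_ceiling(100, "ssd", 400, 20))  # 20.0
```

With these assumed figures, 4 journals per SSD leave the hard drive as the limit, while 20 journals per SSD make the SSD the bottleneck.
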
Ok, I did my homework and moved the journal away. It improved the
performance but I'm still at ~1/3 of the performance of iozone on the hard
drive.

I also briefly added a ramdisk to ceph and benchmarked it and got much
better perf (~180MB/s) but it's still far from the perf of the same
ramdisk benchmarked on one of the OSD hosts.

I'm still convinced that something is not going well because if I run:
rados -p rbd bench 300 write -t 4

I can reach 90% of the performance of the benchmark. The client is
running a stock Ubuntu 12.10 kernel:

Linux builder01 3.5.0-23-generic #35-Ubuntu SMP Thu Jan 24 13:15:40 UTC
2013 x86_64 x86_64 x86_64 GNU/Linux

2) Are you factoring in any replication being done?
What do you mean?
To my understanding, when osd0 receives the data it should send
it directly to osd1, or will it write it and then read it back?

I.e., if you are doing write tests, is the pool you are writing to set to
use 2x replication (i.e. the default)?  For performance benchmarking I
usually set it to 1x replication (i.e. no replication) unless I am
explicitly testing a replication scenario.
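
To make the replication point concrete, here is a minimal sketch of how the pool's replica count scales the write bandwidth a client can see. The function name and the 50 MB/s per-OSD figure are assumptions for illustration; only the 2-OSD count comes from the setup described in this thread, and the model ignores network limits.

```python
def client_write_ceiling(osd_count, per_osd_mb_s, replicas):
    """Aggregate client-visible write bandwidth (MB/s) when every object
    is stored `replicas` times across the OSDs.

    Hypothetical helper; assumes writes spread evenly over all OSDs.
    """
    if replicas < 1:
        raise ValueError("replicas must be >= 1")
    return osd_count * per_osd_mb_s / replicas


# 2 OSDs as in this thread; 50 MB/s per OSD is an assumed figure.
print(client_write_ceiling(2, 50, replicas=1))  # 100.0
print(client_write_ceiling(2, 50, replicas=2))  # 50.0
```

On the releases I am aware of, the replica count is changed with `ceph osd pool set <pool> size <n>`; check the documentation for your version before relying on that.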

3) Do you have enough concurrent operations to hide latency? You may
want to play with different io depth values in iozone and see if that
helps.  You may also want to try using multiple clients at the same
time.
Well, latency might be a problem with random IO, but for sequential reads
and writes it shouldn't be: client and OSDs are on the same LAN.
I tried to play with the number of clients in the throughput mode of
iozone without much success, I always get the same throughput (with
5-10% variation).

I made a test by creating 2 rbd volumes and starting one iozone per
mount point, the result is that the aggregated performance is the same.

Ok.  Probably not enough drives in your setup for it to really matter. 
On some tests I ran a little while back with more OSDs, it was important 
to have lots of concurrency to reach high speeds.  For 4MB writes, I 
could hit a little over 700MB/s on 1 volume with fio using an iodepth >
8.  With 16 volumes I could hit closer to about 1.4GB/s no matter the
iodepth.
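
The relationship between concurrency and throughput described here follows Little's law: the number of requests in flight, times the request size, divided by the per-request latency. The sketch below is a simplified model, not a measurement; the 50 ms latency is an assumed figure and the function name is hypothetical. The 700 MB/s ceiling is borrowed from the single-volume number above purely as an example cap.

```python
def throughput_mb_s(io_depth, io_size_mb, latency_ms, ceiling_mb_s=None):
    """Estimate throughput (MB/s) from requests in flight via Little's law.

    Hypothetical helper; `ceiling_mb_s` models whatever hardware limit
    is hit once enough requests are outstanding.
    """
    estimate = io_depth * io_size_mb * 1000 / latency_ms
    if ceiling_mb_s is not None:
        estimate = min(estimate, ceiling_mb_s)
    return estimate


# 4 MB writes with an assumed 50 ms per-request latency.
print(throughput_mb_s(1, 4, 50))  # 80.0
print(throughput_mb_s(8, 4, 50))  # 640.0
print(throughput_mb_s(32, 4, 50, ceiling_mb_s=700.0))  # 700.0
```

Once the estimate reaches the ceiling, adding more depth or more clients no longer helps, which is one possible reading of a setup where extra clients do not change the aggregate.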

4) If you are using QEMU/KVM, is rbd cache enabled?  Also, are you
using the virtio driver?  It's significantly faster than the ide driver.

No, for the moment I was just reviewing the pure rbd layer.
I just read about the rbd cache, which I'm not using. Of course it could
give a boost, but that's not what I'm trying to measure right now.

Understood.  Until very recently there was also a limitation with rbd 
cache that caused per-volume throughput to be capped at around 200MB/s 
as well.

5) There's always going to be a certain amount of overhead.  I have a
node with 24 drives and all journals on SSDs.  Theoretically the
drives can do about 3.4GB/s aggregate just doing fio directly to the
disks. With EXT4/BTRFS I can do anywhere from 2.2-2.6GB/s with RADOS
bench writing objects out directly to ceph.  With XFS, that drops to
~1.9GB/s max.  Now throw QEMU/KVM RBD on and performance drops to
about 1.5GB/s.  Each layer of software introduces a little more
overhead.  Having said that, 1.5GB/s out of a ~$10-12k node using a
system that can grow on demand and replicate the way ceph can is
fantastic imho, and every major release so far is seeing big
performance improvements.
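
For reference, the figures quoted in point 5 can be turned into a fraction of raw disk throughput per layer. The numbers below are the ones given above for the 24-drive node; the helper name is made up for this sketch.

```python
RAW_GB_S = 3.4  # fio directly to the disks, aggregate

# (layer, GB/s) as quoted in point 5
LAYERS = [
    ("RADOS bench, EXT4/BTRFS (low)", 2.2),
    ("RADOS bench, EXT4/BTRFS (high)", 2.6),
    ("RADOS bench, XFS", 1.9),
    ("QEMU/KVM RBD", 1.5),
]


def fraction_of_raw(gb_s, raw_gb_s=RAW_GB_S):
    """Share of the raw disk throughput that survives at a given layer."""
    return gb_s / raw_gb_s


for label, gb_s in LAYERS:
    print(f"{label:<32}{gb_s:>4.1f} GB/s  {fraction_of_raw(gb_s):.0%} of raw")
```

This works out to roughly 65-76% for EXT4/BTRFS, 56% for XFS and 44% with QEMU/KVM RBD on top.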

I did the test with XFS in both cases, so the cost of the FS should be
the same in both. Maybe I have to consider the barrier setting on XFS.

Did the test with ext2 (as the FS of the formatted rbd) and didn't notice
much difference.

This has more to do with the filesystem of the underlying OSDs.  The 
kind of workload that iozone does is different than the workload that is 
going to result from Ceph.  Specifically, there's overhead involved in 
how the objects get laid out on the filesystem and metadata that gets 
stored.  The way that data gets journaled in Ceph also plays a role in 
the resulting write performance (it's different for BTRFS and
XFS/EXT4), and there are more opportunities for IOs to be broken
up along the way.  Different filesystems handle all of these things in 
different ways and may perform better or worse.

Thanks for the hints. In any case, I'll do further investigations.

Some general thoughts:

- Verify 1x replication pool.
- Check PGs per pool.
- Your networks are only 1GbE.  What does your performance look like 
relative to your network speed?  Have you verified that the traffic is 
going through the right interfaces?
- Use collectl and check the admin socket to see if IOs are getting hung 
up on any specific OSD(s).
- Potentially test btrfs or ext4 on the underlying OSDs.
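
For the network item in the list above, a quick sanity check is to compare the measured throughput with the line rate of the link. This is a minimal sketch: the function names are hypothetical, the 40 MB/s figure is an assumed example rather than a result from this thread, and the ceiling is the raw line rate with no allowance for protocol overhead.

```python
def link_ceiling_mb_s(link_gbit_s):
    """Raw line rate of a link in MB/s (decimal units, no protocol overhead)."""
    return link_gbit_s * 1000 / 8


def fraction_of_link(measured_mb_s, link_gbit_s):
    """How much of the link's raw line rate a measured throughput uses."""
    return measured_mb_s / link_ceiling_mb_s(link_gbit_s)


print(link_ceiling_mb_s(1))                # 125.0
print(f"{fraction_of_link(40, 1):.0%}")    # 32%
```

If the measured number sits close to the ceiling, the network is the limit; if it is far below, the bottleneck is more likely elsewhere.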

Matthieu.
Mark

My test setup consists of 2 OSDs with one drive each; they have 2 networks
(public/private).
On osd0 and osd1 I'm using a BCM5751 gigabit card for the public
network
and an Intel 82541PI for the private one.
On the client I'm using a RTL8111/8168B for the public network.

Osd0 & Osd1 are connected directly with a cable for the private network
and they are connected through one switch for the public network.

Where could the bottleneck be located?

Matthieu.

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com