Re: FW: Ceph data locality

Hi Lionel,
Thanks for the answer.
The missing info:
1) Ceph 0.80.9 "Firefly"
2) MapReduce performs sequential reads in blocks of 64 MB (or 128 MB)
3) HDFS, which runs on top of Ceph, replicates data 3 times across VMs that may be located on the same physical host or on different hosts
4) Network is a 10 Gb/s NIC (but MTU is only 1500), Open vSwitch 2.3.1
5) Ceph health is OK
6) Servers are Dell R720 with 128 GB RAM and 2x 2 TB SATA disks, plus one SSD for Ceph journaling

I'm testing Hadoop TeraSort with 100 GB / 500 GB / 1 TB / 10 TB of data.
The more data and the bigger the cluster, the worse the performance gets.
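
For reference, the runs use the stock TeraGen/TeraSort examples, roughly like this (the jar path and HDFS directories are just illustrative; TeraGen rows are 100 bytes each, so 10^9 rows is ~100 GB):

  # generate ~100 GB of input (1,000,000,000 rows x 100 bytes)
  hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
      teragen 1000000000 /benchmarks/teragen-100g
  # sort it and validate the output
  hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
      terasort /benchmarks/teragen-100g /benchmarks/terasort-100g
  hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
      teravalidate /benchmarks/terasort-100g /benchmarks/teravalidate-100g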

Any ideas how to improve it?

Thank you very much,
Dmitry


-----Original Message-----
From: Lionel Bouton [mailto:lionel+ceph@xxxxxxxxxxx] 
Sent: Tuesday, July 07, 2015 6:07 PM
To: Dmitry Meytin
Cc: ceph-users@xxxxxxxxxxxxxx
Subject: Re:  FW: Ceph data locality

Hi Dmitry,

On 07/07/15 14:42, Dmitry Meytin wrote:
> Hi Christian,
> Thanks for the thorough explanation.
> My case is Elastic Map Reduce on top of OpenStack with Ceph backend for everything (block, object, images).
> With default configuration, performance is 300% worse than bare metal.
> I made a few changes:
> 1) replication settings 2
> 2) read ahead size 2048 KB
> 3) Max sync intervals 10s
> 4) Large queue and large bytes
> 5) OSD OP threads 20
> 6) FileStore Flusher off
> 7) Sync Flush On
> 8) Object size 64 MB
>
> And still the performance is poor compared to bare-metal.
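
For reference, the changes you list would normally end up in ceph.conf roughly as below; the option names are my guess based on Firefly-era settings, so adjust them to whatever you actually set:

  [global]
  osd pool default size = 2                # 1) replication x2

  [osd]
  filestore max sync interval = 10         # 3) max sync interval 10 s
  filestore queue max ops = 500            # 4) "large queue" (value is a placeholder)
  filestore queue max bytes = 1048576000   # 4) "large bytes" (placeholder)
  osd op threads = 20                      # 5)
  filestore flusher = false                # 6)
  filestore sync flush = true              # 7)

  # 2) the 2048 KB readahead is usually a client/guest-side setting
  #    (e.g. /sys/block/vdX/queue/read_ahead_kb inside the VM), not ceph.conf
  # 8) the 64 MB object size is set when the image/filesystem layout is
  #    created, not in ceph.conf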

Describing how you test performance on bare metal would help identify whether this is expected behavior or a configuration problem. If you compare sequential access to individual local disks against Ceph, it's an apples-to-oranges comparison (for example, Ceph RBD isn't optimized for this by default, and I'm not sure how far striping/order/readahead tuning can get you). If you compare random access to 3-way RAID1 devices against random access to RBD devices on pools with size=3, the comparison becomes more relevant.

I didn't see any description of the hardware and network used for Ceph which might help identify a bottleneck. The Ceph version is missing too.

When you test Ceph performance, is ceph -s reporting HEALTH_OK (if not, this would have a performance impact)? Is there any deep-scrubbing going on (this will limit your IO bandwidth, especially if several scrubs happen at the same time)?
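
Something like the following is usually enough to check; the noscrub flags are only meant as a short-lived experiment, so remember to unset them afterwards:

  ceph -s              # overall state, scrubbing PGs show up in the pg states
  ceph health detail   # more detail if the cluster is not HEALTH_OK

  # temporarily rule scrubbing out as the cause, then re-enable it
  ceph osd set noscrub
  ceph osd set nodeep-scrub
  # (run the benchmark)
  ceph osd unset noscrub
  ceph osd unset nodeep-scrub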

> The profiling shows the huge network demand (I'm running terasort) during the map phase.

It's expected with Ceph: your network needs enough capacity for your IO targets. Note that if your data is easy to restore, you can get better write performance with size=1 or size=2, depending on the trade-off you want between durability and performance.
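
For example (the pool name is a placeholder, and keep in mind that with size=1 a single disk failure loses data):

  ceph osd pool get <pool-name> size
  ceph osd pool set <pool-name> size 2
  ceph osd pool set <pool-name> min_size 1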

> I want to avoid the shared-disk behavior of Ceph and I would like the VM to read data from a local volume as much as possible.
> Am I wrong in my assumptions?

Yes: Ceph is a distributed storage system, and there is no provision for local storage. Note that 10Gbit networks (especially dual 10Gbit) and some tuning should in theory give you plenty of read performance with Ceph (far more than any local disk could provide, except NVMe storage or similar tech). You may be limited by latencies and the read or write patterns of your clients, though. Ceph's total bandwidth is usually only reached under heavy concurrent access.

Note that if you use map reduce with a Ceph cluster, you should probably write intermediate results to local storage instead of Ceph, as Ceph doesn't bring any real advantage for them (the only data you should store on Ceph is what you want to keep after the map reduce: probably the initial input, and the final output if it is meant to be kept).
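
If the Hadoop side doesn't already do this, the intermediate/shuffle directories are controlled by the node-local dir settings, something like the following (property names are from stock Hadoop 2.x; the paths are just examples of a disk that is local to the node rather than an RBD volume):

  <!-- yarn-site.xml -->
  <property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/mnt/ephemeral/yarn-local</value>
  </property>

  <!-- mapred-site.xml -->
  <property>
    <name>mapreduce.cluster.local.dir</name>
    <value>/mnt/ephemeral/mapred-local</value>
  </property>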

Best regards,

Lionel
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


