Hi Dmitry,

On 07/07/15 14:42, Dmitry Meytin wrote:
> Hi Christian,
> Thanks for the thorough explanation.
> My case is Elastic Map Reduce on top of OpenStack with Ceph backend for everything (block, object, images).
> With the default configuration, performance is 300% worse than bare metal.
> I did a few changes:
> 1) replication settings 2
> 2) read ahead size 2048Kb
> 3) Max sync intervals 10s
> 4) Large queue and large bytes
> 5) OSD OP threads 20
> 6) FileStore Flusher off
> 7) Sync Flush On
> 8) Object size 64 Mb
>
> And still the performance is poor when comparing to bare-metal.

Describing how you test performance against bare metal would help identify whether this is expected behaviour or a configuration problem. If you compare sequential access to individual local disks with Ceph, it's an apples-to-oranges comparison (for example, Ceph RBD isn't optimized for this by default, and I'm not sure how far striping/order/readahead tuning can get you). If you compare random access to 3-way RAID1 devices against random access to RBD devices on pools with size=3, the comparison becomes more relevant.

I didn't see any description of the hardware and network used for Ceph, which might help identify a bottleneck. The Ceph version is missing too.

When you test Ceph performance, does "ceph -s" report HEALTH_OK (if not, this would have a performance impact)? Is any deep-scrubbing going on (this will limit your IO bandwidth, especially if several scrubs happen at the same time)?

> The profiling shows the huge network demand (I'm running terasort) during the map phase.

That's expected with Ceph: your network must have the capacity for your IO targets. Note that if your data is easy to restore, you can get better write performance with size=1 or size=2, depending on the trade-off you want between durability and performance.

> I want to avoid shared-disk behavior of Ceph and I would like VM to read data from the local volume as much as applicable.
> Am I wrong with my assumptions?
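For reference, most of the tunings in your list map onto ceph.conf options like the fragment below. This is a sketch for a filestore-era (2015) cluster; the option names are the standard ones, but the queue values are illustrative, not recommendations:

```ini
[global]
; 1) replication factor 2 for newly created pools
osd pool default size = 2

[osd]
; 5) more OSD op worker threads
osd op threads = 20
; 3) filestore syncs at most every 10 seconds
filestore max sync interval = 10
; 6) disable the filestore flusher
filestore flusher = false
; 7) sync flush on
filestore sync flush = true
; 4) larger op queue and byte limits (example values)
filestore queue max ops = 5000
filestore queue max bytes = 1048576000
```

The remaining two items live elsewhere: readahead (2) is usually set client-side in the guest (e.g. /sys/block/<dev>/queue/read_ahead_kb), and object size (8) is fixed at image creation time ("rbd create --order 26 ..." gives 2^26 = 64 MB objects).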
Yes: Ceph is a distributed storage network; there's no provision for local storage. Note that 10Gbit networks (especially dual 10Gbit) and some tuning should in theory give you plenty of read performance with Ceph (far more than any local disk could provide, except NVMe storage or similar technology). You may be limited by latencies and the read or write patterns of your clients, though; Ceph's total bandwidth is usually reached under heavy concurrent access.

Note that if you use map reduce with a Ceph cluster, you should probably write any intermediate results to local storage instead of Ceph, as Ceph doesn't bring any real advantage for them. The only data you should store on Ceph is what you want to keep after the map reduce: probably the initial input, and the final output if it is meant to be kept.

Best regards,

Lionel
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
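A sketch of the "intermediate results on local storage" advice for a Hadoop 2.x cluster: point the MapReduce local (shuffle/spill) directories at local disks rather than Ceph-backed volumes. The property name is the standard Hadoop 2.x one (mapred.local.dir on Hadoop 1.x); the paths are hypothetical examples:

```xml
<!-- mapred-site.xml: keep intermediate map output on local disks -->
<property>
  <name>mapreduce.cluster.local.dir</name>
  <!-- example local-disk mount points; adjust to your nodes -->
  <value>/mnt/local1/mapred,/mnt/local2/mapred</value>
</property>
```

Only the job input and the final output then touch Ceph, which is the split suggested above.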