Re: FW: Ceph data locality

On 07-07-15 14:42, Dmitry Meytin wrote:
> Hi Christian,
> Thanks for the thorough explanation.
> My case is Elastic Map Reduce on top of OpenStack with Ceph backend for everything (block, object, images).
> With the default configuration, performance is 300% worse than bare metal.
> I did a few changes:
> 1) replication size set to 2
> 2) read-ahead size 2048 KB
> 3) max sync interval 10 s
> 4) larger queue ops and queue bytes
> 5) OSD op threads 20
> 6) FileStore flusher off
> 7) sync flush on
> 8) object size 64 MB
> 
> And still the performance is poor compared to bare metal.
> Profiling shows huge network demand during the map phase (I'm running terasort).
> I want to avoid the shared-disk behavior of Ceph; I would like the VM to read data from the local volume as much as possible.
> Am I wrong in my assumptions?
> 

Nothing is free in this world. Keep that in mind.

You can't have a fully distributed system that replicates all data
synchronously over a network and not suffer a performance impact.

Yes, Ceph still has room to improve, but all the data has to go over
the network and be written to all disks before it is acknowledged.

Wido
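
For reference, the eight changes Dmitry lists above map roughly onto ceph.conf options along the lines of the sketch below. This is only an illustration: option names and defaults vary by Ceph release, some of the settings (object size, read-ahead) are configured on the RBD image or client rather than on the OSDs, and the values are simply the ones quoted in the post, not recommendations.

    [global]
    osd pool default size = 2             # 1) two replicas instead of three

    [osd]
    filestore max sync interval = 10      # 3) max sync interval of 10 s
    filestore queue max ops = 5000        # 4) larger queue ops (illustrative value)
    filestore queue max bytes = 1073741824    # 4) larger queue bytes (illustrative value)
    osd op threads = 20                   # 5) OSD op threads
    filestore flusher = false             # 6) FileStore flusher off
    filestore sync flush = true           # 7) sync flush on (if this is the option meant)

    [client]
    rbd readahead max bytes = 2097152     # 2) 2048 KB read-ahead, assuming RBD readahead is meant

    # 8) A 64 MB object size is set per RBD image at creation time, e.g.:
    #    rbd create vol1 --size 102400 --order 26    (2^26 bytes = 64 MB objects)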


> Thank you very much,
> Dmitry
> 
> -----Original Message-----
> From: Christian Balzer [mailto:chibi@xxxxxxx] 
> Sent: 07 July 2015 15:25
> To: ceph-users@xxxxxxxx
> Cc: Dmitry Meytin
> Subject: Re:  FW: Ceph data locality
> 
> 
> Hello,
> 
> On Tue, 7 Jul 2015 11:45:11 +0000 Dmitry Meytin wrote:
> 
>> I think it's essential for huge data clusters to deal with data locality.
>> Even a very expensive network stack (100 Gb/s) will not mitigate the 
>> problem if you need to move petabytes of data many times a day. Maybe 
>> there is some workaround for the problem?
>>
> Apples, oranges. 
> Not every problem is a nail, even if your preferred tool is a hammer.
> 
> Ceph is a _distributed_ storage system. 
> Data locality is not one of its design goals; a SAN or NAS isn't really "local" to any client either.
> 
> The design is to have completely independent storage nodes talking to clients.
> 
> And if you have petabytes of data each day, there is no way to have all that data local.
> Never mind that by default you will have 3 replicas, so 2 other nodes will have to receive that data over the network anyway. 
> And the write won't be complete until all replicas have been written.
> 
> If your scale allows it, or if you can enforce a split on a layer above the storage into smaller segments, DRBD will give you local reads. 
> Of course that IS limited to the amount of disk space per node.
> 
> Ceph, on the other hand, can scale massively, and performance increases with each additional OSD and storage node. 
> 
> That all being said, if you had googled a bit or read all the documentation, you would have found references to primary affinity,
> and other methods to get some form of data locality for use with Hadoop.
> None of which are particularly good or scalable. 
> 
> Would locality be nice in some use cases? 
> Hell yeah, but not at the cost of other, much more pressing issues.
> Like the ability for Ceph to actually repair itself w/o human intervention and a magic 8 ball. 
> 
> Christian
> 
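The primary affinity that Christian mentions above is set per OSD via the monitors. A minimal sketch of how it could be used (the OSD id and the idea of down-weighting "remote" OSDs are purely illustrative):

    # Allow the monitors to honour primary affinity (needed on older releases):
    #   mon osd allow primary affinity = true    (in ceph.conf)
    # Then make osd.3 less likely to be chosen as primary (weight 0..1),
    # so reads tend to be served by the remaining, e.g. more local, OSDs:
    ceph osd primary-affinity osd.3 0

Note that this only biases which replica acts as primary; it does not pin data to the client's hypervisor.
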
>>
>>
>> From: Van Leeuwen, Robert [mailto:rovanleeuwen@xxxxxxxx]
>> Sent: Tuesday, July 07, 2015 12:59 PM
>> To: Dmitry Meytin
>> Subject: Re:  Ceph data locality
>>
>>> I need help configuring clients to write data to the primary OSD 
>>> on the local server. I see a lot of network traffic when a VM tries to 
>>> read data which was written by that same VM. What I expect is 
>>> for the VM to read data from the local machine as the first replica of the data.
>>> How do I configure the CRUSH rules to make that happen?
>>
>> This functionality is not in Ceph.
>> Ceph has no notion of locality: faster "local" nodes vs. slower 
>> "remote" nodes. The only thing you can configure is a failure domain, 
>> which just makes sure the data is properly spread across the DC.
>>
>> Cheers,
>> Robert van Leeuwen
>>
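The failure domain Robert describes lives in the CRUSH rule, not in any notion of "nearness". A minimal sketch of a replicated rule that spreads replicas across hosts (rule name and ruleset number are only examples):

    rule replicated_across_hosts {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take default
        # one replica per host; "host" is the failure domain here,
        # change to "rack" to spread replicas across racks instead
        step chooseleaf firstn 0 type host
        step emit
    }

This controls how far apart replicas land, but it cannot express "prefer the OSD on the same hypervisor as the client".
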
> 
> 
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


