Hi Greg,
     Thanks. Can you explain more about "Ceph *does* export locations so
the follow-up jobs can be scheduled appropriately"?

Thanks,
Johnu

On 9/8/14, 12:51 PM, "Gregory Farnum" <greg@xxxxxxxxxxx> wrote:

>On Thu, Sep 4, 2014 at 12:16 AM, Johnu George (johnugeo)
><johnugeo@xxxxxxxxx> wrote:
>> Hi All,
>>          I was reading more on Hadoop over Ceph. I heard from Noah that
>> tuning of Hadoop on Ceph is going on. I am just curious to know if there
>> is any reason to keep the default object size as 64MB. Is it because it
>> becomes difficult to encode getBlockLocations if blocks are divided into
>> objects, and to choose the best location for tasks if no node in the
>> system has a complete block?
>
>We used 64MB because it's the HDFS default and in some *very* stupid
>tests it seemed to be about the fastest. You could certainly make it
>smaller if you wanted, and it would probably work to multiply it by
>2-4x, but then you're using bigger objects than most people do.
>
>> I see that Ceph doesn't place objects considering the client location or
>> the distance between the client and the OSDs where data is stored
>> (data locality), while data locality is the key idea behind HDFS block
>> placement and retrieval for maximum throughput. So, how does Ceph plan
>> to perform better than HDFS, given that Ceph relies on random placement
>> using hashing unlike HDFS block placement? Can someone also point out
>> some performance results comparing Ceph's random placement vs HDFS's
>> locality-aware placement?
>
>I don't think we have any serious performance results; there hasn't
>been enough focus on productizing it for that kind of work.
>Anecdotally I've seen people on social media claim that it's as fast
>or even many times faster than HDFS (I suspect if it's many times
>faster they had a misconfiguration somewhere in HDFS, though!).
>In any case, Ceph has two plans for being faster than HDFS:
>1) big users indicate that always writing locally is often a mistake
>and it tends to overfill certain nodes within your cluster. Plus,
>networks are much faster now so it doesn't cost as much to write over
>it, and Ceph *does* export locations so the follow-up jobs can be
>scheduled appropriately.
>
>>
>> Also, Sage wrote about a way to specify a node to be primary for
>> Hadoop-like environments.
>> (http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/1548) Is
>> this through the primary affinity configuration?
>
>That mechanism ("preferred" PGs) is dead. Primary affinity is a
>completely different thing.
>
>
>On Thu, Sep 4, 2014 at 8:59 AM, Milosz Tanski <milosz@xxxxxxxxx> wrote:
>> QFS, unlike Ceph, places the erasure coding logic inside the client,
>> so it's not an apples-to-apples comparison. But I think you get my
>> point, and it would be possible to implement a rich Ceph
>> (filesystem/hadoop) client like this as well.
>>
>> In summary, if Hadoop on Ceph is a major priority, I think it would be
>> best to "borrow" the good ideas from QFS and implement them in the
>> Hadoop Ceph filesystem and Ceph itself (letting a smart client get
>> chunks directly, write chunks directly). I don't doubt that it's a lot
>> of work, but the results might be worth it in terms of the performance
>> you get for the cost.
>
>Unfortunately, implementing CephFS on top of RADOS' EC pools is going
>to be a major project which we haven't done anything to scope out yet,
>so it's going to be a while before that's really an option. But it is
>a "real" filesystem, so we still have that going for us. ;)
>-Greg
>Software Engineer #42 @ http://inktank.com | http://ceph.com
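
To make the "Ceph *does* export locations" point above concrete, here is a
minimal sketch (not from the thread) of how a Hadoop client or job scheduler
could ask a Ceph-backed Hadoop FileSystem where a file's blocks live, using
only the standard org.apache.hadoop.fs API. The ceph:// default filesystem
and the assumption that the CephFS Hadoop bindings answer
getFileBlockLocations() with OSD hostnames are illustrative assumptions, not
something the thread confirms.

    // Hypothetical sketch: query block locations and the reported block
    // (object) size from whatever FileSystem fs.defaultFS points at,
    // which is assumed here to be a CephFS-backed implementation.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlockLocations {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Assumes core-site.xml sets fs.defaultFS to a ceph:// URI
            // served by the CephFS Hadoop plugin.
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path(args[0]);
            FileStatus status = fs.getFileStatus(file);

            // Ties back to the 64MB discussion: the block size the
            // FileSystem reports for this file.
            System.out.printf("reported block size: %d%n",
                status.getBlockSize());

            // The locations the FileSystem exports for each block; a
            // Ceph-backed implementation can fill these with the hosts
            // holding the underlying objects.
            BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());

            for (BlockLocation b : blocks) {
                System.out.printf("offset=%d len=%d hosts=%s%n",
                    b.getOffset(), b.getLength(),
                    String.join(",", b.getHosts()));
            }
        }
    }

MapReduce's input-split machinery uses this same getFileBlockLocations() call
to attach host hints to splits, which is why a filesystem that exports
locations lets follow-up jobs be scheduled near their data even without
HDFS-style always-local writes.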