Re: Cephfs Hadoop Plugin and CEPH integration

Orit Wasserman <owasserm@xxxxxxxxxx> · Wed, 29 Nov 2017 19:42:57 +0200



On Wed, Nov 29, 2017 at 6:52 PM, Aristeu Gil Alves Jr
<aristeu.jr@xxxxxxxxx> wrote:
>> > Do s3 or swift (for Hadoop or Spark) have integrated data-layout APIs
>> > for processing data locally, as the cephfs hadoop plugin has?
>> >
>> With s3 and swift you won't have data locality, as they were designed
>> for the public cloud.
>> We recommend disabling locality-based scheduling in Hadoop when running
>> with those connectors.
>> There is ongoing work to optimize those connectors to work with
>> object storage.
>> The Hadoop community is working on the s3a connector.
>> There is also https://github.com/SparkTC/stocator, a swift-based
>> connector IBM wrote for their cloud.
>
>
>
> In that case, how would a MapReduce job work without data locality?
> How do the processors get the data? There is still a need to split the
> data, no?
The s3/swift storage splits the data.
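
A rough way to picture this, as a sketch only and not the actual connector
code: assume each split is just a byte range of an object, and each task
fetches its own range over the network with an HTTP Range request. The
function names and the 128 MiB split size below are hypothetical, chosen
for illustration.

```python
from typing import List, Tuple


def byte_range_splits(object_size: int, split_size: int) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) byte ranges that cover an object.

    Hypothetical helper: real connectors derive splits from their own
    configuration; this only illustrates the idea.
    """
    if object_size < 0 or split_size <= 0:
        raise ValueError("object_size must be >= 0 and split_size must be > 0")
    return [
        (start, min(start + split_size, object_size) - 1)
        for start in range(0, object_size, split_size)
    ]


def range_header(start: int, end: int) -> str:
    """Format the HTTP Range header value a task would send for its split."""
    return f"bytes={start}-{end}"


if __name__ == "__main__":
    # Assumed sizes: a 300 MiB object read in 128 MiB splits.
    mib = 1024 * 1024
    for start, end in byte_range_splits(300 * mib, 128 * mib):
        print(range_header(start, end))
```

Any worker can read any range, so no task depends on where the bytes are
physically stored; the cost is that every read goes over the network.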

> Doesn't it severely impact the performance of big files (not just the
> network)?
>
There is a Facebook research paper showing that locality is not as
beneficial as expected; if I remember correctly, it was around 30%.
Users who run Hadoop on s3/swift are typically already using object
storage (for other purposes) or have a very large data set that fits
object storage better.

> --
> Aristeu
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com