On Wed, Nov 29, 2017 at 6:52 PM, Aristeu Gil Alves Jr <aristeu.jr@xxxxxxxxx> wrote:
>> > Do s3 or swifta (for Hadoop or Spark) have integrated data-layout APIs
>> > for processing data locally, as the CephFS Hadoop plugin has?
>> >
>> With s3 and swift you won't have data locality, as they were designed
>> for public cloud.
>> We recommend disabling locality-based scheduling in Hadoop when running
>> with those connectors.
>> There is ongoing work to optimize those connectors to work with
>> object storage.
>> The Hadoop community works on the s3a connector.
>> There is also https://github.com/SparkTC/stocator, which is a swift-based
>> connector IBM wrote for their cloud.
>
> Assuming these cases, how would a MapReduce process work without data
> locality? How do the processors get the data? There is still the need to
> split the data, no? The s3/swift storage splits the data.
> Doesn't it severely impact the performance of big files (not just the
> network)?
>

There is a Facebook research paper showing that locality is not as
beneficial as expected; if I remember correctly, it was around 30%.
The users that run Hadoop on s3/swift are either already using object
storage (for other purposes) or have a very large data set that fits
object storage better.

> --
> Aristeu

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
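
To make the question about splitting without locality more concrete, here is a minimal sketch in Python. It is not the code of the s3a, swift, or stocator connectors. It only illustrates the general idea that the compute side can cut an object into logical byte ranges and that any worker can then fetch its range over the network with an HTTP ranged read. The function names plan_splits and range_header and the 32 MiB split size are assumptions chosen for the example.

```python
from typing import List, Tuple


def plan_splits(object_size: int, split_size: int) -> List[Tuple[int, int]]:
    """Cut an object of object_size bytes into (offset, length) splits.

    Hypothetical helper: the splits are purely logical byte ranges,
    so they do not depend on where the object store keeps the data.
    """
    if object_size < 0 or split_size <= 0:
        raise ValueError("object_size must be >= 0 and split_size > 0")
    splits = []
    offset = 0
    while offset < object_size:
        # The last split may be shorter than split_size.
        length = min(split_size, object_size - offset)
        splits.append((offset, length))
        offset += length
    return splits


def range_header(offset: int, length: int) -> str:
    """Build the HTTP Range header value for one split (end byte is inclusive)."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length > 0")
    return f"bytes={offset}-{offset + length - 1}"


if __name__ == "__main__":
    mib = 1024 * 1024
    # Example: a 100 MiB object read in 32 MiB splits (sizes are assumptions).
    for offset, length in plan_splits(100 * mib, 32 * mib):
        print(range_header(offset, length))
```

Since every split is read remotely, no node is a better place than another to run a given task, which is why waiting for a "local" node in the scheduler brings no benefit with these connectors.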