On Wed, Nov 29, 2017 at 6:54 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> On Wed, Nov 29, 2017 at 8:52 AM Aristeu Gil Alves Jr <aristeu.jr@xxxxxxxxx>
> wrote:
>>>
>>> > Do S3 or Swift (for Hadoop or Spark) have integrated data-layout
>>> > APIs for processing data locally, as the CephFS Hadoop plugin has?
>>> >
>>> With S3 and Swift you won't have data locality, as they were designed
>>> for the public cloud.
>>> We recommend disabling locality-based scheduling in Hadoop when running
>>> with those connectors.
>>> There is ongoing work to optimize those connectors to work with
>>> object storage.
>>> The Hadoop community is working on the s3a connector.
>>> There is also https://github.com/SparkTC/stocator which is a Swift-based
>>> connector IBM wrote for their cloud.
>>
>> Assuming these cases, how would a MapReduce process work without data
>> locality?
>> How do the processors get the data? There's still the need to split the
>> data, no?
>> Doesn't it severely impact performance for big files (not just the
>> network)?
>>
>
> Given that you already have your data in CephFS (and have been using it
> successfully for two years!), I'd try using its Hadoop plugin and seeing if
> it suits your needs. Trying a less-supported plugin is a lot easier than
> rolling out a new storage stack! :)

completely agree :)

> -Greg

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com