Re: Cephfs Hadoop Plugin and CEPH integration

On Mon, Nov 27, 2017 at 12:55 PM Aristeu Gil Alves Jr <aristeu.jr@xxxxxxxxx> wrote:
> Hi.
>
> It's my first post on the list. First of all, I have to say I'm new to Hadoop.
>
> We are a small lab and have been running CephFS for almost two years, loading it with large files (4GB to 4TB in size). Our cluster is approximately 400TB at ~75% usage, and we are planning to grow a lot.
>
> Until now, we have processed most of the files the "serial reading" way. Now we want to process these files in parallel, and we are looking at the Hadoop plugin as a way to use MapReduce or something similar.
>
> Does the Hadoop plugin access CephFS over the network as a normal cluster, or can I install Hadoop's processors on every Ceph node and process the data locally?

The Hadoop plugin both
1) accesses CephFS over the network as a normal client+cluster,
2) is fully integrated with the data-layout APIs, so if you install Hadoop on the Ceph nodes it will generally schedule work on the primary OSD for the data chunk in question.
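
As a rough illustration of what "data layout" means from the client side, the sketch below maps a byte offset in a file to the name of the RADOS object that holds it. It assumes the simplest layout (stripe_count=1, so each object is one contiguous piece of the file) and the usual CephFS data-object naming of hex inode number, a dot, and an 8-digit hex object index. The function name and the example inode are made up for illustration; this is not the plugin's code.

```python
MIB = 1024 * 1024


def object_name(inode: int, offset: int, object_size: int = 4 * MIB) -> str:
    """Return the data-object name holding `offset` of the file `inode`.

    Assumes stripe_count=1, i.e. object N holds bytes
    [N * object_size, (N + 1) * object_size) of the file.
    """
    if offset < 0:
        raise ValueError("offset must be non-negative")
    if object_size <= 0:
        raise ValueError("object_size must be positive")
    index = offset // object_size
    # Hex inode, a dot, then the object index as 8 hex digits.
    return f"{inode:x}.{index:08x}"


if __name__ == "__main__":
    example_inode = 0x10000000000  # hypothetical inode number
    for off in (0, 4 * MIB, 64 * MIB):
        print(off, object_name(example_inode, off))
```

A name produced this way can be passed to `ceph osd map <pool> <object>`, which reports the placement group and the acting set of OSDs, including the primary.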

So, it works almost the same in terms of data and network as HDFS does. (HDFS will usually do a local write for one of its copies; CephFS+Hadoop doesn't do that.) A few caveats though:
1) the plugin is not maintained very well. It was updated a few years ago for the Hadoop 2.x API changes, and I've seen a few PRs from users go by updating minor things, so it should still be good. But there's not any proactive work going on in the core upstream development teams.
2) Data you've currently got stored in CephFS is probably in 4MB chunks, as that's the default. When using Hadoop we default to 64MB for new data. Hadoop is unlikely to want to schedule a different job for each 4MB piece of data, so you will probably get more network traffic on your existing data than you'd otherwise expect.
-Greg
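
The second caveat can be put into numbers with a small back-of-the-envelope sketch. It takes the 4MB and 64MB figures from the message above and assumes, as a simplification, that the objects making up one split all sit on different hosts, so a task scheduled next to one object has to fetch the rest over the network. The function names are made up for illustration, and real placement may be more or less favorable.

```python
MIB = 1024 * 1024
TIB = 1024 ** 4


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division for positive integers."""
    return -(-a // b)


def objects_per_split(split_size: int, object_size: int) -> int:
    """Number of objects a single split of `split_size` bytes spans."""
    return ceil_div(split_size, object_size)


def splits_for_file(file_size: int, split_size: int) -> int:
    """Number of splits needed to cover a file of `file_size` bytes."""
    return ceil_div(file_size, split_size)


def max_local_fraction(split_size: int, object_size: int) -> float:
    """Fraction of a split that is local if exactly one of its objects is
    local (assumption: the other objects live on other hosts)."""
    return 1.0 / objects_per_split(split_size, object_size)


if __name__ == "__main__":
    split = 64 * MIB
    for obj in (4 * MIB, 64 * MIB):
        print(
            f"object_size={obj // MIB}MiB:",
            f"objects/split={objects_per_split(split, obj)},",
            f"local fraction={max_local_fraction(split, obj):.4f}",
        )
    print("splits for a 4TiB file:", splits_for_file(4 * TIB, split))
```

Under these assumptions a 64MB split over 4MB objects spans 16 objects, so only about 1/16 of its data can be read locally, whereas data written with 64MB objects can be read entirely from the local OSD.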
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
