Re: [PATCH] Expose Ceph data location information to Hadoop


Having been in the Hadoop code: each input split (a chunk of data to be processed) carries a set of hosts to run on. That list is not a hard limit; it's merely a strong suggestion. Hadoop will work, in the sense that the map() and reduce() functions will execute and results will be produced, with no locality information at all. But the whole point of Hadoop is in-situ processing, so we want to give Hadoop as many opportunities as possible to place the job on a host that has the data locally.
If only one host is specified per block of data, as I assume Alex's implementation did based on this discussion, then it's much more likely that a non-local host will read the data. Enabling Hadoop to read from replicas would make load balancing much more effective in terms of reads being local.
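To make the tradeoff concrete, here is a minimal, self-contained sketch (plain Java, no Hadoop dependencies; the class, record, and host names are all hypothetical, and the greedy scheduler is a deliberate simplification of what Hadoop's task scheduler actually does). It shows how a split that advertises all replica hosts, rather than only the primary, gives the scheduler more chances to make a local assignment:

```java
import java.util.*;

public class LocalityDemo {
    // A split knows which hosts hold a replica of its data (hypothetical model).
    record Split(String name, List<String> replicaHosts) {}

    // Greedy scheduler sketch: assign each split to a host that holds its data
    // locally if that host still has a free slot; otherwise fall back to any
    // free host (a remote, non-local read). Returns the number of local assignments.
    static int countLocal(List<Split> splits, Map<String, Integer> slots) {
        Map<String, Integer> free = new HashMap<>(slots);
        int local = 0;
        for (Split s : splits) {
            String chosen = null;
            for (String h : s.replicaHosts()) {              // prefer a local host
                if (free.getOrDefault(h, 0) > 0) { chosen = h; break; }
            }
            if (chosen != null) {
                local++;
            } else {                                          // remote fallback
                for (var e : free.entrySet())
                    if (e.getValue() > 0) { chosen = e.getKey(); break; }
            }
            if (chosen != null) free.merge(chosen, -1, Integer::sum);
        }
        return local;
    }

    public static void main(String[] args) {
        Map<String, Integer> slots = Map.of("hostA", 1, "hostB", 1, "hostC", 1);

        // Only the primary replica exposed: both splits point at hostA, which
        // has one free slot, so one split is forced onto a non-local host.
        List<Split> primaryOnly = List.of(
            new Split("s1", List.of("hostA")),
            new Split("s2", List.of("hostA")));

        // All replicas exposed: the scheduler can use hostC's copy instead.
        List<Split> allReplicas = List.of(
            new Split("s1", List.of("hostA", "hostB")),
            new Split("s2", List.of("hostA", "hostC")));

        System.out.println(countLocal(primaryOnly, slots));   // prints 1
        System.out.println(countLocal(allReplicas, slots));   // prints 2
    }
}
```

Under this toy model, exposing all replica locations turns one of the two reads from remote into local, which is exactly the effect being argued for above.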

-Joe Buck

On Nov 30, 2010, at 11:48 AM, Carlos Maltzahn wrote:

> 
> On Nov 30, 2010, at 11:04 AM, Noah Watkins wrote:
> 
>>> Are we sure JNI is a real problem?  It really seems like the right tool 
>>> for the job.  Greg seems to remember them asking who would maintain the 
>>> (non-java) JNI bits, but even if that's us and not them (which is probably 
>>> the way to go anyway), I don't see that that's a problem.
>> 
>> Yeah, it's sort of a wash. A nice goal would be for the patch to let Hadoop run without any additional components (i.e. JNI packages) from the Ceph repository. But given that the Ceph infrastructure will be installed anyway in the Hadoop case, it's a bit of a toss-up.
>> 
>> -n
>> 
>>> Let's start with just providing the primary replica, at least until we 
>>> find out whether hadoop takes advantage of additional ones (does HDFS read 
>>> from the local non-primary replica?).
>> 
>> I believe that Hadoop will schedule a map job at a local replica for load balancing, or to duplicate the work when a map is running slowly. Joe, can you confirm this?
> 
> Let me chime in: I believe Hadoop schedules based on a combination of load and locality (and maybe other variables). The more choices we give the task manager to satisfy both load and locality requirements, the better it will perform. Allowing Hadoop to see all replicas would also let us measure the tradeoff between mapper performance and the cost of additional replication.

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html