Re: [PATCH] Expose Ceph data location information to Hadoop

On Tue, 30 Nov 2010, Noah Watkins wrote:
> > Yeah, not pretty.  A shared library wouldn't really help here either, 
> > right?  And a command-line tool means additional overhead.  JNI (or 
> > equivalent) calling an ioctl seems like the most appropriate tool.
> 
> I like the motivation behind the command-line tool, but agree with you 
> on the overhead issues. The only other method for arbitrary 
> communication between processes that comes to mind for this situation is 
> a socket-based approach.
> 
> This could take two forms:
> 
> 1) A user-space daemon to service requests from Hadoop
> 2) A socket between kernel and user-space to service requests.
> 
> The former is unattractive because it requires additional client setup, 
> while the latter also poses challenges. However, if this approach seems 
> attractive, we could begin experimenting with the second option in 
> debugfs to avoid ABI lock-in.

Well, if the goal is just to get something working to test, then I would 
use JNI and use the ioctl; whether it can go upstream easily isn't 
relevant.  If the goal is something that can go upstream, then it 
needs a stable ABI, and debugfs isn't really a solution there either.  

Are we sure JNI is a real problem?  It really seems like the right tool 
for the job.  Greg seems to remember them asking who would maintain the 
(non-Java) JNI bits, but even if that's us and not them (which is probably 
the way to go anyway), I don't see that as a problem.

> >>> The ioctl struct ceph_ioctl_dataloc already returns the primary copy's 
> >>> object offset for an input file offset, though I think it would be a 
> >>> little more useful if it included replica offsets.
> >> I can submit a patch for this. Sage, I remember you mentioning that 
> >> reading from replicas might pose (scalability?) problems. Any thoughts 
> >> on this?
> > 
> > There are two things.  First, we'd need a DATALOC_V2 ioctl that would 
> > return locations for all replicas of the object.  Is Hadoop smart about 
> > scheduling jobs on the best replica?
> 
> Good question. I'm not sure what its scheduling policy is, but replica 
> location is a key part of the Hadoop API and is provided to the 
> scheduler by default.

Let's start with just providing the primary replica, at least until we 
find out whether Hadoop takes advantage of additional ones (does HDFS read 
from the local non-primary replica?).

sage


> 
> > Before going to the trouble, though, I want to make sure we'll really 
> > benefit from all of that...
> 
> I agree. This enhancement is orthogonal to the overall design.
> 
> Thanks,
> Noah
> 
> 

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

