Re: Hadoop DNS/topology details

Gregory Farnum <greg@xxxxxxxxxxx> · Tue, 19 Feb 2013 14:22:32 -0800

On Tue, Feb 19, 2013 at 2:10 PM, Noah Watkins <jayhawk@xxxxxxxxxxx> wrote:
> Here is the information that I've found so far regarding the operation of Hadoop w.r.t. DNS/topology. There are two parts: the file system client requirements, and the other consumers of topology information.
>
> -- File System Client --
>
> The relevant interface between the Hadoop VFS and its underlying file system is:
>
>   FileSystem:getFileBlockLocations(File, Extent)
>
> which is expected to return a list of hosts (a 3-tuple: hostname, IP, topology path) for each block that contains any part of the specified file extent. So, with triplication and 2 blocks, there are 2 * 3 = 6 3-tuples present.
>
>   *** Note: HDFS sorts each list of hosts based on a distance metric applied between the initiating file system client and each of the blocks in the list using the HDFS cluster map. This should not affect correctness, although it's possible that consumers of this list (e.g. MapReduce) may assume an ordering. ***

That is just truly annoying. Is this described anywhere in their docs?
I don't think it would be hard to sort, if we had some mechanism for
doing so (crush map nearness, presumably?), but if doing it wrong is
expensive in terms of performance we'll want some sort of contract to
code to.
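
For the sake of discussion, here is a minimal Python sketch of what sorting by nearness could look like if all we had were topology paths. The distance metric (hops up to the deepest common ancestor and back down) is an assumption, not a description of what HDFS does and not a CRUSH computation, and the function names are hypothetical.

```python
def path_distance(a, b):
    """Hops from a up to the deepest common ancestor and back down to b."""
    pa = [p for p in a.split("/") if p]
    pb = [p for p in b.split("/") if p]
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    return (len(pa) - common) + (len(pb) - common)


def sort_by_nearness(client_path, hosts):
    """Sort (hostname, ip, topology_path) tuples, nearest to the client first."""
    return sorted(hosts, key=lambda h: path_distance(client_path, h[2]))


# Made-up hosts, for illustration only.
hosts = [
    ("hostC", "10.0.2.3", "/rack2/hostC"),
    ("hostA", "10.0.1.1", "/rack1/hostA"),
    ("hostB", "10.0.1.2", "/rack1/hostB"),
]
ordered = sort_by_nearness("/rack1/hostA", hosts)
```

Here the local replica sorts first, then the same-rack replica, then the off-rack one. Python's sort is stable, so replicas at equal distance keep their original order.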

> The current Ceph client can produce the same list, but it includes neither hostname nor topology information. Currently, reverse DNS is used to fill in the hostname, and the topology defaults to a flat one in which all hosts share a single topology path: "/default-rack/host".
>
> - Reverse DNS could be quite slow:
>    - 3x replication * 1 TB / 64 MB blocks = 49152 lookups
>    - Caching lookups could help
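
The arithmetic and the caching idea above can be sketched in Python as follows. This assumes binary units (1 TB = 1024^4 bytes, 64 MB = 64 * 1024^2 bytes), which is what yields 49152. The `CachingResolver` class is hypothetical, and its resolver is injectable so the cache can be exercised without real DNS.

```python
import socket

TB = 1024 ** 4
MB = 1024 ** 2


def lookup_count(file_bytes, block_bytes, replication):
    """Number of (block, replica) pairs, i.e. uncached reverse lookups."""
    blocks = -(-file_bytes // block_bytes)  # ceiling division
    return blocks * replication


class CachingResolver:
    """Memoize IP -> hostname so each distinct address is resolved once."""

    def __init__(self, resolve=None):
        # Default to a real reverse lookup; tests can pass a fake instead.
        self._resolve = resolve or (lambda ip: socket.gethostbyaddr(ip)[0])
        self._cache = {}
        self.misses = 0

    def hostname(self, ip):
        if ip not in self._cache:
            self.misses += 1
            self._cache[ip] = self._resolve(ip)
        return self._cache[ip]


uncached = lookup_count(1 * TB, 64 * MB, 3)  # 16384 blocks * 3 replicas

# With a cache, the cost depends on distinct addresses, not on block count.
resolver = CachingResolver(resolve=lambda ip: "host-" + ip)
for i in range(uncached):
    resolver.hostname("10.0.0.%d" % (i % 3))  # pretend there are only 3 hosts
```

In this made-up run the resolver is asked 49152 times but performs only 3 real lookups, one per distinct address.
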
>
> -- Topology Information --
>
> Services that run on a Hadoop cluster (such as MapReduce) use hostname and topology information attached to each file system block to schedule and aggregate work based on various policies. These services don't have direct access to the HDFS cluster map, and instead rely on a service to provide a mapping:
>
>    DNS-names/IP -> topology path mapping
>
> This can be performed using a script/utility program that will perform bulk translations, or implemented in Java.
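
A minimal sketch of such a bulk-translation utility, in Python. The lookup table, its addresses and rack names are made up, and the convention of taking all names as arguments and printing one path per name in the same order is my assumption about how such a script would be invoked; check the Hadoop documentation for the exact contract.

```python
import sys

DEFAULT_RACK = "/default-rack"

# Hypothetical table; a real deployment would load this from a file or
# derive it from some cluster map.
RACKS = {
    "10.0.1.1": "/rack1",
    "10.0.1.2": "/rack1",
    "node3.example.com": "/rack2",
}


def resolve(names, table=RACKS, default=DEFAULT_RACK):
    """Map each DNS name or IP to a topology path, preserving input order."""
    return [table.get(name, default) for name in names]


if __name__ == "__main__":
    # Bulk translation: all names arrive as arguments, one path per name out.
    print(" ".join(resolve(sys.argv[1:])))
```

Unknown names fall back to a single default path, which reproduces the flat topology described earlier.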
>
> -- A Possible Approach --
>
> 1. Expand CephFS interface to return IP and hostname

Ceph doesn't store hostnames anywhere — it really can't do this. All
it has is IPs associated with OSD ID numbers. :) Adding hostnames
would be a monitor and map change, which we could do, but given the
issues we've had with hostnames in other contexts I'd really rather
not.
-Greg