Hadoop DNS/topology details

Noah Watkins <jayhawk@xxxxxxxxxxx> · Tue, 19 Feb 2013 14:10:44 -0800

Here is what I've found so far about how Hadoop uses DNS and topology information. There are two parts: the file system client requirements, and other consumers of topology information.

-- File System Client --

The relevant interface between the Hadoop VFS and its underlying file system is:

  FileSystem:getFileBlockLocations(File, Extent)

which is expected to return, for each block that contains any part of the specified file extent, a list of hosts, where each host is described by a 3-tuple (hostname, IP, topology path). So, with triplication and an extent covering 2 blocks, there are 2 * 3 = 6 3-tuples in total.
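
To make the counting concrete, here is a minimal Python sketch of the shape of that result. It is a model for illustration only, not the Hadoop API: `BlockHost`, `block_locations`, and the `fake_replicas` callback are hypothetical names, and the host data is made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockHost:
    """One replica location: the 3-tuple described above."""
    hostname: str
    ip: str
    topology_path: str


def block_locations(offset, length, block_size, replicas_for):
    """Return one host list per block that overlaps [offset, offset + length).

    replicas_for(block_index) is a caller-supplied (hypothetical) callback
    that returns the BlockHost list for a single block.
    """
    if length <= 0:
        return []
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return [replicas_for(i) for i in range(first, last + 1)]


def fake_replicas(block_index):
    # Made-up placement: three replicas per block, flat topology.
    hosts = []
    for r in range(3):
        name = f"host{block_index}-{r}"
        hosts.append(BlockHost(name, f"10.0.{block_index}.{r + 1}",
                               f"/default-rack/{name}"))
    return hosts


BLOCK = 64 * 1024 * 1024
locs = block_locations(0, 2 * BLOCK, BLOCK, fake_replicas)
total = sum(len(hosts) for hosts in locs)  # 2 blocks * 3 replicas = 6
```

Note that an extent of only a few bytes still returns two host lists if it straddles a block boundary.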

  *** Note: HDFS sorts each list of hosts by a distance metric, computed from the HDFS cluster map, between the initiating file system client and each host in the list. This should not affect correctness, although it's possible that consumers of this list (e.g. MapReduce) assume an ordering. ***
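
A sketch of what such a sort could look like, in Python. The exact HDFS metric is not described here, so this assumes a simple tree-hop distance over topology paths (hops from each node up to their closest common ancestor); `path_distance` and `sort_by_distance` are hypothetical names.

```python
def path_distance(a, b):
    """Tree-hop distance between two topology paths such as '/rack1/host1'.

    Assumption: distance is the number of hops from each node up to the
    closest common ancestor, summed over both nodes.
    """
    pa = [p for p in a.split("/") if p]
    pb = [p for p in b.split("/") if p]
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    return (len(pa) - common) + (len(pb) - common)


def sort_by_distance(client_path, host_paths):
    """Order replica paths nearest-first relative to the client.

    sorted() is stable, so equally distant hosts keep their input order.
    """
    return sorted(host_paths, key=lambda p: path_distance(client_path, p))


ordered = sort_by_distance("/r1/h1", ["/r2/h3", "/r1/h2", "/r1/h1"])
# ordered == ["/r1/h1", "/r1/h2", "/r2/h3"]
```

Under this assumption the same host is at distance 0, a host on the same rack at 2, and a host on another rack at 4.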

The current Ceph client can produce the same list, but includes neither hostname nor topology information. Currently, reverse DNS is used to fill in the hostname, and the topology defaults to a flat layout in which all hosts share a single topology path: "/default-rack/host".

- Reverse DNS could be quite slow:
   - 3x replication * (1 TB / 64 MB blocks) = 3 * 16384 = 49152 lookups
   - Caching lookups could help
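
For reference, a small Python sketch of the lookup count and of one way to cache reverse lookups. It assumes binary units (1 TB = 2^40 bytes, 64 MB = 2^26 bytes); `lookup_count` and `make_cached_resolver` are hypothetical helpers, and only `socket.gethostbyaddr` and `functools.lru_cache` come from the standard library.

```python
import socket
from functools import lru_cache


def lookup_count(data_bytes, block_bytes, replication):
    """Worst-case reverse lookups: one per replica of every block."""
    blocks = -(-data_bytes // block_bytes)  # ceiling division
    return blocks * replication


def reverse_dns(ip):
    """Resolve an IP to a hostname, falling back to the IP on failure."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return ip


def make_cached_resolver(resolve=reverse_dns, maxsize=4096):
    """Wrap a resolver so each distinct IP is looked up at most once
    (until the cache evicts it)."""
    @lru_cache(maxsize=maxsize)
    def cached(ip):
        return resolve(ip)
    return cached


n = lookup_count(2**40, 64 * 2**20, 3)  # 16384 blocks * 3 replicas = 49152
```

Since the number of distinct hosts in a cluster is far smaller than the number of block replicas, a cache like this should reduce the actual DNS traffic to roughly one lookup per host.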

-- Topology Information --

Services that run on a Hadoop cluster (such as MapReduce) use hostname and topology information attached to each file system block to schedule and aggregate work based on various policies. These services don't have direct access to the HDFS cluster map, and instead rely on a service to provide a mapping:

   DNS-names/IP -> topology path mapping

The mapping can be provided by a script or utility program that performs bulk translations, or it can be implemented in Java.
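
As an illustration of the script variant, here is a minimal Python sketch. It assumes the usual contract for such a script: it receives one or more hostnames/IPs as command-line arguments and prints one topology path per argument, in the same order. The table contents, `RACK_TABLE`, and `resolve_racks` are hypothetical; a Ceph-backed tool would derive the paths from the cluster's own map rather than a static table.

```python
import sys

DEFAULT_RACK = "/default-rack"

# Hypothetical static table, for illustration only.
RACK_TABLE = {
    "10.0.1.11": "/dc1/rack1",
    "node11.example.com": "/dc1/rack1",
    "10.0.2.21": "/dc1/rack2",
}


def resolve_racks(names, table=RACK_TABLE, default=DEFAULT_RACK):
    """Bulk-translate hostnames/IPs to topology paths, preserving order.

    Unknown names fall back to the default rack rather than failing.
    """
    return [table.get(name, default) for name in names]


def main(argv):
    # One output line per input argument.
    for rack in resolve_racks(argv[1:]):
        print(rack)


if __name__ == "__main__":
    main(sys.argv)
```

Saved as, say, `topology.py` (a made-up name), running `python topology.py 10.0.1.11 10.9.9.9` would print `/dc1/rack1` followed by `/default-rack`.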

-- A Possible Approach --

1. Expand the CephFS interface to return IP and hostname
2. Build a Ceph tool to perform DNS-name/IP -> topology path mapping

Using (2) from the Hadoop shim, we can perform distance sorting as well as resolve the topology information. The tool can also be used by other Hadoop services that make use of the topology.

This would seem like a good incremental step forward. There are a _lot_ of other analytics systems out there that might be interested in running on top of Ceph, including the next-generation Hadoop releases, all of which may have slightly different requirements. So wedding ourselves to an expansion of the CephFS API at this point might be a little premature. On the other hand, providing all information now should cover our bases later :)

- Noah