KC,

Thanks a lot for checking that out. I just went to investigate, and the work we have done on the locality/topology-aware features is sitting in a branch and has not been merged into the tree that is used to produce the JAR file you are using. I will get that cleaned up and merged soon, and I think that should solve your problem :)

-Noah

On Mon, Jul 8, 2013 at 6:22 PM, ker can <kercan74@xxxxxxxxx> wrote:
> Hi Noah,
>
> Okay, I think the current version may have a problem, though I haven't
> figured out where yet. It shows up when comparing the log messages against
> how the data blocks are distributed among the OSDs.
>
> The job tracker log had, for example, this output for the map task for the
> first split/block 0, which is executing on host vega7250:
>
> ....
> 2013-07-08 11:19:54,836 INFO org.apache.hadoop.mapred.JobTracker: Adding
> task (MAP) 'attempt_201307081115_0001_m_000000_0' to tip
> task_201307081115_0001_m_000000, for tracker
> 'tracker_vega7250:localhost/127.0.0.1:35422'
> ...
>
> If I look at how the blocks are divided up among the OSDs, block 0 for
> example is managed by OSD 2, which is running on host vega7249. However,
> our map task for block 0 is running on another host. Definitely not
> co-located.
>
> FILE OFFSET    OBJECT                  OFFSET    LENGTH      OSD
> 0              10000000dbe.00000000    0         67108864     2
> 67108864       10000000dbe.00000001    0         67108864    13
> 134217728      10000000dbe.00000002    0         67108864     5
> 201326592      10000000dbe.00000003    0         67108864     4
> ....
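[Editor's note: the mismatch above comes down to a simple check. The JobTracker only counts a map task as data-local when one of the hostnames reported for the split (ultimately from getFileBlockLocations()) matches the tasktracker's host; with block 0 reported on vega7249 but the task placed on vega7250, the check fails. A minimal, self-contained illustration of that comparison -- a hypothetical helper for clarity, not Hadoop's actual scheduling code:]

```java
import java.util.Arrays;

public class Locality {
    // True if the tracker's host appears among the hosts reported
    // for the split (i.e. the task would be "data-local").
    public static boolean isDataLocal(String trackerHost, String[] splitHosts) {
        return Arrays.asList(splitHosts).contains(trackerHost);
    }

    public static void main(String[] args) {
        // Hosts reported for block 0 (held by OSD 2 on vega7249).
        String[] block0Hosts = {"vega7249"};

        // Task placed on vega7250: no match, so a non-local task is chosen.
        System.out.println(isDataLocal("vega7250", block0Hosts)); // false

        // Task placed on vega7249: match, so a data-local task is chosen.
        System.out.println(isDataLocal("vega7249", block0Hosts)); // true
    }
}
```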
> Ceph osd tree:
>
> # id  weight  type name           up/down  reweight
> -1    14      root default
> -3    14        rack unknownrack
> -2    7           host vega7249
> 0     1             osd.0         up       1
> 1     1             osd.1         up       1
> 2     1             osd.2         up       1
> 3     1             osd.3         up       1
> 4     1             osd.4         up       1
> 5     1             osd.5         up       1
> 6     1             osd.6         up       1
> -4    7           host vega7250
> 10    1             osd.10        up       1
> 11    1             osd.11        up       1
> 12    1             osd.12        up       1
> 13    1             osd.13        up       1
> 7     1             osd.7         up       1
> 8     1             osd.8         up       1
> 9     1             osd.9         up       1
>
> Thanks,
> KC
>
> On Mon, Jul 8, 2013 at 3:36 PM, Noah Watkins <noah.watkins@xxxxxxxxxxx>
> wrote:
>>
>> Yes, all of the code needed to get the locality information should be
>> present in the version of the jar file you referenced. We have tested
>> to make sure the right data is available, but have not extensively
>> tested that it is being used correctly by core Hadoop (e.g. that it is
>> being correctly propagated out of CephFileSystem). IIRC, fixing this
>> /should/ be pretty easy; it is a matter of fiddling with
>> getFileBlockLocations().
>>
>> On Mon, Jul 8, 2013 at 1:25 PM, ker can <kercan74@xxxxxxxxx> wrote:
>> > Hi Noah,
>> >
>> > I'm using the CephFS jar from
>> > http://ceph.com/download/hadoop-cephfs.jar
>> > I believe this is built from hadoop-common/cephfs/branch-1.0?
>> >
>> > If that's the case, I should already be using an implementation that
>> > has getFileBlockLocations(), which is here:
>> > https://github.com/ceph/hadoop-common/blob/cephfs/branch-1.0/src/core/org/apache/hadoop/fs/ceph/CephFileSystem.java
>> >
>> > Is there a command-line tool that I can use to verify the results from
>> > getFileBlockLocations()?
>> >
>> > Thanks,
>> > KC
>> >
>> > On Mon, Jul 8, 2013 at 3:09 PM, Noah Watkins <noah.watkins@xxxxxxxxxxx>
>> > wrote:
>> >>
>> >> Hi KC,
>> >>
>> >> The locality information is now collected and available to Hadoop
>> >> through the CephFS API, so fixing this is certainly possible. However,
>> >> there has not been extensive testing.
>> >> I think the tasks that need to be completed are: (1) make sure that
>> >> `CephFileSystem` is encoding the correct block locations in
>> >> `getFileBlockLocations` (which I think is currently done, but does
>> >> need to be verified), and (2) make sure rack information is available
>> >> to the jobtracker, or optionally use a flat hierarchy (i.e.
>> >> default-rack).
>> >>
>> >> On Mon, Jul 8, 2013 at 12:47 PM, ker can <kercan74@xxxxxxxxx> wrote:
>> >> > Hi There,
>> >> >
>> >> > I'm test-driving Hadoop with CephFS as the storage layer. I was
>> >> > running the Terasort benchmark and I noticed a lot of network I/O
>> >> > activity compared to an HDFS storage-layer setup. (It's a
>> >> > half-a-terabyte sort workload over two data nodes.)
>> >> >
>> >> > Digging into the job tracker logs a little, I noticed that all the
>> >> > map tasks were being assigned to process a split (block) on
>> >> > non-local nodes, which explains all the network activity during the
>> >> > map phase.
>> >> >
>> >> > With Ceph:
>> >> >
>> >> > 2013-07-08 11:19:53,535 INFO org.apache.hadoop.mapred.JobInProgress: Input
>> >> > size for job job_201307081115_0001 = 500000000000. Number of splits = 7452
>> >> > 2013-07-08 11:19:53,538 INFO org.apache.hadoop.mapred.JobInProgress: Job
>> >> > job_201307081115_0001 initialized successfully with 7452 map tasks and
>> >> > 32 reduce tasks.
>> >> >
>> >> > 2013-07-08 11:19:54,836 INFO org.apache.hadoop.mapred.JobInProgress:
>> >> > Choosing a non-local task task_201307081115_0001_m_000000
>> >> > 2013-07-08 11:19:54,836 INFO org.apache.hadoop.mapred.JobTracker: Adding
>> >> > task (MAP) 'attempt_201307081115_0001_m_000000_0' to tip
>> >> > task_201307081115_0001_m_000000, for tracker
>> >> > 'tracker_vega7250:localhost/127.0.0.1:35422'
>> >> >
>> >> > 2013-07-08 11:19:54,990 INFO org.apache.hadoop.mapred.JobInProgress:
>> >> > Choosing a non-local task task_201307081115_0001_m_000001
>> >> > 2013-07-08 11:19:54,990 INFO org.apache.hadoop.mapred.JobTracker: Adding
>> >> > task (MAP) 'attempt_201307081115_0001_m_000001_0' to tip
>> >> > task_201307081115_0001_m_000001, for tracker
>> >> > 'tracker_vega7249:localhost/127.0.0.1:36725'
>> >> >
>> >> > ... and so on.
>> >> >
>> >> > In comparison, with HDFS the job tracker logs looked something like
>> >> > this. The map tasks were being assigned to process data blocks on the
>> >> > local nodes.
>> >> >
>> >> > 2013-07-08 03:55:32,656 INFO org.apache.hadoop.mapred.JobInProgress: Input
>> >> > size for job job_201307080351_0001 = 500000000000. Number of splits = 7452
>> >> > 2013-07-08 03:55:32,657 INFO org.apache.hadoop.mapred.JobInProgress:
>> >> > tip:task_201307080351_0001_m_000000 has split on
>> >> > node:/default-rack/vega7247
>> >> > 2013-07-08 03:55:32,657 INFO org.apache.hadoop.mapred.JobInProgress:
>> >> > tip:task_201307080351_0001_m_000001 has split on
>> >> > node:/default-rack/vega7247
>> >> > 2013-07-08 03:55:34,474 INFO org.apache.hadoop.mapred.JobTracker: Adding
>> >> > task (MAP) 'attempt_201307080351_0001_m_000000_0' to tip
>> >> > task_201307080351_0001_m_000000, for tracker
>> >> > 'tracker_vega7247:localhost/127.0.0.1:43320'
>> >> > 2013-07-08 03:55:34,475 INFO org.apache.hadoop.mapred.JobInProgress:
>> >> > Choosing data-local task task_201307080351_0001_m_000000
>> >> > 2013-07-08 03:55:34,475 INFO org.apache.hadoop.mapred.JobTracker: Adding
>> >> > task (MAP) 'attempt_201307080351_0001_m_000001_0' to tip
>> >> > task_201307080351_0001_m_000001, for tracker
>> >> > 'tracker_vega7247:localhost/127.0.0.1:43320'
>> >> > 2013-07-08 03:55:34,475 INFO org.apache.hadoop.mapred.JobInProgress:
>> >> > Choosing data-local task task_201307080351_0001_m_000001
>> >> >
>> >> > Version info:
>> >> > ceph version 0.61.4
>> >> > hadoop 1.1.2
>> >> >
>> >> > Has anyone else run into this?
>> >> >
>> >> > Thanks
>> >> > KC
>> >> >
>> >> > _______________________________________________
>> >> > ceph-users mailing list
>> >> > ceph-users@xxxxxxxxxxxxxx
>> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
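[Editor's note: KC asked above whether there is a command-line tool to verify the results of getFileBlockLocations(). Hadoop 1.x does not ship one for non-HDFS filesystems, but a small driver against the public FileSystem API will print what the filesystem reports, so the hosts can be compared against the `ceph osd tree` / file-map output. A sketch only, assuming the CephFS jar, libcephfs bindings, and a core-site.xml pointing at ceph:// are on the classpath; the class name is ours:]

```java
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PrintBlockLocations {
    public static void main(String[] args) throws Exception {
        // Picks up fs.default.name etc. from core-site.xml on the classpath.
        Configuration conf = new Configuration();
        Path path = new Path(args[0]);
        FileSystem fs = path.getFileSystem(conf);

        // Ask the filesystem for the locations of every block in the file;
        // for CephFS this exercises CephFileSystem.getFileBlockLocations().
        FileStatus stat = fs.getFileStatus(path);
        BlockLocation[] locs = fs.getFileBlockLocations(stat, 0, stat.getLen());

        for (BlockLocation loc : locs) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                loc.getOffset(), loc.getLength(),
                Arrays.toString(loc.getHosts()));
        }
        fs.close();
    }
}
```

If locality is wired up correctly, the hosts printed for each offset should match the hosts of the OSDs shown in the file-map table earlier in the thread.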