Awesome! Thanks for testing that out. I'll be merging that branch soon,
and will let you know when the new jar is published.

-Noah

On Tue, Jul 9, 2013 at 9:53 AM, ker can <kercan74@xxxxxxxxx> wrote:
> I applied the changes from the fs/ceph directory in the -topo branch
> and now it looks better - the map tasks are running on the same nodes
> as the splits they're processing. Good stuff!
>
>
> On Mon, Jul 8, 2013 at 9:18 PM, Noah Watkins <noah.watkins@xxxxxxxxxxx>
> wrote:
>>
>> You might want to create a new branch and cherry-pick the
>> topology-relevant commits (I think there are 1 or 2) from the -topo
>> branch into cephfs/branch-1.0. I'm not sure what the -topo branch
>> might be missing as far as bug fixes and such.
>>
>> On Mon, Jul 8, 2013 at 7:11 PM, ker can <kercan74@xxxxxxxxx> wrote:
>> > Yep, I'm running Cuttlefish ... I'll try building out of that
>> > branch and let you know how that goes.
>> >
>> > -KC
>> >
>> >
>> > On Mon, Jul 8, 2013 at 9:06 PM, Noah Watkins
>> > <noah.watkins@xxxxxxxxxxx> wrote:
>> >>
>> >> FYI, here is the patch as it currently stands:
>> >>
>> >> https://github.com/ceph/hadoop-common/compare/cephfs;branch-1.0...cephfs;branch-1.0-topo
>> >>
>> >> I have not tested it recently, but it looks like it should be
>> >> close to correct. Feel free to test it out; I won't be able to
>> >> get to it until tomorrow or Wednesday.
>> >>
>> >> Are you running Cuttlefish? I believe it has all the dependencies.
>> >>
>> >> On Mon, Jul 8, 2013 at 7:00 PM, Noah Watkins
>> >> <noah.watkins@xxxxxxxxxxx> wrote:
>> >> > KC,
>> >> >
>> >> > Thanks a lot for checking that out. I just went to investigate,
>> >> > and the work we have done on the locality/topology-aware
>> >> > features is sitting in a branch that has not been merged into
>> >> > the tree used to produce the JAR file you are using. I will get
>> >> > that cleaned up and merged soon, and I think that should solve
>> >> > your problems :)
>> >> >
>> >> > -Noah
>> >> >
>> >> > On Mon, Jul 8, 2013 at 6:22 PM, ker can <kercan74@xxxxxxxxx> wrote:
>> >> >> Hi Noah, okay, I think the current version may have a problem;
>> >> >> I haven't figured out where yet. I've been looking at the log
>> >> >> messages and at how the data blocks are distributed among the
>> >> >> OSDs.
>> >> >>
>> >> >> So, the job tracker log had, for example, this output for the
>> >> >> map task for the first split/block 0, which it's executing on
>> >> >> host vega7250:
>> >> >>
>> >> >> ....
>> >> >> 2013-07-08 11:19:54,836 INFO org.apache.hadoop.mapred.JobTracker:
>> >> >> Adding task (MAP) 'attempt_201307081115_0001_m_000000_0' to tip
>> >> >> task_201307081115_0001_m_000000, for tracker
>> >> >> 'tracker_vega7250:localhost/127.0.0.1:35422'
>> >> >> ....
>> >> >>
>> >> >> If I look at how the blocks are divided up among the OSDs,
>> >> >> block 0 for example is managed by OSD#2, which is running on
>> >> >> host vega7249. However, our map task for block 0 is running on
>> >> >> another host. Definitely not co-located.
>> >> >>
>> >> >> FILE OFFSET   OBJECT                 OFFSET   LENGTH     OSD
>> >> >> 0             10000000dbe.00000000   0        67108864   2
>> >> >> 67108864      10000000dbe.00000001   0        67108864   13
>> >> >> 134217728     10000000dbe.00000002   0        67108864   5
>> >> >> 201326592     10000000dbe.00000003   0        67108864   4
>> >> >> ....
>> >> >>
>> >> >> Ceph osd tree:
>> >> >>
>> >> >> # id   weight   type name        up/down   reweight
>> >> >> -1     14       root default
>> >> >> -3     14       rack unknownrack
>> >> >> -2     7        host vega7249
>> >> >> 0      1        osd.0            up        1
>> >> >> 1      1        osd.1            up        1
>> >> >> 2      1        osd.2            up        1
>> >> >> 3      1        osd.3            up        1
>> >> >> 4      1        osd.4            up        1
>> >> >> 5      1        osd.5            up        1
>> >> >> 6      1        osd.6            up        1
>> >> >> -4     7        host vega7250
>> >> >> 10     1        osd.10           up        1
>> >> >> 11     1        osd.11           up        1
>> >> >> 12     1        osd.12           up        1
>> >> >> 13     1        osd.13           up        1
>> >> >> 7      1        osd.7            up        1
>> >> >> 8      1        osd.8            up        1
>> >> >> 9      1        osd.9            up        1
>> >> >>
>> >> >> Thanks
>> >> >> KC
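
[A cross-check like the one above can also be done from the Hadoop side,
by asking the FileSystem directly what block locations it reports rather
than reading JobTracker logs. Below is a minimal sketch of such a checker;
the class name PrintBlockLocations is hypothetical, not a tool from this
thread. It assumes the CephFS jar and hadoop-core are on the classpath and
that fs.default.name in core-site.xml points at the Ceph filesystem, and
it prints the hosts reported for each block so they can be compared
against the extent/OSD table above.

    import java.util.Arrays;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PrintBlockLocations {
        public static void main(String[] args) throws Exception {
            // Reads core-site.xml from the classpath; fs.default.name
            // should be the ceph:// URI so the path resolves to
            // CephFileSystem.
            Configuration conf = new Configuration();
            Path path = new Path(args[0]);
            FileSystem fs = path.getFileSystem(conf);

            FileStatus status = fs.getFileStatus(path);
            BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());

            // One line per block: the offsets should line up with the
            // extent table, and the hosts are what the JobTracker sees.
            for (BlockLocation b : blocks) {
                System.out.println("offset=" + b.getOffset()
                    + " length=" + b.getLength()
                    + " hosts=" + Arrays.toString(b.getHosts()));
            }
            fs.close();
        }
    }

Run with the file path as the only argument (e.g. the Terasort input
file). If the hosts come back empty or wrong, the problem is in
getFileBlockLocations() itself rather than in the JobTracker's
scheduling.]
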
>> >> >>
>> >> >> On Mon, Jul 8, 2013 at 3:36 PM, Noah Watkins
>> >> >> <noah.watkins@xxxxxxxxxxx> wrote:
>> >> >>>
>> >> >>> Yes, all of the code needed to get the locality information
>> >> >>> should be present in the version of the jar file you
>> >> >>> referenced. We have tested to make sure the right data is
>> >> >>> available, but have not extensively tested that it is being
>> >> >>> used correctly by core Hadoop (e.g. that it is being correctly
>> >> >>> propagated out of CephFileSystem). IIRC fixing this /should/
>> >> >>> be pretty easy: just fiddling with getFileBlockLocations.
>> >> >>>
>> >> >>> On Mon, Jul 8, 2013 at 1:25 PM, ker can <kercan74@xxxxxxxxx> wrote:
>> >> >>> > Hi Noah,
>> >> >>> >
>> >> >>> > I'm using the CephFS jar from ...
>> >> >>> > http://ceph.com/download/hadoop-cephfs.jar
>> >> >>> > I believe this is built from hadoop-common/cephfs/branch-1.0?
>> >> >>> >
>> >> >>> > If that's the case, I should already be using an
>> >> >>> > implementation that's got getFileBlockLocations() ... which
>> >> >>> > is here:
>> >> >>> >
>> >> >>> > https://github.com/ceph/hadoop-common/blob/cephfs/branch-1.0/src/core/org/apache/hadoop/fs/ceph/CephFileSystem.java
>> >> >>> >
>> >> >>> > Is there a command line tool that I can use to verify the
>> >> >>> > results from getFileBlockLocations()?
>> >> >>> >
>> >> >>> > thanks
>> >> >>> > KC
>> >> >>> >
>> >> >>> >
>> >> >>> > On Mon, Jul 8, 2013 at 3:09 PM, Noah Watkins
>> >> >>> > <noah.watkins@xxxxxxxxxxx> wrote:
>> >> >>> >>
>> >> >>> >> Hi KC,
>> >> >>> >>
>> >> >>> >> The locality information is now collected and available to
>> >> >>> >> Hadoop through the CephFS API, so fixing this is certainly
>> >> >>> >> possible. However, there has not been extensive testing. I
>> >> >>> >> think the tasks that need to be completed are (1) make sure
>> >> >>> >> that `CephFileSystem` is encoding the correct block
>> >> >>> >> location in `getFileBlockLocations` (which I think is
>> >> >>> >> currently complete, but does need to be verified), and (2)
>> >> >>> >> make sure rack information is available in the jobtracker,
>> >> >>> >> or optionally use a flat hierarchy (i.e. /default-rack).
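
[Concretely, items (1) and (2) come down to the shape of the
BlockLocation objects the FileSystem hands back to Hadoop. The sketch
below is illustrative only, not the actual CephFileSystem code: the
class and helper names, the port number, and the assumption that the
extent-to-host resolution has already happened are all hypothetical. It
builds a location record for block 0 from the extent table, using the
flat /default-rack hierarchy from (2).

    import org.apache.hadoop.fs.BlockLocation;

    public class LocationSketch {
        // Hypothetical helper: build the location record for one
        // extent. 'hosts' would come from mapping the extent's OSDs to
        // their hosts (e.g. block 0 -> osd.2 -> vega7249, per the osd
        // tree earlier in the thread).
        static BlockLocation locationFor(long offset, long length,
                                         String[] hosts) {
            String[] names = new String[hosts.length]; // "host:port" ids
            String[] racks = new String[hosts.length]; // topology paths
            for (int i = 0; i < hosts.length; i++) {
                names[i] = hosts[i] + ":6789";          // port illustrative
                racks[i] = "/default-rack/" + hosts[i]; // flat hierarchy
            }
            return new BlockLocation(names, hosts, racks, offset, length);
        }

        public static void main(String[] args) {
            // Block 0: 64 MB extent served by osd.2 on host vega7249.
            BlockLocation block0 =
                locationFor(0L, 67108864L, new String[] { "vega7249" });
            System.out.println(block0); // prints offset, length, hosts
        }
    }

If the hosts array is right but tasks still schedule non-locally, the
topology paths, or the JobTracker's rack resolution, are the usual
suspect, which is what (2) is about.]
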
>> >> >>> >>
>> >> >>> >> On Mon, Jul 8, 2013 at 12:47 PM, ker can <kercan74@xxxxxxxxx>
>> >> >>> >> wrote:
>> >> >>> >> > Hi There,
>> >> >>> >> >
>> >> >>> >> > I'm test driving Hadoop with CephFS as the storage layer.
>> >> >>> >> > I was running the Terasort benchmark and I noticed a lot
>> >> >>> >> > of network IO activity when compared to an HDFS storage
>> >> >>> >> > layer setup. (It's a half-a-terabyte sort workload over
>> >> >>> >> > two data nodes.)
>> >> >>> >> >
>> >> >>> >> > Digging into the job tracker logs a little, I noticed
>> >> >>> >> > that all the map tasks were being assigned to process a
>> >> >>> >> > split (block) on non-local nodes, which explains all the
>> >> >>> >> > network activity during the map phase.
>> >> >>> >> >
>> >> >>> >> > With Ceph:
>> >> >>> >> >
>> >> >>> >> > 2013-07-08 11:19:53,535 INFO org.apache.hadoop.mapred.JobInProgress:
>> >> >>> >> > Input size for job job_201307081115_0001 = 500000000000.
>> >> >>> >> > Number of splits = 7452
>> >> >>> >> > 2013-07-08 11:19:53,538 INFO org.apache.hadoop.mapred.JobInProgress:
>> >> >>> >> > Job job_201307081115_0001 initialized successfully with
>> >> >>> >> > 7452 map tasks and 32 reduce tasks.
>> >> >>> >> >
>> >> >>> >> > 2013-07-08 11:19:54,836 INFO org.apache.hadoop.mapred.JobInProgress:
>> >> >>> >> > Choosing a non-local task task_201307081115_0001_m_000000
>> >> >>> >> > 2013-07-08 11:19:54,836 INFO org.apache.hadoop.mapred.JobTracker:
>> >> >>> >> > Adding task (MAP) 'attempt_201307081115_0001_m_000000_0' to tip
>> >> >>> >> > task_201307081115_0001_m_000000, for tracker
>> >> >>> >> > 'tracker_vega7250:localhost/127.0.0.1:35422'
>> >> >>> >> >
>> >> >>> >> > 2013-07-08 11:19:54,990 INFO org.apache.hadoop.mapred.JobInProgress:
>> >> >>> >> > Choosing a non-local task task_201307081115_0001_m_000001
>> >> >>> >> > 2013-07-08 11:19:54,990 INFO org.apache.hadoop.mapred.JobTracker:
>> >> >>> >> > Adding task (MAP) 'attempt_201307081115_0001_m_000001_0' to tip
>> >> >>> >> > task_201307081115_0001_m_000001, for tracker
>> >> >>> >> > 'tracker_vega7249:localhost/127.0.0.1:36725'
>> >> >>> >> >
>> >> >>> >> > ... and so on.
>> >> >>> >> >
>> >> >>> >> > In comparison with HDFS, the job tracker logs looked
>> >> >>> >> > something like this. The map tasks were being assigned to
>> >> >>> >> > process data blocks on the local nodes.
>> >> >>> >> >
>> >> >>> >> > 2013-07-08 03:55:32,656 INFO org.apache.hadoop.mapred.JobInProgress:
>> >> >>> >> > Input size for job job_201307080351_0001 = 500000000000.
>> >> >>> >> > Number of splits = 7452
>> >> >>> >> > 2013-07-08 03:55:32,657 INFO org.apache.hadoop.mapred.JobInProgress:
>> >> >>> >> > tip:task_201307080351_0001_m_000000 has split on
>> >> >>> >> > node:/default-rack/vega7247
>> >> >>> >> > 2013-07-08 03:55:32,657 INFO org.apache.hadoop.mapred.JobInProgress:
>> >> >>> >> > tip:task_201307080351_0001_m_000001 has split on
>> >> >>> >> > node:/default-rack/vega7247
>> >> >>> >> > 2013-07-08 03:55:34,474 INFO org.apache.hadoop.mapred.JobTracker:
>> >> >>> >> > Adding task (MAP) 'attempt_201307080351_0001_m_000000_0' to tip
>> >> >>> >> > task_201307080351_0001_m_000000, for tracker
>> >> >>> >> > 'tracker_vega7247:localhost/127.0.0.1:43320'
>> >> >>> >> > 2013-07-08 03:55:34,475 INFO org.apache.hadoop.mapred.JobInProgress:
>> >> >>> >> > Choosing data-local task task_201307080351_0001_m_000000
>> >> >>> >> > 2013-07-08 03:55:34,475 INFO org.apache.hadoop.mapred.JobTracker:
>> >> >>> >> > Adding task (MAP) 'attempt_201307080351_0001_m_000001_0' to tip
>> >> >>> >> > task_201307080351_0001_m_000001, for tracker
>> >> >>> >> > 'tracker_vega7247:localhost/127.0.0.1:43320'
>> >> >>> >> > 2013-07-08 03:55:34,475 INFO org.apache.hadoop.mapred.JobInProgress:
>> >> >>> >> > Choosing data-local task task_201307080351_0001_m_000001
>> >> >>> >> >
>> >> >>> >> > Version Info:
>> >> >>> >> > ceph version 0.61.4
>> >> >>> >> > hadoop 1.1.2
>> >> >>> >> >
>> >> >>> >> > Has anyone else run into this?
>> >> >>> >> >
>> >> >>> >> > Thanks
>> >> >>> >> > KC

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com