I applied the changes from the fs/ceph directory in the -topo branch and now it looks better - the map tasks are running on the same nodes as the splits they're processing. Good stuff!
On Mon, Jul 8, 2013 at 9:18 PM, Noah Watkins <noah.watkins@xxxxxxxxxxx> wrote:
You might want to create a new branch and cherry-pick the topology-relevant
commits (I think there are 1 or 2) from the -topo branch into
cephfs/branch-1.0. I'm not sure what the -topo branch might be
missing as far as bug fixes and such.
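
Something along these lines should do it (untested; assumes the
ceph/hadoop-common repo is your 'origin' remote, and the SHAs are whatever
git log shows for the topology changes):

  git checkout -b branch-1.0-plus-topo origin/cephfs/branch-1.0
  git log --oneline origin/cephfs/branch-1.0..origin/cephfs/branch-1.0-topo   # list the topology commits
  git cherry-pick <sha>                                                       # repeat for each one
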
On Mon, Jul 8, 2013 at 7:11 PM, ker can <kercan74@xxxxxxxxx> wrote:
> Yep, I'm running cuttlefish ... I'll try building out of that branch and let
> you know how that goes.
>
> -KC
>
>
> On Mon, Jul 8, 2013 at 9:06 PM, Noah Watkins <noah.watkins@xxxxxxxxxxx>
> wrote:
>>
>> FYI, here is the patch as it currently stands:
>>
>>
>> https://github.com/ceph/hadoop-common/compare/cephfs;branch-1.0...cephfs;branch-1.0-topo
>>
>> I have not tested it recently, but it looks like it should be close to
>> correct. Feel free to test it out; I won't be able to get to it until
>> tomorrow or Wednesday.
>>
>> Are you running Cuttlefish? I believe it has all the dependencies.
>>
>> On Mon, Jul 8, 2013 at 7:00 PM, Noah Watkins <noah.watkins@xxxxxxxxxxx>
>> wrote:
>> > KC,
>> >
>> > Thanks a lot for checking that out. I just went to investigate, and
>> > the work we have done on the locality/topology-aware features is
>> > sitting in a branch and has not been merged into the tree that is
>> > used to produce the JAR file you are using. I will get that cleaned up
>> > and merged soon, and I think that should solve your problems :)
>> >
>> > -Noah
>> >
>> > On Mon, Jul 8, 2013 at 6:22 PM, ker can <kercan74@xxxxxxxxx> wrote:
>> >> Hi Noah, okay, I think the current version may have a problem; I haven't
>> >> figured out where yet. I'm looking at the log messages and at how the data
>> >> blocks are distributed among the OSDs.
>> >>
>> >> So the job tracker log had, for example, this output for the map task for
>> >> the first split/block 0, which is executing on host vega7250:
>> >>
>> >> ....
>> >> 2013-07-08 11:19:54,836 INFO org.apache.hadoop.mapred.JobTracker: Adding
>> >>   task (MAP) 'attempt_201307081115_0001_m_000000_0' to tip
>> >>   task_201307081115_0001_m_000000, for tracker
>> >>   'tracker_vega7250:localhost/127.0.0.1:35422'
>> >> ...
>> >>
>> >> If I look at how the blocks are divided up among the OSDs, block 0, for
>> >> example, is managed by OSD#2, which is running on host vega7249. However,
>> >> our map task for block 0 is running on another host. Definitely not
>> >> co-located.
>> >>
>> >>
>> >>
>> >> FILE OFFSET    OBJECT                  OFFSET    LENGTH      OSD
>> >> 0              10000000dbe.00000000    0         67108864    2
>> >> 67108864       10000000dbe.00000001    0         67108864    13
>> >> 134217728      10000000dbe.00000002    0         67108864    5
>> >> 201326592      10000000dbe.00000003    0         67108864    4
>> >> ....
>> >>
>> >> Ceph osd tree:
>> >>
>> >> # id    weight  type name           up/down reweight
>> >> -1      14      root default
>> >> -3      14          rack unknownrack
>> >> -2      7               host vega7249
>> >> 0       1                   osd.0   up      1
>> >> 1       1                   osd.1   up      1
>> >> 2       1                   osd.2   up      1
>> >> 3       1                   osd.3   up      1
>> >> 4       1                   osd.4   up      1
>> >> 5       1                   osd.5   up      1
>> >> 6       1                   osd.6   up      1
>> >> -4      7               host vega7250
>> >> 10      1                   osd.10  up      1
>> >> 11      1                   osd.11  up      1
>> >> 12      1                   osd.12  up      1
>> >> 13      1                   osd.13  up      1
>> >> 7       1                   osd.7   up      1
>> >> 8       1                   osd.8   up      1
>> >> 9       1                   osd.9   up      1
>> >>
>> >>
>> >> Thanks
>> >> KC
>> >>
>> >>
>> >> On Mon, Jul 8, 2013 at 3:36 PM, Noah Watkins <noah.watkins@xxxxxxxxxxx>
>> >> wrote:
>> >>>
>> >>> Yes, all of the code needed to get the locality information should be
>> >>> present in the version of the jar file you referenced. We have tested a
>> >>> bit to make sure the right data is available, but have not extensively
>> >>> tested that it is being used correctly by core Hadoop (e.g. that it is
>> >>> being correctly propagated out of CephFileSystem). IIRC fixing this
>> >>> /should/ be pretty easy; mostly fiddling with getFileBlockLocations.
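>> >>>
>> >>> I don't know of a command line tool for this off the top of my head, but
>> >>> one quick way to see what Hadoop is actually getting back is a tiny driver
>> >>> that just dumps the locations. Something roughly like this (untested
>> >>> sketch, class name is made up; it assumes the same core-site.xml your jobs
>> >>> use so that the ceph:// scheme resolves):
>> >>>
>> >>>   import java.util.Arrays;
>> >>>   import org.apache.hadoop.conf.Configuration;
>> >>>   import org.apache.hadoop.fs.BlockLocation;
>> >>>   import org.apache.hadoop.fs.FileStatus;
>> >>>   import org.apache.hadoop.fs.FileSystem;
>> >>>   import org.apache.hadoop.fs.Path;
>> >>>
>> >>>   public class DumpLocations {
>> >>>     public static void main(String[] args) throws Exception {
>> >>>       // args[0] is a ceph:// path, e.g. one of the terasort input files
>> >>>       Path path = new Path(args[0]);
>> >>>       FileSystem fs = FileSystem.get(path.toUri(), new Configuration());
>> >>>       FileStatus st = fs.getFileStatus(path);
>> >>>       // ask the FileSystem (CephFileSystem here) where every block lives
>> >>>       for (BlockLocation loc : fs.getFileBlockLocations(st, 0, st.getLen())) {
>> >>>         System.out.println(loc.getOffset() + " len=" + loc.getLength()
>> >>>             + " hosts=" + Arrays.toString(loc.getHosts()));
>> >>>       }
>> >>>     }
>> >>>   }
>> >>>
>> >>> Run it with the hadoop launcher against the same config and jars. If the
>> >>> hosts it prints are empty or don't match where the objects actually live,
>> >>> the problem is on our side in CephFileSystem rather than in core Hadoop.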
>> >>>
>> >>> On Mon, Jul 8, 2013 at 1:25 PM, ker can <kercan74@xxxxxxxxx> wrote:
>> >>> > Hi Noah,
>> >>> >
>> >>> > I'm using the CephFS jar from ...
>> >>> > http://ceph.com/download/hadoop-cephfs.jar
>> >>> > I believe this is built from hadoop-common/cephfs/branch-1.0?
>> >>> >
>> >>> > If that's the case, I should already be using an implementation that's
>> >>> > got getFileBlockLocations() ... which is here:
>> >>> >
>> >>> >
>> >>> > https://github.com/ceph/hadoop-common/blob/cephfs/branch-1.0/src/core/org/apache/hadoop/fs/ceph/CephFileSystem.java
>> >>> >
>> >>> > Is there a command line tool that I can use to verify the results from
>> >>> > getFileBlockLocations()?
>> >>> >
>> >>> > thanks
>> >>> > KC
>> >>> >
>> >>> >
>> >>> >
>> >>> > On Mon, Jul 8, 2013 at 3:09 PM, Noah Watkins
>> >>> > <noah.watkins@xxxxxxxxxxx>
>> >>> > wrote:
>> >>> >>
>> >>> >> Hi KC,
>> >>> >>
>> >>> >> The locality information is now collected and available to Hadoop
>> >>> >> through the CephFS API, so fixing this is certainly possible. However,
>> >>> >> there has not been extensive testing. I think the tasks that need to be
>> >>> >> completed are (1) make sure that `CephFileSystem` is encoding the correct
>> >>> >> block locations in `getFileBlockLocations` (which I think it currently
>> >>> >> does, but that needs to be verified), and (2) make sure rack information
>> >>> >> is available in the jobtracker, or optionally use a flat hierarchy (i.e.
>> >>> >> default-rack).
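>> >>> >>
>> >>> >> For (1), the end result should be roughly one BlockLocation per object,
>> >>> >> carrying the hostnames of the OSDs that hold that object, so the
>> >>> >> JobTracker can match splits to tasktrackers by hostname. A very rough
>> >>> >> sketch of the shape (not the actual code in the branch; lookupExtentOsds
>> >>> >> and osdToHostname are placeholders for whatever the CephFS bindings
>> >>> >> expose, not real method names):
>> >>> >>
>> >>> >>   public BlockLocation[] getFileBlockLocations(FileStatus file,
>> >>> >>       long start, long len) throws IOException {
>> >>> >>     long blockSize = file.getBlockSize();  // object size for this file
>> >>> >>     List<BlockLocation> locs = new ArrayList<BlockLocation>();
>> >>> >>     // walk the requested byte range one object at a time
>> >>> >>     for (long off = (start / blockSize) * blockSize;
>> >>> >>          off < start + len; off += blockSize) {
>> >>> >>       int[] osds = lookupExtentOsds(file.getPath(), off);   // placeholder
>> >>> >>       String[] hosts = new String[osds.length];
>> >>> >>       for (int i = 0; i < osds.length; i++)
>> >>> >>         hosts[i] = osdToHostname(osds[i]);                  // placeholder
>> >>> >>       long length = Math.min(blockSize, file.getLen() - off);
>> >>> >>       // reusing hosts for the "names" field is good enough for scheduling
>> >>> >>       locs.add(new BlockLocation(hosts, hosts, off, length));
>> >>> >>     }
>> >>> >>     return locs.toArray(new BlockLocation[locs.size()]);
>> >>> >>   }
>> >>> >>
>> >>> >> For (2), the flat hierarchy is the easy path: with no topology script the
>> >>> >> JobTracker just puts every host under /default-rack, and host-level
>> >>> >> (data-local) matching still works.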
>> >>> >>
>> >>> >> On Mon, Jul 8, 2013 at 12:47 PM, ker can <kercan74@xxxxxxxxx>
>> >>> >> wrote:
>> >>> >> > Hi There,
>> >>> >> >
>> >>> >> > I'm test driving Hadoop with CephFS as the storage layer. I was running
>> >>> >> > the Terasort benchmark and I noticed a lot of network IO activity when
>> >>> >> > compared to an HDFS storage layer setup. (It's a half-a-terabyte sort
>> >>> >> > workload over two data nodes.)
>> >>> >> >
>> >>> >> > Digging into the job tracker logs a little, I noticed that all the map
>> >>> >> > tasks were being assigned to process a split (block) on non-local nodes
>> >>> >> > (which explains all the network activity during the map phase).
>> >>> >> >
>> >>> >> > With Ceph:
>> >>> >> >
>> >>> >> >
>> >>> >> > 2013-07-08 11:19:53,535 INFO org.apache.hadoop.mapred.JobInProgress: Input
>> >>> >> >   size for job job_201307081115_0001 = 500000000000. Number of splits = 7452
>> >>> >> > 2013-07-08 11:19:53,538 INFO org.apache.hadoop.mapred.JobInProgress: Job
>> >>> >> >   job_201307081115_0001 initialized successfully with 7452 map tasks and 32
>> >>> >> >   reduce tasks.
>> >>> >> >
>> >>> >> > 2013-07-08 11:19:54,836 INFO org.apache.hadoop.mapred.JobInProgress:
>> >>> >> >   Choosing a non-local task task_201307081115_0001_m_000000
>> >>> >> > 2013-07-08 11:19:54,836 INFO org.apache.hadoop.mapred.JobTracker: Adding
>> >>> >> >   task (MAP) 'attempt_201307081115_0001_m_000000_0' to tip
>> >>> >> >   task_201307081115_0001_m_000000, for tracker
>> >>> >> >   'tracker_vega7250:localhost/127.0.0.1:35422'
>> >>> >> >
>> >>> >> > 2013-07-08 11:19:54,990 INFO org.apache.hadoop.mapred.JobInProgress:
>> >>> >> >   Choosing a non-local task task_201307081115_0001_m_000001
>> >>> >> > 2013-07-08 11:19:54,990 INFO org.apache.hadoop.mapred.JobTracker: Adding
>> >>> >> >   task (MAP) 'attempt_201307081115_0001_m_000001_0' to tip
>> >>> >> >   task_201307081115_0001_m_000001, for tracker
>> >>> >> >   'tracker_vega7249:localhost/127.0.0.1:36725'
>> >>> >> >
>> >>> >> > ... and so on.
>> >>> >> >
>> >>> >> > In comparison, with HDFS the job tracker logs looked something like
>> >>> >> > this. The map tasks were being assigned to process data blocks on the
>> >>> >> > local nodes.
>> >>> >> >
>> >>> >> > 2013-07-08 03:55:32,656 INFO org.apache.hadoop.mapred.JobInProgress: Input
>> >>> >> >   size for job job_201307080351_0001 = 500000000000. Number of splits = 7452
>> >>> >> > 2013-07-08 03:55:32,657 INFO org.apache.hadoop.mapred.JobInProgress:
>> >>> >> >   tip:task_201307080351_0001_m_000000 has split on
>> >>> >> >   node:/default-rack/vega7247
>> >>> >> > 2013-07-08 03:55:32,657 INFO org.apache.hadoop.mapred.JobInProgress:
>> >>> >> >   tip:task_201307080351_0001_m_000001 has split on
>> >>> >> >   node:/default-rack/vega7247
>> >>> >> > 2013-07-08 03:55:34,474 INFO org.apache.hadoop.mapred.JobTracker: Adding
>> >>> >> >   task (MAP) 'attempt_201307080351_0001_m_000000_0' to tip
>> >>> >> >   task_201307080351_0001_m_000000, for tracker
>> >>> >> >   'tracker_vega7247:localhost/127.0.0.1:43320'
>> >>> >> > 2013-07-08 03:55:34,475 INFO org.apache.hadoop.mapred.JobInProgress:
>> >>> >> >   Choosing data-local task task_201307080351_0001_m_000000
>> >>> >> > 2013-07-08 03:55:34,475 INFO org.apache.hadoop.mapred.JobTracker: Adding
>> >>> >> >   task (MAP) 'attempt_201307080351_0001_m_000001_0' to tip
>> >>> >> >   task_201307080351_0001_m_000001, for tracker
>> >>> >> >   'tracker_vega7247:localhost/127.0.0.1:43320'
>> >>> >> > 2013-07-08 03:55:34,475 INFO org.apache.hadoop.mapred.JobInProgress:
>> >>> >> >   Choosing data-local task task_201307080351_0001_m_000001
>> >>> >> >
>> >>> >> > Version Info:
>> >>> >> > ceph version 0.61.4
>> >>> >> > hadoop 1.1.2
>> >>> >> >
>> >>> >> > Has anyone else run into this?
>> >>> >> >
>> >>> >> > Thanks
>> >>> >> > KC
>> >>> >> >
>> >>> >> >
>> >>> >
>> >>> >
>> >>
>> >>
>
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com