Awesome! Thanks for testing that out. I'll be merging that branch soon,
and will let you know when the new jar is published.

-Noah

On Tue, Jul 9, 2013 at 9:53 AM, ker can <kercan74@xxxxxxxxx> wrote:
> I applied the changes from the fs/ceph directory in the -topo branch
> and now it looks better - the map tasks are running on the same nodes
> as the splits they're processing. Good stuff!
>
>
> On Mon, Jul 8, 2013 at 9:18 PM, Noah Watkins <noah.watkins@xxxxxxxxxxx>
> wrote:
>>
>> You might want to create a new branch and cherry-pick the
>> topology-relevant commits (I think there are 1 or 2) from the -topo
>> branch into cephfs/branch-1.0. I'm not sure what the -topo branch
>> might be missing as far as bug fixes and such.
>>
>> On Mon, Jul 8, 2013 at 7:11 PM, ker can <kercan74@xxxxxxxxx> wrote:
>> > Yep, I'm running Cuttlefish ... I'll try building out of that
>> > branch and let you know how that goes.
>> >
>> > -KC
>> >
>> >
>> > On Mon, Jul 8, 2013 at 9:06 PM, Noah Watkins
>> > <noah.watkins@xxxxxxxxxxx> wrote:
>> >>
>> >> FYI, here is the patch as it currently stands:
>> >>
>> >> https://github.com/ceph/hadoop-common/compare/cephfs;branch-1.0...cephfs;branch-1.0-topo
>> >>
>> >> I have not tested it recently, but it looks like it should be
>> >> close to correct. Feel free to test it out; I won't be able to
>> >> get to it until tomorrow or Wednesday.
>> >>
>> >> Are you running Cuttlefish? I believe it has all the dependencies.
>> >>
>> >> On Mon, Jul 8, 2013 at 7:00 PM, Noah Watkins
>> >> <noah.watkins@xxxxxxxxxxx> wrote:
>> >> > KC,
>> >> >
>> >> > Thanks a lot for checking that out. I just went to investigate,
>> >> > and the work we have done on the locality/topology-aware
>> >> > features is sitting in a branch that has not been merged into
>> >> > the tree used to produce the JAR file you are using. I will get
>> >> > that cleaned up and merged soon, and I think that should solve
>> >> > your problems :)
>> >> >
>> >> > -Noah
>> >> >
>> >> > On Mon, Jul 8, 2013 at 6:22 PM, ker can <kercan74@xxxxxxxxx> wrote:
>> >> >> Hi Noah, okay, I think the current version may have a problem;
>> >> >> I haven't figured out where yet. I've been looking at the log
>> >> >> messages and at how the data blocks are distributed among the
>> >> >> OSDs.
>> >> >>
>> >> >> So, the job tracker log had, for example, this output for the
>> >> >> map task for the first split/block 0, which it's executing on
>> >> >> host vega7250:
>> >> >>
>> >> >> ....
>> >> >> 2013-07-08 11:19:54,836 INFO org.apache.hadoop.mapred.JobTracker:
>> >> >> Adding task (MAP) 'attempt_201307081115_0001_m_000000_0' to tip
>> >> >> task_201307081115_0001_m_000000, for tracker
>> >> >> 'tracker_vega7250:localhost/127.0.0.1:35422'
>> >> >> ....
>> >> >>
>> >> >> If I look at how the blocks are divided up among the OSDs,
>> >> >> block 0 for example is managed by OSD#2, which is running on
>> >> >> host vega7249. However, our map task for block 0 is running on
>> >> >> another host. Definitely not co-located.
>> >> >>
>> >> >> FILE OFFSET   OBJECT                 OFFSET   LENGTH     OSD
>> >> >> 0             10000000dbe.00000000   0        67108864   2
>> >> >> 67108864      10000000dbe.00000001   0        67108864   13
>> >> >> 134217728     10000000dbe.00000002   0        67108864   5
>> >> >> 201326592     10000000dbe.00000003   0        67108864   4
>> >> >> ....
>> >> >>
>> >> >> Ceph osd tree:
>> >> >>
>> >> >> # id   weight   type name        up/down   reweight
>> >> >> -1     14       root default
>> >> >> -3     14       rack unknownrack
>> >> >> -2     7        host vega7249
>> >> >> 0      1        osd.0            up        1
>> >> >> 1      1        osd.1            up        1
>> >> >> 2      1        osd.2            up        1
>> >> >> 3      1        osd.3            up        1
>> >> >> 4      1        osd.4            up        1
>> >> >> 5      1        osd.5            up        1
>> >> >> 6      1        osd.6            up        1
>> >> >> -4     7        host vega7250
>> >> >> 10     1        osd.10           up        1
>> >> >> 11     1        osd.11           up        1
>> >> >> 12     1        osd.12           up        1
>> >> >> 13     1        osd.13           up        1
>> >> >> 7      1        osd.7            up        1
>> >> >> 8      1        osd.8            up        1
>> >> >> 9      1        osd.9            up        1
>> >> >>
>> >> >> Thanks
>> >> >> KC
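
[A cross-check like the one above can also be done from the Hadoop side,
by asking the FileSystem directly what block locations it reports rather
than reading JobTracker logs. Below is a minimal sketch of such a checker;
the class name PrintBlockLocations is hypothetical, not a tool from this
thread. It assumes the CephFS jar and hadoop-core are on the classpath and
that fs.default.name in core-site.xml points at the Ceph filesystem, and
it prints the hosts reported for each block so they can be compared
against the extent/OSD table above.

    import java.util.Arrays;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PrintBlockLocations {
        public static void main(String[] args) throws Exception {
            // Reads core-site.xml from the classpath; fs.default.name
            // should be the ceph:// URI so the path resolves to
            // CephFileSystem.
            Configuration conf = new Configuration();
            Path path = new Path(args[0]);
            FileSystem fs = path.getFileSystem(conf);

            FileStatus status = fs.getFileStatus(path);
            BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());

            // One line per block: the offsets should line up with the
            // extent table, and the hosts are what the JobTracker sees.
            for (BlockLocation b : blocks) {
                System.out.println("offset=" + b.getOffset()
                    + " length=" + b.getLength()
                    + " hosts=" + Arrays.toString(b.getHosts()));
            }
            fs.close();
        }
    }

Run with the file path as the only argument (e.g. the Terasort input
file). If the hosts come back empty or wrong, the problem is in
getFileBlockLocations() itself rather than in the JobTracker's
scheduling.]
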
>> >> >>
>> >> >> On Mon, Jul 8, 2013 at 3:36 PM, Noah Watkins
>> >> >> <noah.watkins@xxxxxxxxxxx> wrote:
>> >> >>>
>> >> >>> Yes, all of the code needed to get the locality information
>> >> >>> should be present in the version of the jar file you
>> >> >>> referenced. We have tested to make sure the right data is
>> >> >>> available, but have not extensively tested that it is being
>> >> >>> used correctly by core Hadoop (e.g. that it is being correctly
>> >> >>> propagated out of CephFileSystem). IIRC fixing this /should/
>> >> >>> be pretty easy: just fiddling with getFileBlockLocations.
>> >> >>>
>> >> >>> On Mon, Jul 8, 2013 at 1:25 PM, ker can <kercan74@xxxxxxxxx> wrote:
>> >> >>> > Hi Noah,
>> >> >>> >
>> >> >>> > I'm using the CephFS jar from ...
>> >> >>> > http://ceph.com/download/hadoop-cephfs.jar
>> >> >>> > I believe this is built from hadoop-common/cephfs/branch-1.0?
>> >> >>> >
>> >> >>> > If that's the case, I should already be using an
>> >> >>> > implementation that's got getFileBlockLocations() ... which
>> >> >>> > is here:
>> >> >>> >
>> >> >>> > https://github.com/ceph/hadoop-common/blob/cephfs/branch-1.0/src/core/org/apache/hadoop/fs/ceph/CephFileSystem.java
>> >> >>> >
>> >> >>> > Is there a command line tool that I can use to verify the
>> >> >>> > results from getFileBlockLocations()?
>> >> >>> >
>> >> >>> > thanks
>> >> >>> > KC
>> >> >>> >
>> >> >>> >
>> >> >>> > On Mon, Jul 8, 2013 at 3:09 PM, Noah Watkins
>> >> >>> > <noah.watkins@xxxxxxxxxxx> wrote:
>> >> >>> >>
>> >> >>> >> Hi KC,
>> >> >>> >>
>> >> >>> >> The locality information is now collected and available to
>> >> >>> >> Hadoop through the CephFS API, so fixing this is certainly
>> >> >>> >> possible. However, there has not been extensive testing. I
>> >> >>> >> think the tasks that need to be completed are (1) make sure
>> >> >>> >> that `CephFileSystem` is encoding the correct block
>> >> >>> >> location in `getFileBlockLocations` (which I think is
>> >> >>> >> currently complete, but does need to be verified), and (2)
>> >> >>> >> make sure rack information is available in the jobtracker,
>> >> >>> >> or optionally use a flat hierarchy (i.e. /default-rack).
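
[Concretely, items (1) and (2) come down to the shape of the
BlockLocation objects the FileSystem hands back to Hadoop. The sketch
below is illustrative only, not the actual CephFileSystem code: the
class and helper names, the port number, and the assumption that the
extent-to-host resolution has already happened are all hypothetical. It
builds a location record for block 0 from the extent table, using the
flat /default-rack hierarchy from (2).

    import org.apache.hadoop.fs.BlockLocation;

    public class LocationSketch {
        // Hypothetical helper: build the location record for one
        // extent. 'hosts' would come from mapping the extent's OSDs to
        // their hosts (e.g. block 0 -> osd.2 -> vega7249, per the osd
        // tree earlier in the thread).
        static BlockLocation locationFor(long offset, long length,
                                         String[] hosts) {
            String[] names = new String[hosts.length]; // "host:port" ids
            String[] racks = new String[hosts.length]; // topology paths
            for (int i = 0; i < hosts.length; i++) {
                names[i] = hosts[i] + ":6789";          // port illustrative
                racks[i] = "/default-rack/" + hosts[i]; // flat hierarchy
            }
            return new BlockLocation(names, hosts, racks, offset, length);
        }

        public static void main(String[] args) {
            // Block 0: 64 MB extent served by osd.2 on host vega7249.
            BlockLocation block0 =
                locationFor(0L, 67108864L, new String[] { "vega7249" });
            System.out.println(block0); // prints offset, length, hosts
        }
    }

If the hosts array is right but tasks still schedule non-locally, the
topology paths, or the JobTracker's rack resolution, are the usual
suspect, which is what (2) is about.]
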
>> >> >>> >>
>> >> >>> >> On Mon, Jul 8, 2013 at 12:47 PM, ker can <kercan74@xxxxxxxxx>
>> >> >>> >> wrote:
>> >> >>> >> > Hi There,
>> >> >>> >> >
>> >> >>> >> > I'm test driving Hadoop with CephFS as the storage layer.
>> >> >>> >> > I was running the Terasort benchmark and I noticed a lot
>> >> >>> >> > of network IO activity when compared to an HDFS storage
>> >> >>> >> > layer setup. (It's a half-a-terabyte sort workload over
>> >> >>> >> > two data nodes.)
>> >> >>> >> >
>> >> >>> >> > Digging into the job tracker logs a little, I noticed
>> >> >>> >> > that all the map tasks were being assigned to process a
>> >> >>> >> > split (block) on non-local nodes, which explains all the
>> >> >>> >> > network activity during the map phase.
>> >> >>> >> >
>> >> >>> >> > With Ceph:
>> >> >>> >> >
>> >> >>> >> > 2013-07-08 11:19:53,535 INFO org.apache.hadoop.mapred.JobInProgress:
>> >> >>> >> > Input size for job job_201307081115_0001 = 500000000000.
>> >> >>> >> > Number of splits = 7452
>> >> >>> >> > 2013-07-08 11:19:53,538 INFO org.apache.hadoop.mapred.JobInProgress:
>> >> >>> >> > Job job_201307081115_0001 initialized successfully with
>> >> >>> >> > 7452 map tasks and 32 reduce tasks.
>> >> >>> >> >
>> >> >>> >> > 2013-07-08 11:19:54,836 INFO org.apache.hadoop.mapred.JobInProgress:
>> >> >>> >> > Choosing a non-local task task_201307081115_0001_m_000000
>> >> >>> >> > 2013-07-08 11:19:54,836 INFO org.apache.hadoop.mapred.JobTracker:
>> >> >>> >> > Adding task (MAP) 'attempt_201307081115_0001_m_000000_0' to tip
>> >> >>> >> > task_201307081115_0001_m_000000, for tracker
>> >> >>> >> > 'tracker_vega7250:localhost/127.0.0.1:35422'
>> >> >>> >> >
>> >> >>> >> > 2013-07-08 11:19:54,990 INFO org.apache.hadoop.mapred.JobInProgress:
>> >> >>> >> > Choosing a non-local task task_201307081115_0001_m_000001
>> >> >>> >> > 2013-07-08 11:19:54,990 INFO org.apache.hadoop.mapred.JobTracker:
>> >> >>> >> > Adding task (MAP) 'attempt_201307081115_0001_m_000001_0' to tip
>> >> >>> >> > task_201307081115_0001_m_000001, for tracker
>> >> >>> >> > 'tracker_vega7249:localhost/127.0.0.1:36725'
>> >> >>> >> >
>> >> >>> >> > ... and so on.
>> >> >>> >> >
>> >> >>> >> > In comparison with HDFS, the job tracker logs looked
>> >> >>> >> > something like this. The map tasks were being assigned to
>> >> >>> >> > process data blocks on the local nodes.
>> >> >>> >> >
>> >> >>> >> > 2013-07-08 03:55:32,656 INFO org.apache.hadoop.mapred.JobInProgress:
>> >> >>> >> > Input size for job job_201307080351_0001 = 500000000000.
>> >> >>> >> > Number of splits = 7452
>> >> >>> >> > 2013-07-08 03:55:32,657 INFO org.apache.hadoop.mapred.JobInProgress:
>> >> >>> >> > tip:task_201307080351_0001_m_000000 has split on
>> >> >>> >> > node:/default-rack/vega7247
>> >> >>> >> > 2013-07-08 03:55:32,657 INFO org.apache.hadoop.mapred.JobInProgress:
>> >> >>> >> > tip:task_201307080351_0001_m_000001 has split on
>> >> >>> >> > node:/default-rack/vega7247
>> >> >>> >> > 2013-07-08 03:55:34,474 INFO org.apache.hadoop.mapred.JobTracker:
>> >> >>> >> > Adding task (MAP) 'attempt_201307080351_0001_m_000000_0' to tip
>> >> >>> >> > task_201307080351_0001_m_000000, for tracker
>> >> >>> >> > 'tracker_vega7247:localhost/127.0.0.1:43320'
>> >> >>> >> > 2013-07-08 03:55:34,475 INFO org.apache.hadoop.mapred.JobInProgress:
>> >> >>> >> > Choosing data-local task task_201307080351_0001_m_000000
>> >> >>> >> > 2013-07-08 03:55:34,475 INFO org.apache.hadoop.mapred.JobTracker:
>> >> >>> >> > Adding task (MAP) 'attempt_201307080351_0001_m_000001_0' to tip
>> >> >>> >> > task_201307080351_0001_m_000001, for tracker
>> >> >>> >> > 'tracker_vega7247:localhost/127.0.0.1:43320'
>> >> >>> >> > 2013-07-08 03:55:34,475 INFO org.apache.hadoop.mapred.JobInProgress:
>> >> >>> >> > Choosing data-local task task_201307080351_0001_m_000001
>> >> >>> >> >
>> >> >>> >> > Version Info:
>> >> >>> >> > ceph version 0.61.4
>> >> >>> >> > hadoop 1.1.2
>> >> >>> >> >
>> >> >>> >> > Has anyone else run into this?
>> >> >>> >> >
>> >> >>> >> > Thanks
>> >> >>> >> > KC

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com