KC,

Thanks a lot for checking that out. I just went to investigate, and the work we have done on the locality/topology-aware features is sitting in a branch and has not been merged into the tree that is used to produce the JAR file you are using. I will get that cleaned up and merged soon, and I think that should solve your problem :)

-Noah

On Mon, Jul 8, 2013 at 6:22 PM, ker can <kercan74@xxxxxxxxx> wrote:
> Hi Noah,
>
> Okay, I think the current version may have a problem, though I haven't
> figured out where yet. It shows up when comparing the log messages against
> how the data blocks are distributed among the OSDs.
>
> The job tracker log had, for example, this output for the map task for the
> first split/block 0, which is executing on host vega7250:
>
> ....
> 2013-07-08 11:19:54,836 INFO org.apache.hadoop.mapred.JobTracker: Adding
> task (MAP) 'attempt_201307081115_0001_m_000000_0' to tip
> task_201307081115_0001_m_000000, for tracker
> 'tracker_vega7250:localhost/127.0.0.1:35422'
> ...
>
> If I look at how the blocks are divided up among the OSDs, block 0 for
> example is managed by OSD 2, which is running on host vega7249. However,
> our map task for block 0 is running on another host. Definitely not
> co-located.
>
> FILE OFFSET    OBJECT                  OFFSET    LENGTH      OSD
> 0              10000000dbe.00000000    0         67108864     2
> 67108864       10000000dbe.00000001    0         67108864    13
> 134217728      10000000dbe.00000002    0         67108864     5
> 201326592      10000000dbe.00000003    0         67108864     4
> ....
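[Editor's note: the mismatch above comes down to a simple check. The JobTracker only counts a map task as data-local when one of the hostnames reported for the split (ultimately from getFileBlockLocations()) matches the tasktracker's host; with block 0 reported on vega7249 but the task placed on vega7250, the check fails. A minimal, self-contained illustration of that comparison -- a hypothetical helper for clarity, not Hadoop's actual scheduling code:]

```java
import java.util.Arrays;

public class Locality {
    // True if the tracker's host appears among the hosts reported
    // for the split (i.e. the task would be "data-local").
    public static boolean isDataLocal(String trackerHost, String[] splitHosts) {
        return Arrays.asList(splitHosts).contains(trackerHost);
    }

    public static void main(String[] args) {
        // Hosts reported for block 0 (held by OSD 2 on vega7249).
        String[] block0Hosts = {"vega7249"};

        // Task placed on vega7250: no match, so a non-local task is chosen.
        System.out.println(isDataLocal("vega7250", block0Hosts)); // false

        // Task placed on vega7249: match, so a data-local task is chosen.
        System.out.println(isDataLocal("vega7249", block0Hosts)); // true
    }
}
```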
> Ceph osd tree:
>
> # id  weight  type name           up/down  reweight
> -1    14      root default
> -3    14        rack unknownrack
> -2    7           host vega7249
> 0     1             osd.0         up       1
> 1     1             osd.1         up       1
> 2     1             osd.2         up       1
> 3     1             osd.3         up       1
> 4     1             osd.4         up       1
> 5     1             osd.5         up       1
> 6     1             osd.6         up       1
> -4    7           host vega7250
> 10    1             osd.10        up       1
> 11    1             osd.11        up       1
> 12    1             osd.12        up       1
> 13    1             osd.13        up       1
> 7     1             osd.7         up       1
> 8     1             osd.8         up       1
> 9     1             osd.9         up       1
>
> Thanks,
> KC
>
> On Mon, Jul 8, 2013 at 3:36 PM, Noah Watkins <noah.watkins@xxxxxxxxxxx>
> wrote:
>>
>> Yes, all of the code needed to get the locality information should be
>> present in the version of the jar file you referenced. We have tested
>> to make sure the right data is available, but have not extensively
>> tested that it is being used correctly by core Hadoop (e.g. that it is
>> being correctly propagated out of CephFileSystem). IIRC, fixing this
>> /should/ be pretty easy; it is a matter of fiddling with
>> getFileBlockLocations().
>>
>> On Mon, Jul 8, 2013 at 1:25 PM, ker can <kercan74@xxxxxxxxx> wrote:
>> > Hi Noah,
>> >
>> > I'm using the CephFS jar from
>> > http://ceph.com/download/hadoop-cephfs.jar
>> > I believe this is built from hadoop-common/cephfs/branch-1.0?
>> >
>> > If that's the case, I should already be using an implementation that
>> > has getFileBlockLocations(), which is here:
>> > https://github.com/ceph/hadoop-common/blob/cephfs/branch-1.0/src/core/org/apache/hadoop/fs/ceph/CephFileSystem.java
>> >
>> > Is there a command-line tool that I can use to verify the results from
>> > getFileBlockLocations()?
>> >
>> > Thanks,
>> > KC
>> >
>> > On Mon, Jul 8, 2013 at 3:09 PM, Noah Watkins <noah.watkins@xxxxxxxxxxx>
>> > wrote:
>> >>
>> >> Hi KC,
>> >>
>> >> The locality information is now collected and available to Hadoop
>> >> through the CephFS API, so fixing this is certainly possible. However,
>> >> there has not been extensive testing.
>> >> I think the tasks that need to be completed are: (1) make sure that
>> >> `CephFileSystem` is encoding the correct block locations in
>> >> `getFileBlockLocations` (which I think is currently done, but does
>> >> need to be verified), and (2) make sure rack information is available
>> >> to the jobtracker, or optionally use a flat hierarchy (i.e.
>> >> default-rack).
>> >>
>> >> On Mon, Jul 8, 2013 at 12:47 PM, ker can <kercan74@xxxxxxxxx> wrote:
>> >> > Hi There,
>> >> >
>> >> > I'm test-driving Hadoop with CephFS as the storage layer. I was
>> >> > running the Terasort benchmark and I noticed a lot of network I/O
>> >> > activity compared to an HDFS storage-layer setup. (It's a
>> >> > half-a-terabyte sort workload over two data nodes.)
>> >> >
>> >> > Digging into the job tracker logs a little, I noticed that all the
>> >> > map tasks were being assigned to process a split (block) on
>> >> > non-local nodes, which explains all the network activity during the
>> >> > map phase.
>> >> >
>> >> > With Ceph:
>> >> >
>> >> > 2013-07-08 11:19:53,535 INFO org.apache.hadoop.mapred.JobInProgress: Input
>> >> > size for job job_201307081115_0001 = 500000000000. Number of splits = 7452
>> >> > 2013-07-08 11:19:53,538 INFO org.apache.hadoop.mapred.JobInProgress: Job
>> >> > job_201307081115_0001 initialized successfully with 7452 map tasks and
>> >> > 32 reduce tasks.
>> >> >
>> >> > 2013-07-08 11:19:54,836 INFO org.apache.hadoop.mapred.JobInProgress:
>> >> > Choosing a non-local task task_201307081115_0001_m_000000
>> >> > 2013-07-08 11:19:54,836 INFO org.apache.hadoop.mapred.JobTracker: Adding
>> >> > task (MAP) 'attempt_201307081115_0001_m_000000_0' to tip
>> >> > task_201307081115_0001_m_000000, for tracker
>> >> > 'tracker_vega7250:localhost/127.0.0.1:35422'
>> >> >
>> >> > 2013-07-08 11:19:54,990 INFO org.apache.hadoop.mapred.JobInProgress:
>> >> > Choosing a non-local task task_201307081115_0001_m_000001
>> >> > 2013-07-08 11:19:54,990 INFO org.apache.hadoop.mapred.JobTracker: Adding
>> >> > task (MAP) 'attempt_201307081115_0001_m_000001_0' to tip
>> >> > task_201307081115_0001_m_000001, for tracker
>> >> > 'tracker_vega7249:localhost/127.0.0.1:36725'
>> >> >
>> >> > ... and so on.
>> >> >
>> >> > In comparison, with HDFS the job tracker logs looked something like
>> >> > this. The map tasks were being assigned to process data blocks on the
>> >> > local nodes.
>> >> >
>> >> > 2013-07-08 03:55:32,656 INFO org.apache.hadoop.mapred.JobInProgress: Input
>> >> > size for job job_201307080351_0001 = 500000000000. Number of splits = 7452
>> >> > 2013-07-08 03:55:32,657 INFO org.apache.hadoop.mapred.JobInProgress:
>> >> > tip:task_201307080351_0001_m_000000 has split on
>> >> > node:/default-rack/vega7247
>> >> > 2013-07-08 03:55:32,657 INFO org.apache.hadoop.mapred.JobInProgress:
>> >> > tip:task_201307080351_0001_m_000001 has split on
>> >> > node:/default-rack/vega7247
>> >> > 2013-07-08 03:55:34,474 INFO org.apache.hadoop.mapred.JobTracker: Adding
>> >> > task (MAP) 'attempt_201307080351_0001_m_000000_0' to tip
>> >> > task_201307080351_0001_m_000000, for tracker
>> >> > 'tracker_vega7247:localhost/127.0.0.1:43320'
>> >> > 2013-07-08 03:55:34,475 INFO org.apache.hadoop.mapred.JobInProgress:
>> >> > Choosing data-local task task_201307080351_0001_m_000000
>> >> > 2013-07-08 03:55:34,475 INFO org.apache.hadoop.mapred.JobTracker: Adding
>> >> > task (MAP) 'attempt_201307080351_0001_m_000001_0' to tip
>> >> > task_201307080351_0001_m_000001, for tracker
>> >> > 'tracker_vega7247:localhost/127.0.0.1:43320'
>> >> > 2013-07-08 03:55:34,475 INFO org.apache.hadoop.mapred.JobInProgress:
>> >> > Choosing data-local task task_201307080351_0001_m_000001
>> >> >
>> >> > Version info:
>> >> > ceph version 0.61.4
>> >> > hadoop 1.1.2
>> >> >
>> >> > Has anyone else run into this?
>> >> >
>> >> > Thanks
>> >> > KC
>> >> >
>> >> > _______________________________________________
>> >> > ceph-users mailing list
>> >> > ceph-users@xxxxxxxxxxxxxx
>> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
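[Editor's note: KC asked above whether there is a command-line tool to verify the results of getFileBlockLocations(). Hadoop 1.x does not ship one for non-HDFS filesystems, but a small driver against the public FileSystem API will print what the filesystem reports, so the hosts can be compared against the `ceph osd tree` / file-map output. A sketch only, assuming the CephFS jar, libcephfs bindings, and a core-site.xml pointing at ceph:// are on the classpath; the class name is ours:]

```java
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PrintBlockLocations {
    public static void main(String[] args) throws Exception {
        // Picks up fs.default.name etc. from core-site.xml on the classpath.
        Configuration conf = new Configuration();
        Path path = new Path(args[0]);
        FileSystem fs = path.getFileSystem(conf);

        // Ask the filesystem for the locations of every block in the file;
        // for CephFS this exercises CephFileSystem.getFileBlockLocations().
        FileStatus stat = fs.getFileStatus(path);
        BlockLocation[] locs = fs.getFileBlockLocations(stat, 0, stat.getLen());

        for (BlockLocation loc : locs) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                loc.getOffset(), loc.getLength(),
                Arrays.toString(loc.getHosts()));
        }
        fs.close();
    }
}
```

If locality is wired up correctly, the hosts printed for each offset should match the hosts of the OSDs shown in the file-map table earlier in the thread.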