I applied the changes from the fs/ceph directory in the -topo branch and now it looks better - the map tasks are running on the same nodes as the splits they're processing. Good stuff!
On Mon, Jul 8, 2013 at 9:18 PM, Noah Watkins <noah.watkins@xxxxxxxxxxx> wrote:
You might want to create a new branch and cherry-pick the topology-relevant
commits (I think there are 1 or 2) from the -topo branch into
cephfs/branch-1.0. I'm not sure what the -topo branch might be
missing as far as bug fixes and such.
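
Something along these lines should do it (untested; assumes the
ceph/hadoop-common repo is your 'origin' remote, and the SHAs are whatever
git log shows for the topology changes):

  git checkout -b branch-1.0-plus-topo origin/cephfs/branch-1.0
  git log --oneline origin/cephfs/branch-1.0..origin/cephfs/branch-1.0-topo   # list the topology commits
  git cherry-pick <sha>                                                       # repeat for each one
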
On Mon, Jul 8, 2013 at 7:11 PM, ker can <kercan74@xxxxxxxxx> wrote:
> Yep, I'm running cuttlefish ... I'll try building out of that branch and let
> you know how that goes.
>
> -KC
>
>
> On Mon, Jul 8, 2013 at 9:06 PM, Noah Watkins <noah.watkins@xxxxxxxxxxx>
> wrote:
>>
>> FYI, here is the patch as it currently stands:
>>
>>
>> https://github.com/ceph/hadoop-common/compare/cephfs;branch-1.0...cephfs;branch-1.0-topo
>>
>> I have not tested it recently, but it looks like it should be close to
>> correct. Feel free to test it out; I won't be able to get to it until
>> tomorrow or Wednesday.
>>
>> Are you running Cuttlefish? I believe it has all the dependencies.
>>
>> On Mon, Jul 8, 2013 at 7:00 PM, Noah Watkins <noah.watkins@xxxxxxxxxxx>
>> wrote:
>> > KC,
>> >
>> > Thanks a lot for checking that out. I just went to investigate, and
>> > the work we have done on the locality/topology-aware features is
>> > sitting in a branch and has not been merged into the tree that is
>> > used to produce the JAR file you are using. I will get that cleaned up
>> > and merged soon, and I think that should solve your problems :)
>> >
>> > -Noah
>> >
>> > On Mon, Jul 8, 2013 at 6:22 PM, ker can <kercan74@xxxxxxxxx> wrote:
>> >> Hi Noah, okay, I think the current version may have a problem; I haven't
>> >> figured out where yet. I'm looking at the log messages and at how the data
>> >> blocks are distributed among the OSDs.
>> >>
>> >> So the job tracker log had, for example, this output for the map task for
>> >> the first split/block 0, which is executing on host vega7250:
>> >>
>> >> ....
>> >> 2013-07-08 11:19:54,836 INFO org.apache.hadoop.mapred.JobTracker: Adding
>> >>   task (MAP) 'attempt_201307081115_0001_m_000000_0' to tip
>> >>   task_201307081115_0001_m_000000, for tracker
>> >>   'tracker_vega7250:localhost/127.0.0.1:35422'
>> >> ...
>> >>
>> >> If I look at how the blocks are divided up among the OSDs, block 0, for
>> >> example, is managed by OSD#2, which is running on host vega7249. However,
>> >> our map task for block 0 is running on another host. Definitely not
>> >> co-located.
>> >>
>> >>
>> >>
>> >> FILE OFFSET    OBJECT                  OFFSET    LENGTH      OSD
>> >> 0              10000000dbe.00000000    0         67108864    2
>> >> 67108864       10000000dbe.00000001    0         67108864    13
>> >> 134217728      10000000dbe.00000002    0         67108864    5
>> >> 201326592      10000000dbe.00000003    0         67108864    4
>> >> ....
>> >>
>> >> Ceph osd tree:
>> >>
>> >> # id    weight  type name           up/down reweight
>> >> -1      14      root default
>> >> -3      14          rack unknownrack
>> >> -2      7               host vega7249
>> >> 0       1                   osd.0   up      1
>> >> 1       1                   osd.1   up      1
>> >> 2       1                   osd.2   up      1
>> >> 3       1                   osd.3   up      1
>> >> 4       1                   osd.4   up      1
>> >> 5       1                   osd.5   up      1
>> >> 6       1                   osd.6   up      1
>> >> -4      7               host vega7250
>> >> 10      1                   osd.10  up      1
>> >> 11      1                   osd.11  up      1
>> >> 12      1                   osd.12  up      1
>> >> 13      1                   osd.13  up      1
>> >> 7       1                   osd.7   up      1
>> >> 8       1                   osd.8   up      1
>> >> 9       1                   osd.9   up      1
>> >>
>> >>
>> >> Thanks
>> >> KC
>> >>
>> >>
>> >> On Mon, Jul 8, 2013 at 3:36 PM, Noah Watkins <noah.watkins@xxxxxxxxxxx>
>> >> wrote:
>> >>>
>> >>> Yes, all of the code needed to get the locality information should be
>> >>> present in the version of the jar file you referenced. We have tested a
>> >>> bit to make sure the right data is available, but have not extensively
>> >>> tested that it is being used correctly by core Hadoop (e.g. that it is
>> >>> being correctly propagated out of CephFileSystem). IIRC fixing this
>> >>> /should/ be pretty easy; mostly fiddling with getFileBlockLocations.
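>> >>>
>> >>> I don't know of a command line tool for this off the top of my head, but
>> >>> one quick way to see what Hadoop is actually getting back is a tiny driver
>> >>> that just dumps the locations. Something roughly like this (untested
>> >>> sketch, class name is made up; it assumes the same core-site.xml your jobs
>> >>> use so that the ceph:// scheme resolves):
>> >>>
>> >>>   import java.util.Arrays;
>> >>>   import org.apache.hadoop.conf.Configuration;
>> >>>   import org.apache.hadoop.fs.BlockLocation;
>> >>>   import org.apache.hadoop.fs.FileStatus;
>> >>>   import org.apache.hadoop.fs.FileSystem;
>> >>>   import org.apache.hadoop.fs.Path;
>> >>>
>> >>>   public class DumpLocations {
>> >>>     public static void main(String[] args) throws Exception {
>> >>>       // args[0] is a ceph:// path, e.g. one of the terasort input files
>> >>>       Path path = new Path(args[0]);
>> >>>       FileSystem fs = FileSystem.get(path.toUri(), new Configuration());
>> >>>       FileStatus st = fs.getFileStatus(path);
>> >>>       // ask the FileSystem (CephFileSystem here) where every block lives
>> >>>       for (BlockLocation loc : fs.getFileBlockLocations(st, 0, st.getLen())) {
>> >>>         System.out.println(loc.getOffset() + " len=" + loc.getLength()
>> >>>             + " hosts=" + Arrays.toString(loc.getHosts()));
>> >>>       }
>> >>>     }
>> >>>   }
>> >>>
>> >>> Run it with the hadoop launcher against the same config and jars. If the
>> >>> hosts it prints are empty or don't match where the objects actually live,
>> >>> the problem is on our side in CephFileSystem rather than in core Hadoop.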
>> >>>
>> >>> On Mon, Jul 8, 2013 at 1:25 PM, ker can <kercan74@xxxxxxxxx> wrote:
>> >>> > Hi Noah,
>> >>> >
>> >>> > I'm using the CephFS jar from ...
>> >>> > http://ceph.com/download/hadoop-cephfs.jar
>> >>> > I believe this is built from hadoop-common/cephfs/branch-1.0?
>> >>> >
>> >>> > If that's the case, I should already be using an implementation that's
>> >>> > got getFileBlockLocations() ... which is here:
>> >>> >
>> >>> >
>> >>> > https://github.com/ceph/hadoop-common/blob/cephfs/branch-1.0/src/core/org/apache/hadoop/fs/ceph/CephFileSystem.java
>> >>> >
>> >>> > Is there a command line tool that I can use to verify the results from
>> >>> > getFileBlockLocations()?
>> >>> >
>> >>> > thanks
>> >>> > KC
>> >>> >
>> >>> >
>> >>> >
>> >>> > On Mon, Jul 8, 2013 at 3:09 PM, Noah Watkins
>> >>> > <noah.watkins@xxxxxxxxxxx>
>> >>> > wrote:
>> >>> >>
>> >>> >> Hi KC,
>> >>> >>
>> >>> >> The locality information is now collected and available to Hadoop
>> >>> >> through the CephFS API, so fixing this is certainly possible. However,
>> >>> >> there has not been extensive testing. I think the tasks that need to be
>> >>> >> completed are (1) make sure that `CephFileSystem` is encoding the correct
>> >>> >> block locations in `getFileBlockLocations` (which I think it currently
>> >>> >> does, but that needs to be verified), and (2) make sure rack information
>> >>> >> is available in the jobtracker, or optionally use a flat hierarchy (i.e.
>> >>> >> default-rack).
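>> >>> >>
>> >>> >> For (1), the end result should be roughly one BlockLocation per object,
>> >>> >> carrying the hostnames of the OSDs that hold that object, so the
>> >>> >> JobTracker can match splits to tasktrackers by hostname. A very rough
>> >>> >> sketch of the shape (not the actual code in the branch; lookupExtentOsds
>> >>> >> and osdToHostname are placeholders for whatever the CephFS bindings
>> >>> >> expose, not real method names):
>> >>> >>
>> >>> >>   public BlockLocation[] getFileBlockLocations(FileStatus file,
>> >>> >>       long start, long len) throws IOException {
>> >>> >>     long blockSize = file.getBlockSize();  // object size for this file
>> >>> >>     List<BlockLocation> locs = new ArrayList<BlockLocation>();
>> >>> >>     // walk the requested byte range one object at a time
>> >>> >>     for (long off = (start / blockSize) * blockSize;
>> >>> >>          off < start + len; off += blockSize) {
>> >>> >>       int[] osds = lookupExtentOsds(file.getPath(), off);   // placeholder
>> >>> >>       String[] hosts = new String[osds.length];
>> >>> >>       for (int i = 0; i < osds.length; i++)
>> >>> >>         hosts[i] = osdToHostname(osds[i]);                  // placeholder
>> >>> >>       long length = Math.min(blockSize, file.getLen() - off);
>> >>> >>       // reusing hosts for the "names" field is good enough for scheduling
>> >>> >>       locs.add(new BlockLocation(hosts, hosts, off, length));
>> >>> >>     }
>> >>> >>     return locs.toArray(new BlockLocation[locs.size()]);
>> >>> >>   }
>> >>> >>
>> >>> >> For (2), the flat hierarchy is the easy path: with no topology script the
>> >>> >> JobTracker just puts every host under /default-rack, and host-level
>> >>> >> (data-local) matching still works.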
>> >>> >>
>> >>> >> On Mon, Jul 8, 2013 at 12:47 PM, ker can <kercan74@xxxxxxxxx>
>> >>> >> wrote:
>> >>> >> > Hi There,
>> >>> >> >
>> >>> >> > I'm test driving Hadoop with CephFS as the storage layer. I was running
>> >>> >> > the Terasort benchmark and I noticed a lot of network IO activity when
>> >>> >> > compared to an HDFS storage layer setup. (It's a half-a-terabyte sort
>> >>> >> > workload over two data nodes.)
>> >>> >> >
>> >>> >> > Digging into the job tracker logs a little, I noticed that all the map
>> >>> >> > tasks were being assigned to process a split (block) on non-local nodes
>> >>> >> > (which explains all the network activity during the map phase).
>> >>> >> >
>> >>> >> > With Ceph:
>> >>> >> >
>> >>> >> >
>> >>> >> > 2013-07-08 11:19:53,535 INFO org.apache.hadoop.mapred.JobInProgress: Input
>> >>> >> >   size for job job_201307081115_0001 = 500000000000. Number of splits = 7452
>> >>> >> > 2013-07-08 11:19:53,538 INFO org.apache.hadoop.mapred.JobInProgress: Job
>> >>> >> >   job_201307081115_0001 initialized successfully with 7452 map tasks and 32
>> >>> >> >   reduce tasks.
>> >>> >> >
>> >>> >> > 2013-07-08 11:19:54,836 INFO org.apache.hadoop.mapred.JobInProgress:
>> >>> >> >   Choosing a non-local task task_201307081115_0001_m_000000
>> >>> >> > 2013-07-08 11:19:54,836 INFO org.apache.hadoop.mapred.JobTracker: Adding
>> >>> >> >   task (MAP) 'attempt_201307081115_0001_m_000000_0' to tip
>> >>> >> >   task_201307081115_0001_m_000000, for tracker
>> >>> >> >   'tracker_vega7250:localhost/127.0.0.1:35422'
>> >>> >> >
>> >>> >> > 2013-07-08 11:19:54,990 INFO org.apache.hadoop.mapred.JobInProgress:
>> >>> >> >   Choosing a non-local task task_201307081115_0001_m_000001
>> >>> >> > 2013-07-08 11:19:54,990 INFO org.apache.hadoop.mapred.JobTracker: Adding
>> >>> >> >   task (MAP) 'attempt_201307081115_0001_m_000001_0' to tip
>> >>> >> >   task_201307081115_0001_m_000001, for tracker
>> >>> >> >   'tracker_vega7249:localhost/127.0.0.1:36725'
>> >>> >> >
>> >>> >> > ... and so on.
>> >>> >> >
>> >>> >> > In comparison, with HDFS the job tracker logs looked something like
>> >>> >> > this. The map tasks were being assigned to process data blocks on the
>> >>> >> > local nodes.
>> >>> >> >
>> >>> >> > 2013-07-08 03:55:32,656 INFO org.apache.hadoop.mapred.JobInProgress: Input
>> >>> >> >   size for job job_201307080351_0001 = 500000000000. Number of splits = 7452
>> >>> >> > 2013-07-08 03:55:32,657 INFO org.apache.hadoop.mapred.JobInProgress:
>> >>> >> >   tip:task_201307080351_0001_m_000000 has split on
>> >>> >> >   node:/default-rack/vega7247
>> >>> >> > 2013-07-08 03:55:32,657 INFO org.apache.hadoop.mapred.JobInProgress:
>> >>> >> >   tip:task_201307080351_0001_m_000001 has split on
>> >>> >> >   node:/default-rack/vega7247
>> >>> >> > 2013-07-08 03:55:34,474 INFO org.apache.hadoop.mapred.JobTracker: Adding
>> >>> >> >   task (MAP) 'attempt_201307080351_0001_m_000000_0' to tip
>> >>> >> >   task_201307080351_0001_m_000000, for tracker
>> >>> >> >   'tracker_vega7247:localhost/127.0.0.1:43320'
>> >>> >> > 2013-07-08 03:55:34,475 INFO org.apache.hadoop.mapred.JobInProgress:
>> >>> >> >   Choosing data-local task task_201307080351_0001_m_000000
>> >>> >> > 2013-07-08 03:55:34,475 INFO org.apache.hadoop.mapred.JobTracker: Adding
>> >>> >> >   task (MAP) 'attempt_201307080351_0001_m_000001_0' to tip
>> >>> >> >   task_201307080351_0001_m_000001, for tracker
>> >>> >> >   'tracker_vega7247:localhost/127.0.0.1:43320'
>> >>> >> > 2013-07-08 03:55:34,475 INFO org.apache.hadoop.mapred.JobInProgress:
>> >>> >> >   Choosing data-local task task_201307080351_0001_m_000001
>> >>> >> >
>> >>> >> > Version Info:
>> >>> >> > ceph version 0.61.4
>> >>> >> > hadoop 1.1.2
>> >>> >> >
>> >>> >> > Has anyone else run into this?
>> >>> >> >
>> >>> >> > Thanks
>> >>> >> > KC
>> >>> >> >
>> >>> >> >
>> >>> >
>> >>> >
>> >>
>> >>
>
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com