Re: Hadoop / Ceph and Data locality ?

Yep, I'm running cuttlefish ... I'll try building out of that branch and let you know how that goes.

-KC


On Mon, Jul 8, 2013 at 9:06 PM, Noah Watkins <noah.watkins@xxxxxxxxxxx> wrote:
FYI, here is the patch as it currently stands:

https://github.com/ceph/hadoop-common/compare/cephfs;branch-1.0...cephfs;branch-1.0-topo

I have not tested it recently, but it looks like it should be close to
correct. Feel free to test it out--I won't be able to get to it until
tomorrow or Wednesday.

Are you running Cuttlefish? I believe it has all the dependencies.

On Mon, Jul 8, 2013 at 7:00 PM, Noah Watkins <noah.watkins@xxxxxxxxxxx> wrote:
> KC,
>
> Thanks a lot for checking that out. I just went to investigate, and
> the work we have done on the locality/topology-aware features are
> sitting in a branch, and have not been merged into the tree that is
> used to produce the JAR file you are using. I will get that cleaned up
> and merged soon, and I think that should solve your problems :)
>
> -Noah
>
> On Mon, Jul 8, 2013 at 6:22 PM, ker can <kercan74@xxxxxxxxx> wrote:
>> hi Noah, okay, I think the current version may have a problem; I haven't
>> figured out where yet. I'm looking at the log messages and at how the data
>> blocks are distributed among the OSDs.
>>
>> So, for example, the job tracker log had this output for the map task for
>> the first split/block 0, which is executing on host vega7250:
>>
>> ....
>>
>> ....
>>
>> 2013-07-08 11:19:54,836 INFO org.apache.hadoop.mapred.JobTracker: Adding
>> task (MAP) 'attempt_201307081115_0001_m_000000_0' to tip
>> task_201307081115_0001_m_000000, for tracker
>> 'tracker_vega7250:localhost/127.0.0.1:35422'
>>
>> ...
>>
>> ...
>>
>>
>> If I look at how the blocks are divided up among the OSDs, block 0, for
>> example, is managed by OSD#2, which is running on host vega7249. However,
>> our map task for block 0 is running on another host. Definitely not co-located.
>>
>>
>>
>> FILE OFFSET     OBJECT                   OFFSET      LENGTH    OSD
>>           0     10000000dbe.00000000          0    67108864      2
>>    67108864     10000000dbe.00000001          0    67108864     13
>>   134217728     10000000dbe.00000002          0    67108864      5
>>   201326592     10000000dbe.00000003          0    67108864      4
>> ….
>> ….
>>
>>
>>
>> Ceph osd tree:
>>
>> # id    weight  type name       up/down reweight
>> -1      14      root default
>> -3      14              rack unknownrack
>> -2      7                       host vega7249
>> 0       1                               osd.0   up      1
>> 1       1                               osd.1   up      1
>> 2       1                               osd.2   up      1
>> 3       1                               osd.3   up      1
>> 4       1                               osd.4   up      1
>> 5       1                               osd.5   up      1
>> 6       1                               osd.6   up      1
>> -4      7                       host vega7250
>> 10      1                               osd.10  up      1
>> 11      1                               osd.11  up      1
>> 12      1                               osd.12  up      1
>> 13      1                               osd.13  up      1
>> 7       1                               osd.7   up      1
>> 8       1                               osd.8   up      1
>> 9       1                               osd.9   up      1
>>
>> Thanks
>> KC
>>
>>
>> On Mon, Jul 8, 2013 at 3:36 PM, Noah Watkins <noah.watkins@xxxxxxxxxxx>
>> wrote:
>>>
>>> Yes, all of the code needed to get the locality information should be
>>> present in the version of the jar file you referenced. We have tested
>>> to make sure the right data is available, but have not extensively
>>> tested that it is being used correctly by core Hadoop (e.g. that it is
>>> being correctly propagated out of CephFileSystem). IIRC fixing this
>>> /should/ be pretty easy; mostly fiddling with getFileBlockLocations.
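
The relevant contract: Hadoop 1.x expects getFileBlockLocations() to return
BlockLocation objects whose host arrays name the machines actually holding each
block, and the JobTracker matches those names against its task trackers. A rough
sketch of what that could look like in CephFileSystem; the Hadoop calls
(FileStatus, BlockLocation) are the real 1.x API, but lookupHostsForExtent() is
a hypothetical stand-in for whatever the libcephfs binding exposes for mapping a
file extent to OSD hostnames:

// Rough sketch of a topology-aware getFileBlockLocations() for CephFileSystem.
// lookupHostsForExtent() is hypothetical -- it stands in for the CephFS binding
// call that maps a file extent to the hostnames of the OSDs holding it.
@Override
public BlockLocation[] getFileBlockLocations(FileStatus file, long start, long len)
    throws IOException {
  if (file == null)
    return null;
  long blockSize = file.getBlockSize();
  int numBlocks = (int) ((len + blockSize - 1) / blockSize);
  BlockLocation[] locations = new BlockLocation[numBlocks];
  for (int i = 0; i < numBlocks; i++) {
    long offset = start + i * blockSize;
    long length = Math.min(blockSize, start + len - offset);
    String[] hosts = lookupHostsForExtent(file.getPath(), offset); // hypothetical helper
    locations[i] = new BlockLocation(hosts, hosts, offset, length);
  }
  return locations;
}

If the host array comes back empty or only contains localhost (the default
FileSystem behavior), the JobTracker never finds a data-local match, which is
consistent with the "Choosing a non-local task" lines in KC's job tracker log.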
>>>
>>> On Mon, Jul 8, 2013 at 1:25 PM, ker can <kercan74@xxxxxxxxx> wrote:
>>> > Hi Noah,
>>> >
>>> > I'm using the CephFS jar from ...
>>> > http://ceph.com/download/hadoop-cephfs.jar
>>> > I believe this is built from hadoop-common/cephfs/branch-1.0 ?
>>> >
>>> > If that's the case, I should already be using an implementation that's
>>> > got getFileBlockLocations() ... which is here
>>> >
>>> > https://github.com/ceph/hadoop-common/blob/cephfs/branch-1.0/src/core/org/apache/hadoop/fs/ceph/CephFileSystem.java
>>> >
>>> > Is there a command line tool that I can use to verify the results from
>>> > getFileBlockLocations() ?
>>> >
>>> > thanks
>>> > KC
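
There is no filesystem-agnostic command-line tool for this in Hadoop 1.x (hadoop
fsck only talks to HDFS), but a few lines of client code will print whatever the
configured FileSystem reports. A minimal sketch, assuming the CephFS jar and a
core-site.xml pointing at the Ceph cluster are on the classpath; the class name
is made up:

// Minimal verification tool (hypothetical, not part of the CephFS bindings):
// prints the block locations Hadoop sees for a file via the public
// FileSystem.getFileBlockLocations() API.
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PrintBlockLocations {
  public static void main(String[] args) throws Exception {
    Path path = new Path(args[0]);
    FileSystem fs = FileSystem.get(new Configuration()); // picks up fs.default.name
    FileStatus stat = fs.getFileStatus(path);
    BlockLocation[] locs = fs.getFileBlockLocations(stat, 0, stat.getLen());
    for (BlockLocation loc : locs) {
      System.out.println("offset=" + loc.getOffset()
          + " len=" + loc.getLength()
          + " hosts=" + Arrays.toString(loc.getHosts()));
    }
  }
}

Run it with something like `hadoop PrintBlockLocations <path>`; for block 0 of
the file in the extent table above, the printed hosts should include vega7249
(the host carrying osd.2) if locality is being reported correctly.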
>>> >
>>> >
>>> >
>>> > On Mon, Jul 8, 2013 at 3:09 PM, Noah Watkins <noah.watkins@xxxxxxxxxxx>
>>> > wrote:
>>> >>
>>> >> Hi KC,
>>> >>
>>> >> The locality information is now collected and available to Hadoop
>>> >> through the CephFS API, so fixing this is certainly possible. However,
>>> >> there has not been extensive testing. I think the tasks that need to
>>> >> be completed are (1) make sure that `CephFileSystem` is encoding the
>>> >> correct block location in `getFileBlockLocations` (which I think it
>>> >> currently does, but this does need to be verified), and (2) make sure
>>> >> rack information is available in the jobtracker, or optionally use a
>>> >> flat hierarchy (i.e. default-rack).
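
On (2): Hadoop 1.x resolves hostnames to racks through either a script named by
topology.script.file.name or a Java class named by
topology.node.switch.mapping.impl; if nothing is configured, the stock
ScriptBasedMapping should already place every host in /default-rack. A minimal
flat mapping, useful mainly as a template for a real rack map (the class name
here is made up):

// Flat rack mapping sketch for Hadoop 1.x: every host resolves to /default-rack.
// Point topology.node.switch.mapping.impl at this class in the JobTracker's
// core-site.xml to use it.
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.net.DNSToSwitchMapping;
import org.apache.hadoop.net.NetworkTopology;

public class FlatRackMapping implements DNSToSwitchMapping {
  @Override
  public List<String> resolve(List<String> names) {
    List<String> racks = new ArrayList<String>(names.size());
    for (int i = 0; i < names.size(); i++) {
      racks.add(NetworkTopology.DEFAULT_RACK); // "/default-rack"
    }
    return racks;
  }
}

With a flat hierarchy like this, locality still works at the host level as long
as the hostnames returned by getFileBlockLocations match the task tracker names.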
>>> >>
>>> >> On Mon, Jul 8, 2013 at 12:47 PM, ker can <kercan74@xxxxxxxxx> wrote:
>>> >> > Hi There,
>>> >> >
>>> >> > I'm test driving Hadoop with CephFS as the storage layer. I was running
>>> >> > the Terasort benchmark and noticed a lot of network IO activity compared
>>> >> > to an HDFS storage layer setup. (It's a half-terabyte sort workload over
>>> >> > two data nodes.)
>>> >> >
>>> >> > Digging into the job tracker logs a little, I noticed that all the map
>>> >> > tasks were being assigned to process a split (block) on non-local nodes,
>>> >> > which explains all the network activity during the map phase.
>>> >> >
>>> >> > With Ceph:
>>> >> >
>>> >> >
>>> >> > 2013-07-08 11:19:53,535 INFO org.apache.hadoop.mapred.JobInProgress:
>>> >> > Input size for job job_201307081115_0001 = 500000000000. Number of splits = 7452
>>> >> > 2013-07-08 11:19:53,538 INFO org.apache.hadoop.mapred.JobInProgress:
>>> >> > Job job_201307081115_0001 initialized successfully with 7452 map tasks and 32 reduce tasks.
>>> >> >
>>> >> > 2013-07-08 11:19:54,836 INFO org.apache.hadoop.mapred.JobInProgress:
>>> >> > Choosing a non-local task task_201307081115_0001_m_000000
>>> >> > 2013-07-08 11:19:54,836 INFO org.apache.hadoop.mapred.JobTracker:
>>> >> > Adding task (MAP) 'attempt_201307081115_0001_m_000000_0' to tip
>>> >> > task_201307081115_0001_m_000000, for tracker 'tracker_vega7250:localhost/127.0.0.1:35422'
>>> >> >
>>> >> > 2013-07-08 11:19:54,990 INFO org.apache.hadoop.mapred.JobInProgress:
>>> >> > Choosing a non-local task task_201307081115_0001_m_000001
>>> >> > 2013-07-08 11:19:54,990 INFO org.apache.hadoop.mapred.JobTracker:
>>> >> > Adding task (MAP) 'attempt_201307081115_0001_m_000001_0' to tip
>>> >> > task_201307081115_0001_m_000001, for tracker 'tracker_vega7249:localhost/127.0.0.1:36725'
>>> >> >
>>> >> > ... and so on.
>>> >> >
>>> >> > In comparison, with HDFS the job tracker logs looked something like
>>> >> > this: the map tasks were being assigned to process data blocks on the
>>> >> > local nodes.
>>> >> >
>>> >> > 2013-07-08 03:55:32,656 INFO org.apache.hadoop.mapred.JobInProgress:
>>> >> > Input size for job job_201307080351_0001 = 500000000000. Number of splits = 7452
>>> >> > 2013-07-08 03:55:32,657 INFO org.apache.hadoop.mapred.JobInProgress:
>>> >> > tip:task_201307080351_0001_m_000000 has split on node:/default-rack/vega7247
>>> >> > 2013-07-08 03:55:32,657 INFO org.apache.hadoop.mapred.JobInProgress:
>>> >> > tip:task_201307080351_0001_m_000001 has split on node:/default-rack/vega7247
>>> >> > 2013-07-08 03:55:34,474 INFO org.apache.hadoop.mapred.JobTracker:
>>> >> > Adding task (MAP) 'attempt_201307080351_0001_m_000000_0' to tip
>>> >> > task_201307080351_0001_m_000000, for tracker 'tracker_vega7247:localhost/127.0.0.1:43320'
>>> >> > 2013-07-08 03:55:34,475 INFO org.apache.hadoop.mapred.JobInProgress:
>>> >> > Choosing data-local task task_201307080351_0001_m_000000
>>> >> > 2013-07-08 03:55:34,475 INFO org.apache.hadoop.mapred.JobTracker:
>>> >> > Adding task (MAP) 'attempt_201307080351_0001_m_000001_0' to tip
>>> >> > task_201307080351_0001_m_000001, for tracker 'tracker_vega7247:localhost/127.0.0.1:43320'
>>> >> > 2013-07-08 03:55:34,475 INFO org.apache.hadoop.mapred.JobInProgress:
>>> >> > Choosing data-local task task_201307080351_0001_m_000001
>>> >> >
>>> >> > Version Info:
>>> >> > ceph version 0.61.4
>>> >> > hadoop 1.1.2
>>> >> >
>>> >> > Has anyone else run into this?
>>> >> >
>>> >> > Thanks
>>> >> > KC
>>> >> >
>>> >> >
>>> >
>>> >
>>
>>

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
