So, for example, the job tracker log had this output for the map task for the first split/block 0, which is executing on host vega7250.
....
....
2013-07-08 11:19:54,836 INFO org.apache.hadoop.mapred.JobTracker: Adding task (MAP) 'attempt_201307081115_0001_m_000000_0' to tip task_201307081115_0001_m_000000, for tracker 'tracker_vega7250:localhost/127.0.0.1:35422'
...
...
If I look at how the blocks are divided up among the OSDs, block 0, for example, is managed by osd.2, which is running on host vega7249. However, our map task for block 0 is running on another host. Definitely not co-located.
FILE OFFSET    OBJECT                  OFFSET    LENGTH      OSD
0              10000000dbe.00000000    0         67108864    2
67108864       10000000dbe.00000001    0         67108864    13
134217728      10000000dbe.00000002    0         67108864    5
201326592      10000000dbe.00000003    0         67108864    4
….
….
Ceph osd tree:
# id    weight  type name               up/down reweight
-1      14      root default
-3      14          rack unknownrack
-2      7               host vega7249
0       1                   osd.0       up      1
1       1                   osd.1       up      1
2       1                   osd.2       up      1
3       1                   osd.3       up      1
4       1                   osd.4       up      1
5       1                   osd.5       up      1
6       1                   osd.6       up      1
-4      7               host vega7250
10      1                   osd.10      up      1
11      1                   osd.11      up      1
12      1                   osd.12      up      1
13      1                   osd.13      up      1
7       1                   osd.7       up      1
8       1                   osd.8       up      1
9       1                   osd.9       up      1
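
For reference, here's a minimal sketch of a little driver that just prints
whatever getFileBlockLocations() returns for a file, so its output can be
compared against the object/OSD map above. The class name is made up; it
assumes the Ceph settings are already in core-site.xml and that it is run
via bin/hadoop so that config is on the classpath:

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PrintBlockLocations {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml from the classpath, so fs.default.name
        // (and the ceph settings) should already be in place.
        Configuration conf = new Configuration();
        Path path = new Path(args[0]);
        FileSystem fs = path.getFileSystem(conf);
        FileStatus stat = fs.getFileStatus(path);

        // Print each block of the file and the hosts the filesystem claims
        // hold it -- this is what the input splits end up seeing.
        BlockLocation[] blocks = fs.getFileBlockLocations(stat, 0, stat.getLen());
        for (BlockLocation b : blocks) {
            System.out.println("offset=" + b.getOffset()
                    + " length=" + b.getLength()
                    + " hosts=" + Arrays.toString(b.getHosts()));
        }
    }
}

If the hosts come back empty (or as localhost) for every block, the locality
information is being lost before it ever reaches the jobtracker.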
Yes, all of the code needed to get the locality information should be
present in the version of the jar file you referenced. We have tested
to make sure the right data is available, but have not extensively
tested that it is being used correctly by core Hadoop (e.g. that it is
being correctly propagated out of CephFileSystem). IIRC, fixing this
/should/ be pretty easy; it's mostly a matter of fiddling with
getFileBlockLocations().
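
Roughly, the shape of that fix is something like the sketch below. To be
clear, this is illustrative only: osdHostsForExtent() is a hypothetical
stand-in for whatever lookup the CephFS bindings provide to go from a file
extent to the OSDs (and then hosts) that store it, and the real code
belongs inside CephFileSystem itself.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;

public abstract class BlockLocationSketch {

    // Hypothetical stand-in: given a byte range of a file, return the
    // hostnames of the OSDs that store it (e.g. "vega7249" for the extent
    // held by osd.2 in the example above).
    protected abstract String[] osdHostsForExtent(Path path, long offset, long length)
            throws IOException;

    public BlockLocation[] getFileBlockLocations(FileStatus file, long start, long len)
            throws IOException {
        if (file == null) {
            return null;
        }
        long blockSize = file.getBlockSize();
        if (blockSize <= 0) {
            blockSize = 64L * 1024 * 1024;  // fall back to the 64 MB object size above
        }
        long end = Math.min(start + len, file.getLen());

        List<BlockLocation> locations = new ArrayList<BlockLocation>();
        // Walk the requested range one block (object) at a time.
        for (long offset = (start / blockSize) * blockSize; offset < end; offset += blockSize) {
            long length = Math.min(blockSize, file.getLen() - offset);
            String[] hosts = osdHostsForExtent(file.getPath(), offset, length);
            // Using hostnames for both names and hosts is a simplification;
            // the names are normally host:port pairs.
            locations.add(new BlockLocation(hosts, hosts, offset, length));
        }
        return locations.toArray(new BlockLocation[locations.size()]);
    }
}

As long as the hostnames returned here match the names the tasktrackers
report themselves under, the jobtracker should start picking data-local
tasks.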
On Mon, Jul 8, 2013 at 1:25 PM, ker can <kercan74@xxxxxxxxx> wrote:
> Hi Noah,
>
> I'm using the CephFS jar from ...
> http://ceph.com/download/hadoop-cephfs.jar
> I believe this is built from hadoop-common/cephfs/branch-1.0?
>
> If that's the case, I should already be using an implementation that's got
> getFileBlockLocations() ... which is here
> https://github.com/ceph/hadoop-common/blob/cephfs/branch-1.0/src/core/org/apache/hadoop/fs/ceph/CephFileSystem.java
>
> Is there a command-line tool that I can use to verify the results from
> getFileBlockLocations()?
>
> thanks
> KC
>
>
>
> On Mon, Jul 8, 2013 at 3:09 PM, Noah Watkins <noah.watkins@xxxxxxxxxxx>
> wrote:
>>
>> Hi KC,
>>
>> The locality information is now collected and available to Hadoop
>> through the CephFS API, so fixing this is certainly possible. However,
>> there has not been extensive testing. I think the tasks that need to
>> be completed are (1) make sure that `CephFileSystem` is encoding the
>> correct block location in `getFileBlockLocations` (which I think is
>> already done, but does need to be verified), and (2) make sure
>> rack information is available in the jobtracker, or optionally use a
>> flat hierarchy (i.e. default-rack).
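>>
>> For (2), if host-level locality is enough to start with, a flat mapping
>> along these lines should work (just a minimal sketch; it would be wired
>> in with the topology.node.switch.mapping.impl property):
>>
>> import java.util.ArrayList;
>> import java.util.List;
>> import org.apache.hadoop.net.DNSToSwitchMapping;
>> import org.apache.hadoop.net.NetworkTopology;
>>
>> // Map every node to /default-rack so the jobtracker can still make
>> // host-local assignments without any real rack information.
>> public class FlatRackMapping implements DNSToSwitchMapping {
>>     public List<String> resolve(List<String> names) {
>>         List<String> racks = new ArrayList<String>(names.size());
>>         for (int i = 0; i < names.size(); i++) {
>>             racks.add(NetworkTopology.DEFAULT_RACK);  // "/default-rack"
>>         }
>>         return racks;
>>     }
>> }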
>>
>> On Mon, Jul 8, 2013 at 12:47 PM, ker can <kercan74@xxxxxxxxx> wrote:
>> > Hi There,
>> >
>> > I'm test-driving Hadoop with CephFS as the storage layer. I was running
>> > the Terasort benchmark and I noticed a lot of network IO activity when
>> > compared to an HDFS storage layer setup. (It's a half-terabyte sort
>> > workload across two data nodes.)
>> >
>> > Digging into the job tracker logs a little, I noticed that all the map
>> > tasks were being assigned to process a split (block) on non-local nodes
>> > (which explains all the network activity during the map phase).
>> >
>> > With Ceph:
>> >
>> >
>> > 2013-07-08 11:19:53,535 INFO org.apache.hadoop.mapred.JobInProgress: Input
>> > size for job job_201307081115_0001 = 500000000000. Number of splits = 7452
>> > 2013-07-08 11:19:53,538 INFO org.apache.hadoop.mapred.JobInProgress: Job
>> > job_201307081115_0001 initialized successfully with 7452 map tasks and 32
>> > reduce tasks.
>> >
>> > 2013-07-08 11:19:54,836 INFO org.apache.hadoop.mapred.JobInProgress:
>> > Choosing a non-local task task_201307081115_0001_m_000000
>> > 2013-07-08 11:19:54,836 INFO org.apache.hadoop.mapred.JobTracker: Adding
>> > task (MAP) 'attempt_201307081115_0001_m_000000_0' to tip
>> > task_201307081115_0001_m_000000, for tracker
>> > 'tracker_vega7250:localhost/127.0.0.1:35422'
>> >
>> > 2013-07-08 11:19:54,990 INFO org.apache.hadoop.mapred.JobInProgress:
>> > Choosing a non-local task task_201307081115_0001_m_000001
>> > 2013-07-08 11:19:54,990 INFO org.apache.hadoop.mapred.JobTracker: Adding
>> > task (MAP) 'attempt_201307081115_0001_m_000001_0' to tip
>> > task_201307081115_0001_m_000001, for tracker
>> > 'tracker_vega7249:localhost/127.0.0.1:36725'
>> >
>> > ... and so on.
>> >
>> > In comparison, with HDFS the job tracker logs looked something like
>> > this. The map tasks were being assigned to process data blocks on the
>> > local nodes.
>> >
>> > 2013-07-08 03:55:32,656 INFO org.apache.hadoop.mapred.JobInProgress: Input
>> > size for job job_201307080351_0001 = 500000000000. Number of splits = 7452
>> > 2013-07-08 03:55:32,657 INFO org.apache.hadoop.mapred.JobInProgress:
>> > tip:task_201307080351_0001_m_000000 has split on
>> > node:/default-rack/vega7247
>> > 2013-07-08 03:55:32,657 INFO org.apache.hadoop.mapred.JobInProgress:
>> > tip:task_201307080351_0001_m_000001 has split on
>> > node:/default-rack/vega7247
>> > 2013-07-08 03:55:34,474 INFO org.apache.hadoop.mapred.JobTracker: Adding
>> > task (MAP) 'attempt_201307080351_0001_m_000000_0' to tip
>> > task_201307080351_0001_m_000000, for tracker
>> > 'tracker_vega7247:localhost/127.0.0.1:43320'
>> > 2013-07-08 03:55:34,475 INFO org.apache.hadoop.mapred.JobInProgress:
>> > Choosing data-local task task_201307080351_0001_m_000000
>> > 2013-07-08 03:55:34,475 INFO org.apache.hadoop.mapred.JobTracker: Adding
>> > task (MAP) 'attempt_201307080351_0001_m_000001_0' to tip
>> > task_201307080351_0001_m_000001, for tracker
>> > 'tracker_vega7247:localhost/127.0.0.1:43320'
>> > 2013-07-08 03:55:34,475 INFO org.apache.hadoop.mapred.JobInProgress:
>> > Choosing data-local task task_201307080351_0001_m_000001
>> >
>> > Version Info:
>> > ceph version 0.61.4
>> > hadoop 1.1.2
>> >
>> > Has anyone else run into this?
>> >
>> > Thanks
>> > KC
>> >
>> >
>
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com