Re: Hadoop/Ceph and DFS IO tests

ker can <kercan74@xxxxxxxxx> · Wed, 10 Jul 2013 11:17:29 -0500

Hi Noah,

Some results for the read tests:

I set client_readahead_min=4193404, which is also the default for
Hadoop's dfs.datanode.readahead.bytes. I ran the DFSIO read test 6
times each for HDFS, Ceph with the default read-ahead, and Ceph with
readahead=4193404. Setting read-ahead in Ceph gave about a 10% overall
improvement over the default values. The HDFS average is still slightly
better (258 vs. 285), but there was a lot more run-to-run variation for
HDFS, perhaps some caching going on there.

Seems like a good read-ahead value for the Ceph Hadoop client to use as a default!

I'll look at the DFS write tests later today. Any tuning suggestions you can think of there? I was thinking of trying a larger journal size and moving the journal to a separate disk. Anything else?

For HDFS DFSIO read test:

Average execution time: 258
Best execution time: 149
Worst execution time: 361

For Ceph with default read-ahead setting:

Average execution time: 316
Best execution time: 296
Worst execution time: 358

For Ceph with read-ahead setting = 4193404:

Average execution time: 285
Best execution time: 277
Worst execution time: 294
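
For reference, the percentages can be recomputed from the three result sets above. This is only a minimal Python sketch: the dictionary layout and the helper name are made up for illustration, and the times are in whatever unit the DFSIO run reported (it is not stated here).

```python
# Execution times copied from the DFSIO read results above
# (unit as reported by the test; not stated in this message).
results = {
    "hdfs": {"avg": 258, "best": 149, "worst": 361},
    "ceph_default": {"avg": 316, "best": 296, "worst": 358},
    "ceph_readahead_4193404": {"avg": 285, "best": 277, "worst": 294},
}


def percent_slower(time, baseline):
    """Return how much longer `time` is than `baseline`, in percent of baseline."""
    return (time - baseline) / baseline * 100.0


hdfs_avg = results["hdfs"]["avg"]
default_avg = results["ceph_default"]["avg"]
tuned_avg = results["ceph_readahead_4193404"]["avg"]

# Improvement of tuned Ceph over default Ceph, relative to the default run.
readahead_gain = (default_avg - tuned_avg) / default_avg * 100.0

# Remaining gap between each Ceph configuration and HDFS.
gap_default = percent_slower(default_avg, hdfs_avg)
gap_tuned = percent_slower(tuned_avg, hdfs_avg)

# Run-to-run variation, expressed as worst minus best.
spread = {name: r["worst"] - r["best"] for name, r in results.items()}

print(f"read-ahead gain over Ceph default: {readahead_gain:.1f}%")
print(f"Ceph default vs HDFS: {gap_default:.1f}% slower")
print(f"Ceph tuned vs HDFS:   {gap_tuned:.1f}% slower")
print(f"worst-best spread: {spread}")
```

By these numbers read-ahead cuts the Ceph average by roughly 9.8%, leaving it about 10.5% behind the HDFS average, while the HDFS best-to-worst spread (212) is much wider than either Ceph run (62 and 17).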

I didn't set max bytes. I guess the default is zero, which means no max?

I tried increasing readahead max periods to 8, but it didn't look like a good change.

Thanks!
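
To put the max periods setting in perspective, here is a rough Python sketch of how the read-ahead ceiling would scale. It rests on two assumptions that were not verified on this cluster: the file layout is 4 MB objects with a stripe count of 1, and a max bytes value of 0 means "no byte limit". The function is illustrative only and is not the libcephfs implementation.

```python
MIB = 1024 * 1024


def readahead_ceiling(max_bytes, max_periods, object_size, stripe_count):
    """Estimate the largest read-ahead window, in bytes.

    Assumption: a value of 0 for either limit means "not limited by this
    setting". Returns None if neither limit applies.
    """
    # Per the option's own comment, a period is object size * num stripes.
    period = object_size * stripe_count
    limits = []
    if max_bytes > 0:
        limits.append(max_bytes)
    if max_periods > 0:
        limits.append(max_periods * period)
    return min(limits) if limits else None


# Assumed layout: 4 MiB objects, stripe count 1 (hypothetical for this cluster).
default_cap = readahead_ceiling(0, 4, 4 * MIB, 1)
raised_cap = readahead_ceiling(0, 8, 4 * MIB, 1)

print(default_cap // MIB, "MiB with max periods = 4")
print(raised_cap // MIB, "MiB with max periods = 8")
```

Under those assumptions the ceiling goes from 16 MiB to 32 MiB when max periods goes from 4 to 8, both already well above the 4193404-byte minimum used in the test.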

On Wed, Jul 10, 2013 at 10:56 AM, Noah Watkins <noah.watkins@xxxxxxxxxxx> wrote:

Hey KC,

I wanted to follow up on this, but ran out of time yesterday. To set
the options in ceph.conf you can do something like

[client]
    client readahead min = blah
    client readahead max bytes = blah
    client readahead max periods = blah

then just make sure that your client is pointing to a ceph.conf with
these settings.
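
As a concrete illustration, the [client] section could be generated with a few lines of Python. This is a sketch under assumptions: it uses the full option names from the OPTION() definitions quoted further down in this thread (client_readahead_min and friends, written with spaces), the value 4193404 from KC's test, and a hypothetical helper name.

```python
import configparser
import io


def render_client_section(readahead_min, max_bytes=None, max_periods=None):
    """Render a ceph.conf [client] section with read-ahead options as text.

    Option names follow the OPTION() definitions quoted in this thread;
    limits left as None are omitted so the built-in defaults apply.
    """
    config = configparser.ConfigParser()
    config["client"] = {"client readahead min": str(readahead_min)}
    if max_bytes is not None:
        config["client"]["client readahead max bytes"] = str(max_bytes)
    if max_periods is not None:
        config["client"]["client readahead max periods"] = str(max_periods)

    buf = io.StringIO()
    config.write(buf)
    return buf.getvalue()


# Value taken from the read test in this thread.
text = render_client_section(4193404)
print(text)
```

The printed section can be pasted into whichever ceph.conf the Hadoop client actually reads.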

On Tue, Jul 9, 2013 at 4:32 PM, Noah Watkins <noah.watkins@xxxxxxxxxxx> wrote:

> Yes, the libcephfs client. You should be able to adjust the settings
> without changing any code. The settings should be adjustable either by
> setting the config options in ceph.conf, or using the
> "ceph.conf.options" settings in Hadoop's core-site.xml.

>

> On Tue, Jul 9, 2013 at 4:26 PM, ker can <kercan74@xxxxxxxxx> wrote:

>> Makes sense. I can try playing around with these settings ... when you
>> say client, would this be libcephfs.so?
>>
>> On Tue, Jul 9, 2013 at 5:35 PM, Noah Watkins <noah.watkins@xxxxxxxxxxx>
>> wrote:

>>>

>>> Greg pointed out the read-ahead client options. I would suggest
>>> fiddling with these settings. If things improve, we can put automatic
>>> configuration of these settings into the Hadoop client itself. At the
>>> very least, we should be able to see if it is the read-ahead that is
>>> causing performance problems.
>>>
>>> OPTION(client_readahead_min, OPT_LONGLONG, 128*1024) // readahead at _least_ this much.
>>> OPTION(client_readahead_max_bytes, OPT_LONGLONG, 0) //8 * 1024*1024
>>> OPTION(client_readahead_max_periods, OPT_LONGLONG, 4) // as multiple of file layout period (object size * num stripes)
>>>
>>> -Noah

>>>

>>>

>>> On Tue, Jul 9, 2013 at 3:27 PM, Noah Watkins <noah.watkins@xxxxxxxxxxx>

>>> wrote:

>>> >> Is the JNI interface still an issue or have we moved past that?
>>> >
>>> > We haven't done much performance tuning with Hadoop, but I suspect
>>> > that the JNI interface is not a bottleneck.
>>> >
>>> > My very first thought about what might be causing slow read
>>> > performance is the read-ahead settings we use vs Hadoop. Hadoop should
>>> > be performing big, efficient, block-size reads and caching these in
>>> > each map task. However, I think we are probably doing lots of small
>>> > reads on demand. That would certainly hurt performance.
>>> >
>>> > In fact, in CephInputStream.java I see we are doing buffer-sized
>>> > reads. Which, at least in my tree, turn out to be 4096 bytes :)
>>> >
>>> > So, there are two issues now. First, the C-Java barrier is being crossed
>>> > a lot (16K times for a 64MB block). That's probably not a huge
>>> > overhead, but it might be something. The second is read-ahead. I'm not
>>> > sure how much read-ahead the libcephfs client is performing, but the
>>> > more round trips it's doing the more overhead we would incur.

>>> >

>>> >

>>> >>

>>> >> thanks !

>>> >>

>>> >>

>>> >>

>>> >>

>>> >> On Tue, Jul 9, 2013 at 3:01 PM, ker can <kercan74@xxxxxxxxx> wrote:

>>> >>>

>>> >>> For this particular test I turned off replication for both hdfs and
>>> >>> ceph. So there is just one copy of the data lying around.
>>> >>>
>>> >>> hadoop@vega7250:~$ ceph osd dump | grep rep
>>> >>> pool 0 'data' rep size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 960 pgp_num 960 last_change 26 owner 0 crash_replay_interval 45
>>> >>> pool 1 'metadata' rep size 2 min_size 1 crush_ruleset 1 object_hash rjenkins pg_num 960 pgp_num 960 last_change 1 owner 0
>>> >>> pool 2 'rbd' rep size 2 min_size 1 crush_ruleset 2 object_hash rjenkins pg_num 960 pgp_num 960 last_change 1 owner 0

>>> >>>

>>> >>> From hdfs-site.xml:
>>> >>>
>>> >>>   <property>
>>> >>>     <name>dfs.replication</name>
>>> >>>     <value>1</value>
>>> >>>   </property>

>>> >>>

>>> >>>

>>> >>>

>>> >>>

>>> >>>

>>> >>> On Tue, Jul 9, 2013 at 2:44 PM, Noah Watkins

>>> >>> <noah.watkins@xxxxxxxxxxx>

>>> >>> wrote:

>>> >>>>

>>> >>>> On Tue, Jul 9, 2013 at 12:35 PM, ker can <kercan74@xxxxxxxxx> wrote:
>>> >>>> > hi Noah,
>>> >>>> >
>>> >>>> > while we're still on the hadoop topic ... I was also trying out the
>>> >>>> > TestDFSIO tests, ceph vs. hadoop. The read tests on ceph take about
>>> >>>> > 1.5x the hdfs time. The write tests are worse, about 2.5x the time
>>> >>>> > on hdfs, but I guess we have additional journaling overheads for
>>> >>>> > the writes on ceph. But there should be no such overheads for the
>>> >>>> > reads?
>>> >>>>
>>> >>>> Out of the box Hadoop will keep 3 copies, and Ceph 2, so it could be
>>> >>>> the case that reads are slower because there is less opportunity for
>>> >>>> scheduling local reads. You can create a new pool with replication=3
>>> >>>> and test this out (documentation on how to do this is on
>>> >>>> http://ceph.com/docs/wip-hadoop-doc/cephfs/hadoop/).
>>> >>>>
>>> >>>> As for writes, Hadoop will write 2 remote and 1 local blocks, however
>>> >>>> Ceph will write all copies remotely, so there is some overhead for
>>> >>>> the extra remote object write (compared to Hadoop), but I wouldn't
>>> >>>> have expected 2.5x. It might be useful to run dd or something like
>>> >>>> that on Ceph to see if the numbers make sense to rule out Hadoop as
>>> >>>> the bottleneck.
>>> >>>>
>>> >>>> -Noah

>>> >>>

>>> >>>

>>> >>

>>

>>

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com