Re: Hadoop/Ceph and DFS IO tests

ker can <kercan74@xxxxxxxxx> · Wed, 10 Jul 2013 20:23:10 -0500

Ran the DFS IO write tests:

- Increasing the journal size did not make any difference for me ... I guess the number I had set was already sufficient. For the rest of the tests I kept it at a generous 10GB.

- Separating out the journal from the data disk did make a difference, as expected. Unfortunately I currently do not have access to SSDs, so for now I used a separate disk as the journal for each data disk.

HDFS write numbers (7 disks/data node):

Average execution time: 236
Best execution time:    219
Worst execution time:   254

Ceph write numbers with journal + data on the same disk (7 disks/data node):

Average execution time: 494
Best execution time:    468
Worst execution time:   524

So Ceph was about 2x slower for the average case (494 vs. 236) when journal and data were on the same disk.

Now separating out the journal from the data disk ...

HDFS write numbers (3 disks/data node):

Average execution time: 466
Best execution time:    426
Worst execution time:   508

Ceph write numbers (3 data disks/data node + 3 journal disks/data node):

Average execution time: 610
Best execution time:    593
Worst execution time:   635

So Ceph was about 1.3x slower for the average case (610 vs. 466) when journal and data are separated. That is roughly a 70% reduction in Ceph's overhead relative to HDFS compared with the case where journal and data share a disk, but still a bit off from the HDFS performance. Not knowing of other Ceph knobs I can play with, I'll have to leave it at that. I'll see if I can get some system profiling done to narrow down where we're spending time.
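
For anyone who wants to check the ratios, here is a minimal Python sketch that recomputes them from the average write times reported above. The dictionary layout and the names (`runs`, `slowdown`) are only for illustration and are not part of TestDFSIO or Ceph. The times are taken as reported, whatever their unit.

```python
# Average write execution times copied from the results above
# (unit as printed by the test; it is not stated in this message).
runs = {
    # 7 disks per data node, Ceph journal and data on the same disk
    "same_disk": {"hdfs": 236, "ceph": 494},
    # 3 data disks per data node, plus 3 journal disks for Ceph
    "separate_journal": {"hdfs": 466, "ceph": 610},
}


def slowdown(hdfs, ceph):
    """Return how many times longer the Ceph run took than the HDFS run."""
    return ceph / hdfs


same = slowdown(**runs["same_disk"])
sep = slowdown(**runs["separate_journal"])

# Overhead is the part of the slowdown beyond parity (1.0x) with HDFS.
overhead_reduction = ((same - 1) - (sep - 1)) / (same - 1)

print(f"journal + data on same disk: {same:.2f}x the HDFS time")
print(f"journal on separate disk:    {sep:.2f}x the HDFS time")
print(f"reduction in overhead:       {overhead_reduction:.0%}")
```

Note that the two setups use different disk counts (7 vs. 3 per data node), so only the Ceph/HDFS ratio within each setup is compared, not the absolute times across setups.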

thanks y'all

On Wed, Jul 10, 2013 at 11:35 AM, Noah Watkins <noah.watkins@xxxxxxxxxxx> wrote:

On Wed, Jul 10, 2013 at 9:17 AM, ker can <kercan74@xxxxxxxxx> wrote:
>
> Seems like a good read-ahead value that the ceph hadoop client can use as a
> default!

Great, I'll add this tunable to the list of changes to be pushed into
the next release.

> I'll look at the DFS write tests later today .... any tuning suggestions you
> can think of there? I was thinking of trying out increasing the journal size
> and separating out the journaling to a separate disk. Anything else?

I expect that you will see a lot of improvement by moving the journal
to a separate physical device, so I would start there.

As for journal size tuning, I'm not completely sure, but there may be
an opportunity to optimize for Hadoop workloads. I think ceph.com/docs
has some general guidelines. Maybe someone more knowledgeable than me
can chime in on the trade-offs.

>
> For hdfs dfsio read test:
>
> Average execution time: 258
> Best execution time: 149
> Worst execution time: 361
>
> For ceph with default read ahead setting:
>
> Average execution time: 316
> Best execution time: 296
> Worst execution time: 358
>
> For ceph with read ahead setting = 4193404
>
> Average execution time: 285
> Best execution time: 277
> Worst execution time: 294

This is looking pretty good. I'd really like to work on that best
execution time for Ceph. I wonder if there are any Hadoop profiling
tools... narrowing down where time is being taken up would be very
useful.

Thanks!
Noah

>
> I didn't set max bytes ... I guess the default is zero, which means no max?
> I tried increasing the readahead max periods to 8 .. didn't look like a good
> change.
>
> thanks !
>
> On Wed, Jul 10, 2013 at 10:56 AM, Noah Watkins <noah.watkins@xxxxxxxxxxx>
> wrote:
>>
>> Hey KC,
>>
>> I wanted to follow up on this, but ran out of time yesterday. To set
>> the options in ceph.conf you can do something like
>>
>> [client]
>>     client readahead min = blah
>>     client readahead max bytes = blah
>>     client readahead max periods = blah
>>
>> then just make sure that your client is pointing to a ceph.conf with
>> these settings.
>>
>> On Tue, Jul 9, 2013 at 4:32 PM, Noah Watkins <noah.watkins@xxxxxxxxxxx>
>> wrote:
>> > Yes, the libcephfs client. You should be able to adjust the settings
>> > without changing any code. The settings should be adjustable either by
>> > setting the config options in ceph.conf, or using the
>> > "ceph.conf.options" settings in Hadoop's core-site.xml.
>> >
>> > On Tue, Jul 9, 2013 at 4:26 PM, ker can <kercan74@xxxxxxxxx> wrote:
>> >> Makes sense. I can try playing around with these settings .... when
>> >> you're saying client, would this be libcephfs.so?
>> >>
>> >> On Tue, Jul 9, 2013 at 5:35 PM, Noah Watkins <noah.watkins@xxxxxxxxxxx>
>> >> wrote:
>> >>>
>> >>> Greg pointed out the read-ahead client options. I would suggest
>> >>> fiddling with these settings. If things improve, we can put automatic
>> >>> configuration of these settings into the Hadoop client itself. At the
>> >>> very least, we should be able to see if it is the read-ahead that is
>> >>> causing performance problems.
>> >>>
>> >>> OPTION(client_readahead_min, OPT_LONGLONG, 128*1024) // readahead at _least_ this much.
>> >>> OPTION(client_readahead_max_bytes, OPT_LONGLONG, 0) //8 * 1024*1024
>> >>> OPTION(client_readahead_max_periods, OPT_LONGLONG, 4) // as multiple of file layout period (object size * num stripes)
>> >>>
>> >>> -Noah
>> >>>
>> >>> On Tue, Jul 9, 2013 at 3:27 PM, Noah Watkins <noah.watkins@xxxxxxxxxxx>
>> >>> wrote:
>> >>> >> Is the JNI interface still an issue or have we moved past that ?
>> >>> >
>> >>> > We haven't done much performance tuning with Hadoop, but I suspect
>> >>> > that the JNI interface is not a bottleneck.
>> >>> >
>> >>> > My very first thought about what might be causing slow read
>> >>> > performance is the read-ahead settings we use vs Hadoop. Hadoop should
>> >>> > be performing big, efficient, block-size reads and caching these in
>> >>> > each map task. However, I think we are probably doing lots of small
>> >>> > reads on demand. That would certainly hurt performance.
>> >>> >
>> >>> > In fact, in CephInputStream.java I see we are doing buffer-sized
>> >>> > reads. Which, at least in my tree, turn out to be 4096 bytes :)
>> >>> >
>> >>> > So, there are two issues now. First, the C-Java barrier is being
>> >>> > crossed a lot (16K times for a 64MB block). That's probably not a huge
>> >>> > overhead, but it might be something. The second is read-ahead. I'm not
>> >>> > sure how much read-ahead the libcephfs client is performing, but the
>> >>> > more round trips it's doing the more overhead we would incur.
>> >>> >
>> >>> >>
>> >>> >> thanks !
>> >>> >>
>> >>> >> On Tue, Jul 9, 2013 at 3:01 PM, ker can <kercan74@xxxxxxxxx> wrote:
>> >>> >>>
>> >>> >>> For this particular test I turned off replication for both hdfs
>> >>> >>> and ceph. So there is just one copy of the data lying around.
>> >>> >>>
>> >>> >>> hadoop@vega7250:~$ ceph osd dump | grep rep
>> >>> >>> pool 0 'data' rep size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 960 pgp_num 960 last_change 26 owner 0 crash_replay_interval 45
>> >>> >>> pool 1 'metadata' rep size 2 min_size 1 crush_ruleset 1 object_hash rjenkins pg_num 960 pgp_num 960 last_change 1 owner 0
>> >>> >>> pool 2 'rbd' rep size 2 min_size 1 crush_ruleset 2 object_hash rjenkins pg_num 960 pgp_num 960 last_change 1 owner 0
>> >>> >>>
>> >>> >>> From hdfs-site.xml:
>> >>> >>>
>> >>> >>>   <property>
>> >>> >>>     <name>dfs.replication</name>
>> >>> >>>     <value>1</value>
>> >>> >>>   </property>
>> >>> >>>
>> >>> >>> On Tue, Jul 9, 2013 at 2:44 PM, Noah Watkins <noah.watkins@xxxxxxxxxxx>
>> >>> >>> wrote:
>> >>> >>>>
>> >>> >>>> On Tue, Jul 9, 2013 at 12:35 PM, ker can <kercan74@xxxxxxxxx>
>> >>> >>>> wrote:
>> >>> >>>> > hi Noah,
>> >>> >>>> >
>> >>> >>>> > while we're still on the hadoop topic ... I was also trying out
>> >>> >>>> > the TestDFSIO tests, ceph v/s hadoop. The read tests on ceph take
>> >>> >>>> > about 1.5x the hdfs time. The write tests are worse ... about 2.5x
>> >>> >>>> > the time on hdfs, but I guess we have additional journaling
>> >>> >>>> > overheads for the writes on ceph.
>> >>> >>>> > But there should be no such overheads for the read ?
>> >>> >>>>
>> >>> >>>> Out of the box Hadoop will keep 3 copies, and Ceph 2, so it could
>> >>> >>>> be the case that reads are slower because there is less opportunity
>> >>> >>>> for scheduling local reads. You can create a new pool with
>> >>> >>>> replication=3 and test this out (documentation on how to do this
>> >>> >>>> is on http://ceph.com/docs/wip-hadoop-doc/cephfs/hadoop/).
>> >>> >>>>
>> >>> >>>> As for writes, Hadoop will write 2 remote and 1 local blocks,
>> >>> >>>> however Ceph will write all copies remotely, so there is some
>> >>> >>>> overhead for the extra remote object write (compared to Hadoop),
>> >>> >>>> but I wouldn't have expected 2.5x. It might be useful to run dd or
>> >>> >>>> something like that on Ceph to see if the numbers make sense to
>> >>> >>>> rule out Hadoop as the bottleneck.
>> >>> >>>>
>> >>> >>>> -Noah
