Ran the DFS IO write tests:
HDFS write numbers (7 disks/data node)
Best execution time : 219
Worst execution time : 254
Ceph write numbers with journals + data on same disk (7 disks)
Average execution time: 494
Best execution time : 468
Worst execution time : 524
So ceph was about 2x slower for the average case when journal & data were on the same disk.
Now separating out the journal from the data disk ...
HDFS write numbers (3 disks/data node)
Average execution time: 466
Best execution time : 426
Worst execution time : 508
ceph write numbers (3 data disks/data node + 3 journal disks/data node)
Average execution time: 610
Best execution time : 593
Worst execution time : 635
So ceph was about 1.3x slower for the average case when journal & data are separated ... a 70% improvement over the case where journal + data are on the same disk, but still a bit off from the HDFS performance. Not knowing of other ceph knobs I can play with, I'll have to leave it at that. I'll see if I can get some system profiling done to narrow down where we're spending time.
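For reference, these runs used Hadoop's TestDFSIO benchmark; a typical invocation looks something like the sketch below (the jar name, file count and file size are illustrative assumptions, not the exact parameters of these runs):

    # TestDFSIO write test: 10 files of 1000 MB each (parameters illustrative)
    hadoop jar hadoop-test.jar TestDFSIO -write -nrFiles 10 -fileSize 1000
    # matching read test over the files produced by the write run
    hadoop jar hadoop-test.jar TestDFSIO -read -nrFiles 10 -fileSize 1000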
thanks y'all
On Wed, Jul 10, 2013 at 11:35 AM, Noah Watkins <noah.watkins@xxxxxxxxxxx> wrote:
On Wed, Jul 10, 2013 at 9:17 AM, ker can <kercan74@xxxxxxxxx> wrote:
>
> Seems like a good read ahead value that the ceph hadoop client can use as a
> default !

Great, I'll add this tunable to the list of changes to be pushed into
the next release.

> I'll look at the DFS write tests later today .... any tuning suggestions you
> can think of there. I was thinking of trying out increasing the journal size
> and separating out the journaling to a separate disk. Anything else ?

I expect that you will see a lot of improvement by moving the journal
to a separate physical device, so I would start there.

As for journal size tuning, I'm not completely sure, but there may be
an opportunity to optimize for Hadoop workloads. I think ceph.com/docs
has some general guidelines. Maybe someone more knowledgeable than me
can chime in on the trade-offs.
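A minimal ceph.conf sketch of the journal split being suggested here (the device path and journal size are assumptions, not recommendations):

    [osd]
        ; put the journal on a dedicated device instead of the data disk
        osd journal = /dev/sdg1      ; hypothetical journal partition
        osd journal size = 10240     ; journal size in MB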
>
> For hdfs dfsio read test:
>
> Average execution time: 258
> Best execution time: 149
> Worst execution time: 361
>
> For ceph with default read ahead setting:
>
> Average execution time: 316
> Best execution time: 296
> Worst execution time: 358
>
> For ceph with read ahead setting = 4193404
>
> Average execution time: 285
> Best execution time: 277
> Worst execution time: 294

This is looking pretty good. I'd really like to work on that best
execution time for Ceph. I wonder if there are any Hadoop profiling
tools... narrowing down where time is being taken up would be very
useful.
Thanks!
Noah
>
> I didn't set max bytes ... I guess the default is zero which means no max ?
> I tried increasing the readahead max periods to 8 .. didn't look like a good
> change.
>
> thanks !
>
>
>
>
> On Wed, Jul 10, 2013 at 10:56 AM, Noah Watkins <noah.watkins@xxxxxxxxxxx>
> wrote:
>>
>> Hey KC,
>>
>> I wanted to follow up on this, but ran out of time yesterday. To set
>> the options in ceph.conf you can do something like
>>
>> [client]
>> client readahead min = blah
>> client readahead max bytes = blah
>> client readahead max periods = blah
>>
>> then just make sure that your client is pointing to a ceph.conf with
>> these settings.
>>
>>
>> On Tue, Jul 9, 2013 at 4:32 PM, Noah Watkins <noah.watkins@xxxxxxxxxxx>
>> wrote:
>> > Yes, the libcephfs client. You should be able to adjust the settings
>> > without changing any code, either by setting the config options in
>> > ceph.conf or by using the "ceph.conf.options" setting in Hadoop's
>> > core-site.xml.
>> >
>> > On Tue, Jul 9, 2013 at 4:26 PM, ker can <kercan74@xxxxxxxxx> wrote:
>> >> Makes sense. I can try playing around with these settings .... when
>> >> you're saying client, would this be libcephfs.so ?
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> On Tue, Jul 9, 2013 at 5:35 PM, Noah Watkins <noah.watkins@xxxxxxxxxxx>
>> >> wrote:
>> >>>
>> >>> Greg pointed out the read-ahead client options. I would suggest
>> >>> fiddling with these settings. If things improve, we can put automatic
>> >>> configuration of these settings into the Hadoop client itself. At the
>> >>> very least, we should be able to see if it is the read-ahead that is
>> >>> causing performance problems.
>> >>>
>> >>> OPTION(client_readahead_min, OPT_LONGLONG, 128*1024) // readahead at
>> >>> _least_ this much.
>> >>> OPTION(client_readahead_max_bytes, OPT_LONGLONG, 0) //8 * 1024*1024
>> >>> OPTION(client_readahead_max_periods, OPT_LONGLONG, 4) // as multiple
>> >>> of file layout period (object size * num stripes)
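To make those defaults concrete: with CephFS's default 4 MB object size and a single-stripe layout, the file layout period is 4 MB, so client_readahead_max_periods = 4 allows up to 16 MB of readahead, and the 4193404-byte setting tested earlier in this thread is right around one object. (This assumes the default layout; a custom object size or stripe count changes the period accordingly.)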
>> >>>
>> >>> -Noah
>> >>>
>> >>>
>> >>> On Tue, Jul 9, 2013 at 3:27 PM, Noah Watkins
>> >>> <noah.watkins@xxxxxxxxxxx>
>> >>> wrote:
>> >>> >> Is the JNI interface still an issue or have we moved past that ?
>> >>> >
>> >>> > We haven't done much performance tuning with Hadoop, but I suspect
>> >>> > that the JNI interface is not a bottleneck.
>> >>> >
>> >>> > My very first thought about what might be causing slow read
>> >>> > performance is the read-ahead settings we use vs Hadoop. Hadoop
>> >>> > should be performing big, efficient, block-size reads and caching
>> >>> > these in each map task. However, I think we are probably doing lots
>> >>> > of small reads on demand. That would certainly hurt performance.
>> >>> >
>> >>> > In fact, in CephInputStream.java I see we are doing buffer-sized
>> >>> > reads. Which, at least in my tree, turn out to be 4096 bytes :)
>> >>> >
>> >>> > So, there are two issues now. First, the C-Java barrier is being
>> >>> > crossed a lot (16K times for a 64MB block). That's probably not a
>> >>> > huge overhead, but it might be something. The second is read-ahead.
>> >>> > I'm not sure how much read-ahead the libcephfs client is performing,
>> >>> > but the more round trips it's doing, the more overhead we would incur.
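As an illustration of amortizing those JNI crossings (this is not the actual Hadoop client code, just a sketch using a plain BufferedInputStream; the 4 MB buffer size is an assumption):

    import java.io.BufferedInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    public class BigReadSketch {
        // Wrap a raw stream (think: the JNI-backed Ceph stream) so each
        // underlying read pulls a large chunk across the C-Java barrier,
        // instead of one 4096-byte native read per call.
        public static InputStream wrap(InputStream raw) {
            return new BufferedInputStream(raw, 4 * 1024 * 1024); // assumed 4 MB
        }

        public static void main(String[] args) throws IOException {
            InputStream in = wrap(System.in); // stand-in for a CephInputStream
            byte[] buf = new byte[4096];
            long total = 0;
            int n;
            while ((n = in.read(buf)) > 0) {
                total += n; // callers can keep reading 4 KB at a time, cheaply
            }
            System.out.println("read " + total + " bytes");
        }
    }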
>> >>> >
>> >>> >
>> >>> >>
>> >>> >> thanks !
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >> On Tue, Jul 9, 2013 at 3:01 PM, ker can <kercan74@xxxxxxxxx> wrote:
>> >>> >>>
>> >>> >>> For this particular test I turned off replication for both hdfs
>> >>> >>> and
>> >>> >>> ceph.
>> >>> >>> So there is just one copy of the data lying around.
>> >>> >>>
>> >>> >>> hadoop@vega7250:~$ ceph osd dump | grep rep
>> >>> >>> pool 0 'data' rep size 1 min_size 1 crush_ruleset 0 object_hash
>> >>> >>> rjenkins
>> >>> >>> pg_num 960 pgp_num 960 last_change 26 owner 0
>> >>> >>> crash_replay_interval 45
>> >>> >>> pool 1 'metadata' rep size 2 min_size 1 crush_ruleset 1
>> >>> >>> object_hash
>> >>> >>> rjenkins pg_num 960 pgp_num 960 last_change 1 owner 0
>> >>> >>> pool 2 'rbd' rep size 2 min_size 1 crush_ruleset 2 object_hash
>> >>> >>> rjenkins
>> >>> >>> pg_num 960 pgp_num 960 last_change 1 owner 0
>> >>> >>>
>> >>> >>> From hdfs-site.xml:
>> >>> >>>
>> >>> >>> <property>
>> >>> >>> <name>dfs.replication</name>
>> >>> >>> <value>1</value>
>> >>> >>> </property>
>> >>> >>>
>> >>> >>>
>> >>> >>>
>> >>> >>>
>> >>> >>>
>> >>> >>> On Tue, Jul 9, 2013 at 2:44 PM, Noah Watkins
>> >>> >>> <noah.watkins@xxxxxxxxxxx>
>> >>> >>> wrote:
>> >>> >>>>
>> >>> >>>> On Tue, Jul 9, 2013 at 12:35 PM, ker can <kercan74@xxxxxxxxx>
>> >>> >>>> wrote:
>> >>> >>>> > hi Noah,
>> >>> >>>> >
>> >>> >>>> > while we're still on the hadoop topic ... I was also trying out
>> >>> >>>> > the TestDFSIO tests, ceph v/s hadoop. The read tests on ceph take
>> >>> >>>> > about 1.5x the hdfs time. The write tests are worse, about 2.5x
>> >>> >>>> > the time on hdfs, but I guess we have additional journaling
>> >>> >>>> > overheads for the writes on ceph. But there should be no such
>> >>> >>>> > overheads for the reads ?
>> >>> >>>>
>> >>> >>>> Out of the box Hadoop will keep 3 copies, and Ceph 2, so it could
>> >>> >>>> be the case that reads are slower because there is less
>> >>> >>>> opportunity for scheduling local reads. You can create a new pool
>> >>> >>>> with replication=3 and test this out (documentation on how to do
>> >>> >>>> this is on http://ceph.com/docs/wip-hadoop-doc/cephfs/hadoop/).
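A sketch of that experiment from the CLI, reusing the pg counts visible in the osd dump above (the pool name is arbitrary; wiring the new pool into Hadoop is covered in the linked docs):

    # create a pool and raise its replica count to 3
    ceph osd pool create hadoop-data 960 960
    ceph osd pool set hadoop-data size 3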
>> >>> >>>>
>> >>> >>>> As for writes, Hadoop will write 2 remote and 1 local block,
>> >>> >>>> however Ceph will write all copies remotely, so there is some
>> >>> >>>> overhead for the extra remote object write (compared to Hadoop),
>> >>> >>>> but I wouldn't have expected 2.5x. It might be useful to run dd or
>> >>> >>>> something like that on Ceph to see if the numbers make sense, and
>> >>> >>>> to rule out Hadoop as the bottleneck.
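For instance, a raw streaming write through a CephFS mount (mount point and sizes are placeholders), to compare against the TestDFSIO write numbers:

    # 1 GB sequential write via the CephFS mount, flushed at the end
    dd if=/dev/zero of=/mnt/ceph/ddtest bs=64M count=16 conv=fsync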
>> >>> >>>>
>> >>> >>>> -Noah
>> >>> >>>
>> >>> >>>
>> >>> >>
>> >>
>> >>
>
>