Re: Hadoop/Ceph and DFS IO tests

> Is the JNI interface still an issue or have we moved past that?

We haven't done much performance tuning with Hadoop, but I suspect
that the JNI interface is not a bottleneck.

My first thought about what might be causing the slow read
performance is the read-ahead behavior we use compared to Hadoop's.
Hadoop should be performing big, efficient, block-size reads and
caching them in each map task. However, I think we are probably doing
lots of small reads on demand, which would certainly hurt performance.

In fact, in CephInputStream.java I see we are doing buffer-sized
reads, which, at least in my tree, turn out to be 4096 bytes :)

So, there are two issues now. First, the C-Java barrier is being
crossed a lot (16K times for a 64MB block). That's probably not a huge
overhead, but it might be something. The second is read-ahead: I'm not
sure how much read-ahead the libcephfs client is performing, but the
more round trips it makes, the more overhead we incur.


>
> thanks !
>
>
>
>
> On Tue, Jul 9, 2013 at 3:01 PM, ker can <kercan74@xxxxxxxxx> wrote:
>>
>> For this particular test I turned off replication for both hdfs and ceph.
>> So there is just one copy of the data lying around.
>>
>> hadoop@vega7250:~$ ceph osd dump | grep rep
>> pool 0 'data' rep size 1 min_size 1 crush_ruleset 0 object_hash rjenkins
>> pg_num 960 pgp_num 960 last_change 26 owner 0 crash_replay_interval 45
>> pool 1 'metadata' rep size 2 min_size 1 crush_ruleset 1 object_hash
>> rjenkins pg_num 960 pgp_num 960 last_change 1 owner 0
>> pool 2 'rbd' rep size 2 min_size 1 crush_ruleset 2 object_hash rjenkins
>> pg_num 960 pgp_num 960 last_change 1 owner 0
>>
>> From hdfs-site.xml:
>>
>>   <property>
>>     <name>dfs.replication</name>
>>     <value>1</value>
>>   </property>
>>
>>
>>
>>
>>
>> On Tue, Jul 9, 2013 at 2:44 PM, Noah Watkins <noah.watkins@xxxxxxxxxxx>
>> wrote:
>>>
>>> On Tue, Jul 9, 2013 at 12:35 PM, ker can <kercan74@xxxxxxxxx> wrote:
>>> > hi Noah,
>>> >
>>> > while we're still on the hadoop topic ... I was also trying out the
>>> > TestDFSIO tests, ceph vs hadoop.  The read tests on ceph take about
>>> > 1.5x the hdfs time.  The write tests are worse, about 2.5x the time
>>> > on hdfs, but I guess we have additional journaling overheads for
>>> > the writes on ceph.  But there should be no such overheads for the
>>> > reads?
>>>
>>> Out of the box Hadoop will keep 3 copies, and Ceph 2, so it could be
>>> the case that reads are slower because there is less opportunity for
>>> scheduling local reads. You can create a new pool with replication=3
>>> and test this out (documentation on how to do this is on
>>> http://ceph.com/docs/wip-hadoop-doc/cephfs/hadoop/).
>>>
>>> As for writes, Hadoop will write 2 remote blocks and 1 local block,
>>> whereas Ceph will write all copies remotely, so there is some overhead
>>> for the extra remote object write (compared to Hadoop), but I wouldn't
>>> have expected 2.5x. It might be useful to run dd or something like that
>>> on Ceph to see if the numbers make sense and rule out Hadoop as the
>>> bottleneck.
>>>
>>> -Noah
>>
>>
>