Re: Hadoop/Ceph and DFS IO tests

Noah Watkins <noah.watkins@xxxxxxxxxxx> · Tue, 9 Jul 2013 16:32:41 -0700

Yes, the libcephfs client. You should be able to adjust the settings
without changing any code, either by setting the config options in
ceph.conf or by using the "ceph.conf.options" setting in Hadoop's
core-site.xml.

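For a rough sense of the numbers, here is a small Python sketch (not part of
the Hadoop bindings). It counts how many reads a 64MB block needs at a given
read size, and builds a value for "ceph.conf.options". The helper names, the
comma-separated key=value format, and the trial values are assumptions for
illustration; only the option names come from the list quoted below.

```python
# Sketch only: the values below are illustrative, not tuned recommendations.
BLOCK_SIZE = 64 * 1024 * 1024   # 64MB Hadoop block, as discussed in this thread
BUFFER_SIZE = 4096              # read size seen in CephInputStream.java


def reads_per_block(block_size, read_size):
    """Number of read calls needed to consume one block (ceiling division)."""
    return -(-block_size // read_size)


def conf_options(options):
    """Join options as comma-separated key=value pairs (assumed format)."""
    return ",".join(f"{key}={value}" for key, value in options.items())


# Hypothetical trial values; option names are from the quoted OPTION() list.
readahead = {
    "client_readahead_min": 4 * 1024 * 1024,
    "client_readahead_max_bytes": 64 * 1024 * 1024,
}

print(reads_per_block(BLOCK_SIZE, BUFFER_SIZE))   # 16384 reads at 4096 bytes
print(conf_options(readahead))                    # candidate property value
```

The second printed line is what would go in the <value> of the
"ceph.conf.options" property, if the format assumption holds.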
On Tue, Jul 9, 2013 at 4:26 PM, ker can <kercan74@xxxxxxxxx> wrote:
> Makes sense. I can try playing around with these settings. When you say
> client, would this be libcephfs.so?
>
> On Tue, Jul 9, 2013 at 5:35 PM, Noah Watkins <noah.watkins@xxxxxxxxxxx>
> wrote:
>>
>> Greg pointed out the read-ahead client options. I would suggest
>> fiddling with these settings. If things improve, we can put automatic
>> configuration of these settings into the Hadoop client itself. At the
>> very least, we should be able to see if it is the read-ahead that is
>> causing performance problems.
>>
>> OPTION(client_readahead_min, OPT_LONGLONG, 128*1024) // readahead at _least_ this much.
>> OPTION(client_readahead_max_bytes, OPT_LONGLONG, 0) //8 * 1024*1024
>> OPTION(client_readahead_max_periods, OPT_LONGLONG, 4) // as multiple of file layout period (object size * num stripes)
>>
>> -Noah
>>
>>
>> On Tue, Jul 9, 2013 at 3:27 PM, Noah Watkins <noah.watkins@xxxxxxxxxxx>
>> wrote:
>> >> Is the JNI interface still an issue or have we moved past that ?
>> >
>> > We haven't done much performance tuning with Hadoop, but I suspect
>> > that the JNI interface is not a bottleneck.
>> >
>> > My very first thought about what might be causing slow read
>> > performance is the read-ahead settings we use vs Hadoop. Hadoop should
>> > be performing big, efficient, block-size reads and caching these in
>> > each map task. However, I think we are probably doing lots of small
>> > reads on demand. That would certainly hurt performance.
>> >
>> > In fact, in CephInputStream.java I see we are doing buffer-sized
>> > reads. Which, at least in my tree, turn out to be 4096 bytes :)
>> >
>> > So, there are two issues now. First, the C-Java barrier is being crossed
>> > a lot (16K times for a 64MB block). That's probably not a huge
>> > overhead, but it might be something. The second is read-ahead. I'm not
>> > sure how much read-ahead the libcephfs client is performing, but the
>> > more round trips it's doing, the more overhead we would incur.
>> >
>> >
>> >>
>> >> thanks !
>> >>
>> >> On Tue, Jul 9, 2013 at 3:01 PM, ker can <kercan74@xxxxxxxxx> wrote:
>> >>>
>> >>> For this particular test I turned off replication for both hdfs and
>> >>> ceph, so there is just one copy of the data lying around.
>> >>>
>> >>> hadoop@vega7250:~$ ceph osd dump | grep rep
>> >>> pool 0 'data' rep size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 960 pgp_num 960 last_change 26 owner 0 crash_replay_interval 45
>> >>> pool 1 'metadata' rep size 2 min_size 1 crush_ruleset 1 object_hash rjenkins pg_num 960 pgp_num 960 last_change 1 owner 0
>> >>> pool 2 'rbd' rep size 2 min_size 1 crush_ruleset 2 object_hash rjenkins pg_num 960 pgp_num 960 last_change 1 owner 0
>> >>>
>> >>> From hdfs-site.xml:
>> >>>
>> >>>   <property>
>> >>>     <name>dfs.replication</name>
>> >>>     <value>1</value>
>> >>>   </property>
>> >>>
>> >>> On Tue, Jul 9, 2013 at 2:44 PM, Noah Watkins <noah.watkins@xxxxxxxxxxx> wrote:
>> >>>>
>> >>>> On Tue, Jul 9, 2013 at 12:35 PM, ker can <kercan74@xxxxxxxxx> wrote:
>> >>>> > hi Noah,
>> >>>> >
>> >>>> > while we're still on the hadoop topic ... I was also trying out the
>> >>>> > TestDFSIO tests, ceph vs. hadoop. The read tests on ceph take about
>> >>>> > 1.5x the hdfs time. The write tests are worse, about 2.5x the time
>> >>>> > on hdfs, but I guess we have additional journaling overheads for the
>> >>>> > writes on ceph. But there should be no such overheads for the reads?
>> >>>>
>> >>>> Out of the box Hadoop will keep 3 copies, and Ceph 2, so it could be
>> >>>> the case that reads are slower because there is less opportunity for
>> >>>> scheduling local reads. You can create a new pool with replication=3
>> >>>> and test this out (documentation on how to do this is on
>> >>>> http://ceph.com/docs/wip-hadoop-doc/cephfs/hadoop/).
>> >>>>
>> >>>> As for writes, Hadoop will write 2 remote and 1 local blocks, whereas
>> >>>> Ceph will write all copies remotely, so there is some overhead for the
>> >>>> extra remote object write (compared to Hadoop), but I wouldn't have
>> >>>> expected 2.5x. It might be useful to run dd or something like that on
>> >>>> Ceph to see if the numbers make sense, to rule out Hadoop as the
>> >>>> bottleneck.
>> >>>>
>> >>>> -Noah
>> >>>
>> >>>
>> >>
>
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com