Hi Noah,
Some results for the read tests:

I set client_readahead_min=4194304, which is also the default for Hadoop's
dfs.datanode.readahead.bytes. I ran the DFSIO test 6 times each for HDFS,
Ceph with the default read-ahead, and Ceph with read-ahead = 4194304.
Setting the read-ahead in Ceph gave about a 10% overall improvement over
the default values. The HDFS average is only slightly better, but there
was a lot more run-to-run variation for HDFS; perhaps some caching is
going on there.
Seems like a good read-ahead value for the Ceph Hadoop client to use as a default!
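(For reference, a typical TestDFSIO invocation looks something like this;
a sketch, where the jar name, file count, and file size are placeholders
for whatever your Hadoop install uses:

  hadoop jar hadoop-*test*.jar TestDFSIO -read -nrFiles 10 -fileSize 1000

with -write instead of -read for the write runs.)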
I'll look at the DFS write tests later today. Any tuning suggestions you
can think of there? I was thinking of trying a larger journal and moving
the journaling to a separate disk. Anything else?
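For the journal changes, I had something like this in mind in ceph.conf
(a sketch; the path is a placeholder for a mount on a spare disk, and
osd journal size is in MB):

  [osd]
  osd journal = /mnt/journal-disk/osd.$id.journal   # hypothetical path on a separate disk
  osd journal size = 10240                          # 10 GB journal (value is in MB)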
For the HDFS DFSIO read test:
Average execution time: 258
Best execution time: 149
Worst execution time: 361

For Ceph with the default read-ahead setting:
Average execution time: 316
Best execution time: 296
Worst execution time: 358

For Ceph with read-ahead = 4194304:
Average execution time: 285
Best execution time: 277
Worst execution time: 294
Thanks!
On Wed, Jul 10, 2013 at 10:56 AM, Noah Watkins <noah.watkins@xxxxxxxxxxx> wrote:
Hey KC,
I wanted to follow up on this, but ran out of time yesterday. To set
the options in ceph.conf, you can do something like:
[client]
client readahead min = blah
client readahead max bytes = blah
client readahead max periods = blah
then just make sure that your client is pointing to a ceph.conf with
these settings.
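Alternatively, as mentioned below, the same options can be passed through
the "ceph.conf.options" property in Hadoop's core-site.xml; a sketch (the
exact syntax is in the wip-hadoop docs, and the value here is just the
read-ahead you were testing):

<property>
  <name>ceph.conf.options</name>
  <value>client_readahead_min=4194304</value>
</property>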
On Tue, Jul 9, 2013 at 4:32 PM, Noah Watkins <noah.watkins@xxxxxxxxxxx> wrote:
> Yes, the libcephfs client. You should be able to adjust the settings
> without changing any code. The settings should be adjustable either by
> setting the config options in ceph.conf, or using the
> "ceph.conf.options" settings in Hadoop's core-site.xml.
>
> On Tue, Jul 9, 2013 at 4:26 PM, ker can <kercan74@xxxxxxxxx> wrote:
>> Makes sense. I can try playing around with these settings. When you say
>> client, would this be libcephfs.so?
>>
>> On Tue, Jul 9, 2013 at 5:35 PM, Noah Watkins <noah.watkins@xxxxxxxxxxx> wrote:
>>>
>>> Greg pointed out the read-ahead client options. I would suggest
>>> fiddling with these settings. If things improve, we can put automatic
>>> configuration of these settings into the Hadoop client itself. At the
>>> very least, we should be able to see if it is the read-ahead that is
>>> causing performance problems.
>>>
>>> OPTION(client_readahead_min, OPT_LONGLONG, 128*1024) // readahead at _least_ this much.
>>> OPTION(client_readahead_max_bytes, OPT_LONGLONG, 0) //8 * 1024*1024
>>> OPTION(client_readahead_max_periods, OPT_LONGLONG, 4) // as multiple of file layout period (object size * num stripes)
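>>>
>>> (For scale: assuming the default 4 MB object size and a single-stripe
>>> layout, one layout period is 4 MB, so client_readahead_max_periods = 4
>>> caps readahead at about 4 * 4 MB = 16 MB.)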
>>>
>>> -Noah
>>>
>>>
>>> On Tue, Jul 9, 2013 at 3:27 PM, Noah Watkins <noah.watkins@xxxxxxxxxxx> wrote:
>>> >> Is the JNI interface still an issue or have we moved past that?
>>> >
>>> > We haven't done much performance tuning with Hadoop, but I suspect
>>> > that the JNI interface is not a bottleneck.
>>> >
>>> > My very first thought about what might be causing slow read
>>> > performance is the read-ahead settings we use vs Hadoop. Hadoop should
>>> > be performing big, efficient, block-size reads and caching these in
>>> > each map task. However, I think we are probably doing lots of small
>>> > reads on demand. That would certainly hurt performance.
>>> >
>>> > In fact, in CephInputStream.java I see we are doing buffer-sized
>>> > reads, which, at least in my tree, turn out to be 4096 bytes :)
>>> >
>>> > So, there are two issues now. First, the C-Java barrier is being
>>> > crossed a lot (16K times for a 64MB block). That's probably not a huge
>>> > overhead, but it might be something. The second is read-ahead. I'm not
>>> > sure how much read-ahead the libcephfs client is performing, but the
>>> > more round trips it's doing, the more overhead we would incur.
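>>> >
>>> > As a sketch of that read pattern (not the actual CephInputStream code;
>>> > the class and sizes are illustrative), the buffer size sets the number
>>> > of read() calls, and hence JNI crossings, per 64MB block:
>>> >
>>> >   import java.io.IOException;
>>> >   import java.io.InputStream;
>>> >
>>> >   class ReadPatternSketch {
>>> >       // Drain a stream using a fixed buffer size. For a 64 MB block,
>>> >       // a 4096-byte buffer means ~16384 read() calls, while a 4 MB
>>> >       // buffer means ~16.
>>> >       static long drain(InputStream in, int bufSize) throws IOException {
>>> >           byte[] buf = new byte[bufSize];
>>> >           long total = 0;
>>> >           for (int n; (n = in.read(buf, 0, buf.length)) > 0; )
>>> >               total += n;
>>> >           return total;
>>> >       }
>>> >   }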
>>> >
>>> >
>>> >>
>>> >> Thanks!
>>> >>
>>> >> On Tue, Jul 9, 2013 at 3:01 PM, ker can <kercan74@xxxxxxxxx> wrote:
>>> >>>
>>> >>> For this particular test I turned off replication for both HDFS and
>>> >>> Ceph, so there is just one copy of the data lying around.
>>> >>>
>>> >>> hadoop@vega7250:~$ ceph osd dump | grep rep
>>> >>> pool 0 'data' rep size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 960 pgp_num 960 last_change 26 owner 0 crash_replay_interval 45
>>> >>> pool 1 'metadata' rep size 2 min_size 1 crush_ruleset 1 object_hash rjenkins pg_num 960 pgp_num 960 last_change 1 owner 0
>>> >>> pool 2 'rbd' rep size 2 min_size 1 crush_ruleset 2 object_hash rjenkins pg_num 960 pgp_num 960 last_change 1 owner 0
>>> >>>
>>> >>> From hdfs-site.xml:
>>> >>>
>>> >>> <property>
>>> >>>   <name>dfs.replication</name>
>>> >>>   <value>1</value>
>>> >>> </property>
>>> >>>
>>> >>> On Tue, Jul 9, 2013 at 2:44 PM, Noah Watkins <noah.watkins@xxxxxxxxxxx> wrote:
>>> >>>>
>>> >>>> On Tue, Jul 9, 2013 at 12:35 PM, ker can <kercan74@xxxxxxxxx> wrote:
>>> >>>> > Hi Noah,
>>> >>>> >
>>> >>>> > While we're still on the Hadoop topic: I was also trying out the
>>> >>>> > TestDFSIO tests, Ceph vs. Hadoop. The read tests on Ceph take about
>>> >>>> > 1.5x the HDFS time. The write tests are worse, about 2.5x the time
>>> >>>> > on HDFS, but I guess we have additional journaling overhead for
>>> >>>> > writes on Ceph. But there should be no such overhead for reads?
>>> >>>>
>>> >>>> Out of the box Hadoop will keep 3 copies, and Ceph 2, so it could be
>>> >>>> the case that reads are slower because there is less opportunity for
>>> >>>> scheduling local reads. You can create a new pool with replication=3
>>> >>>> and test this out (documentation on how to do this is on
>>> >>>> http://ceph.com/docs/wip-hadoop-doc/cephfs/hadoop/).
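>>> >>>>
>>> >>>> Roughly along these lines (a sketch; the pool name and pg counts are
>>> >>>> placeholders):
>>> >>>>
>>> >>>>   ceph osd pool create hadoop3 960 960
>>> >>>>   ceph osd pool set hadoop3 size 3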
>>> >>>>
>>> >>>> As for writes, Hadoop will write 2 remote copies and 1 local copy of
>>> >>>> each block, whereas Ceph will write all copies remotely, so there is
>>> >>>> some overhead for the extra remote object write (compared to Hadoop),
>>> >>>> but I wouldn't have expected 2.5x. It might be useful to run dd or
>>> >>>> something like that on Ceph to see if the numbers make sense, to rule
>>> >>>> out Hadoop as the bottleneck.
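>>> >>>>
>>> >>>> For instance (a sketch; the path is a placeholder for wherever CephFS
>>> >>>> is mounted, and direct I/O sidesteps the page cache):
>>> >>>>
>>> >>>>   dd if=/dev/zero of=/mnt/ceph/ddtest bs=64M count=16 oflag=direct
>>> >>>>   dd if=/mnt/ceph/ddtest of=/dev/null bs=64M iflag=direct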
>>> >>>>
>>> >>>> -Noah
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com