> Is the JNI interface still an issue or have we moved past that ?

We haven't done much performance tuning with Hadoop, but I suspect that the
JNI interface is not a bottleneck. My very first thought about what might be
causing slow read performance is the read-ahead settings we use vs Hadoop.
Hadoop should be performing big, efficient, block-size reads and caching
these in each map task. However, I think we are probably doing lots of small
reads on demand. That would certainly hurt performance.

In fact, in CephInputStream.java I see we are doing buffer-sized reads,
which, at least in my tree, turn out to be 4096 bytes :)

So, there are two issues now. First, the C-Java barrier is being crossed a
lot (16K times for a 64MB block). That's probably not a huge overhead, but it
might be something. The second is read-ahead. I'm not sure how much
read-ahead the libcephfs client is performing, but the more round trips it's
doing, the more overhead we would incur. (A rough sketch of the kind of
read-ahead buffering I mean is appended after the thread.)

> thanks !
>
>
>
> On Tue, Jul 9, 2013 at 3:01 PM, ker can <kercan74@xxxxxxxxx> wrote:
>>
>> For this particular test I turned off replication for both hdfs and ceph.
>> So there is just one copy of the data lying around.
>>
>> hadoop@vega7250:~$ ceph osd dump | grep rep
>> pool 0 'data' rep size 1 min_size 1 crush_ruleset 0 object_hash rjenkins
>> pg_num 960 pgp_num 960 last_change 26 owner 0 crash_replay_interval 45
>> pool 1 'metadata' rep size 2 min_size 1 crush_ruleset 1 object_hash
>> rjenkins pg_num 960 pgp_num 960 last_change 1 owner 0
>> pool 2 'rbd' rep size 2 min_size 1 crush_ruleset 2 object_hash rjenkins
>> pg_num 960 pgp_num 960 last_change 1 owner 0
>>
>> From hdfs-site.xml:
>>
>> <property>
>>   <name>dfs.replication</name>
>>   <value>1</value>
>> </property>
>>
>>
>>
>>
>>
>> On Tue, Jul 9, 2013 at 2:44 PM, Noah Watkins <noah.watkins@xxxxxxxxxxx>
>> wrote:
>>>
>>> On Tue, Jul 9, 2013 at 12:35 PM, ker can <kercan74@xxxxxxxxx> wrote:
>>> > hi Noah,
>>> >
>>> > while we're still on the hadoop topic ... I was also trying out the
>>> > TestDFSIO tests ceph v/s hadoop. The read tests on ceph take about
>>> > 1.5x the hdfs time. The write tests are worse, about ... 2.5x the time
>>> > on hdfs, but I guess we have additional journaling overheads for the
>>> > writes on ceph. But there should be no such overheads for the read?
>>>
>>> Out of the box Hadoop will keep 3 copies, and Ceph 2, so it could be
>>> the case that reads are slower because there is less opportunity for
>>> scheduling local reads. You can create a new pool with replication=3
>>> and test this out (documentation on how to do this is at
>>> http://ceph.com/docs/wip-hadoop-doc/cephfs/hadoop/).
>>>
>>> As for writes, Hadoop will write two remote blocks and one local block,
>>> whereas Ceph will write all copies remotely, so there is some overhead
>>> for the extra remote object write (compared to Hadoop), but I wouldn't
>>> have expected 2.5x. It might be useful to run dd or something like that
>>> on Ceph to see if the numbers make sense, to rule out Hadoop as the
>>> bottleneck.
>>>
>>> -Noah
>>
>>

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
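
For what it's worth, here is a minimal sketch of the read-ahead buffering
described above: instead of crossing the C-Java barrier once per 4 KB
request, the stream pulls a larger chunk per native call and serves small
reads out of memory. With a 4 MB chunk, a 64 MB block needs roughly 16 native
calls instead of ~16K. The NativeReader interface and CHUNK_SIZE below are
assumptions for illustration only, not the actual CephFS Java binding.

```java
import java.io.IOException;

/**
 * Hypothetical stand-in for the libcephfs JNI binding; the real
 * CephFS Java API may look different.
 */
interface NativeReader {
    /** Reads up to len bytes at the current offset; returns bytes read or -1 at EOF. */
    int read(byte[] buf, int off, int len) throws IOException;
}

/**
 * Sketch of a read-ahead wrapper: one large native read fills an
 * in-memory chunk, and subsequent small reads are copied from it.
 */
class ReadAheadStream {
    private static final int CHUNK_SIZE = 4 * 1024 * 1024; // read-ahead size (assumed tunable)

    private final NativeReader nativeReader;
    private final byte[] chunk = new byte[CHUNK_SIZE];
    private int chunkPos = 0; // next unread byte in chunk
    private int chunkLen = 0; // number of valid bytes currently in chunk

    ReadAheadStream(NativeReader nativeReader) {
        this.nativeReader = nativeReader;
    }

    /** Serve a small read from the buffered chunk, refilling it when exhausted. */
    int read(byte[] dst, int off, int len) throws IOException {
        if (chunkPos >= chunkLen) {
            // Buffer exhausted: do one big native call instead of many small ones.
            chunkLen = nativeReader.read(chunk, 0, CHUNK_SIZE);
            chunkPos = 0;
            if (chunkLen <= 0) {
                return chunkLen; // propagate EOF (-1) or a zero-length read
            }
        }
        int n = Math.min(len, chunkLen - chunkPos);
        System.arraycopy(chunk, chunkPos, dst, off, n);
        chunkPos += n;
        return n;
    }
}
```

If the 4096-byte buffer seen in CephInputStream.java actually comes from
Hadoop's io.file.buffer.size setting, raising that value in core-site.xml
might be a cheaper first experiment than code changes, though that is an
assumption worth verifying against the tree.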