On Wed, Jul 10, 2013 at 9:17 AM, ker can <kercan74@xxxxxxxxx> wrote:
>
> Seems like a good read ahead value that the ceph hadoop client can use as a
> default !

Great, I'll add this tunable to the list of changes to be pushed into
the next release.

> I'll look at the DFS write tests later today ... any tuning suggestions you
> can think of there? I was thinking of trying out increasing the journal size
> and separating out the journaling to a separate disk. Anything else ?

I expect that you will see a lot of improvement by moving the journal
to a separate physical device, so I would start there. As for journal
size tuning, I'm not completely sure, but there may be an opportunity
to optimize for Hadoop workloads. I think ceph.com/docs has some
general guidelines. Maybe someone more knowledgeable than I am can
chime in on the trade-offs.

> For the hdfs dfsio read test:
>
> Average execution time: 258
> Best execution time: 149
> Worst execution time: 361
>
> For ceph with the default read ahead setting:
>
> Average execution time: 316
> Best execution time: 296
> Worst execution time: 358
>
> For ceph with read ahead setting = 4193404:
>
> Average execution time: 285
> Best execution time: 277
> Worst execution time: 294

This is looking pretty good. I'd really like to work on that best
execution time for Ceph. I wonder if there are any Hadoop profiling
tools... narrowing down where the time is being taken up would be
very useful.

Thanks!
Noah

> I didn't set max bytes ... I guess the default is zero, which means no max?
> I tried increasing the readahead max periods to 8 ... it didn't look like a
> good change.
>
> thanks !
>
> On Wed, Jul 10, 2013 at 10:56 AM, Noah Watkins <noah.watkins@xxxxxxxxxxx>
> wrote:
>>
>> Hey KC,
>>
>> I wanted to follow up on this, but ran out of time yesterday.
>> To set the options in ceph.conf you can do something like
>>
>> [client]
>> client readahead min = blah
>> client readahead max bytes = blah
>> client readahead max periods = blah
>>
>> then just make sure that your client is pointing at a ceph.conf with
>> these settings.
>>
>> On Tue, Jul 9, 2013 at 4:32 PM, Noah Watkins <noah.watkins@xxxxxxxxxxx>
>> wrote:
>> > Yes, the libcephfs client. You should be able to adjust the settings
>> > without changing any code. The settings should be adjustable either by
>> > setting the config options in ceph.conf, or by using the
>> > "ceph.conf.options" setting in Hadoop's core-site.xml.
>> >
>> > On Tue, Jul 9, 2013 at 4:26 PM, ker can <kercan74@xxxxxxxxx> wrote:
>> >> Makes sense. I can try playing around with these settings ... when
>> >> you say client, would this be libcephfs.so ?
>> >>
>> >> On Tue, Jul 9, 2013 at 5:35 PM, Noah Watkins <noah.watkins@xxxxxxxxxxx>
>> >> wrote:
>> >>>
>> >>> Greg pointed out the read-ahead client options. I would suggest
>> >>> fiddling with these settings. If things improve, we can put automatic
>> >>> configuration of these settings into the Hadoop client itself. At the
>> >>> very least, we should be able to see whether it is the read-ahead that
>> >>> is causing the performance problems.
>> >>>
>> >>> OPTION(client_readahead_min, OPT_LONGLONG, 128*1024)  // read ahead at
>> >>> _least_ this much
>> >>> OPTION(client_readahead_max_bytes, OPT_LONGLONG, 0)  // 8 * 1024*1024
>> >>> OPTION(client_readahead_max_periods, OPT_LONGLONG, 4)  // as a multiple
>> >>> of the file layout period (object size * num stripes)
>> >>>
>> >>> -Noah
>> >>>
>> >>> On Tue, Jul 9, 2013 at 3:27 PM, Noah Watkins <noah.watkins@xxxxxxxxxxx>
>> >>> wrote:
>> >>> >> Is the JNI interface still an issue, or have we moved past that ?
>> >>> >
>> >>> > We haven't done much performance tuning with Hadoop, but I suspect
>> >>> > that the JNI interface is not a bottleneck.
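[To put the readahead defaults quoted above in concrete terms: with
client_readahead_max_bytes left at 0, the cap is client_readahead_max_periods
times the file layout period. A back-of-envelope sketch; the 4 MB object size
and single stripe are assumptions, not values reported in this thread:]

```shell
# Illustrative arithmetic only -- the 4 MB object size and stripe count
# of 1 are assumptions, not values reported in this thread.
object_size=$((4 * 1024 * 1024))
num_stripes=1
layout_period=$((object_size * num_stripes))

readahead_min=$((128 * 1024))   # client_readahead_min default
max_periods=4                   # client_readahead_max_periods default
readahead_cap=$((max_periods * layout_period))

echo "min=${readahead_min} cap=${readahead_cap}"
# prints: min=131072 cap=16777216
```

[So under these assumptions the client would read ahead at least 128 KB and
at most 16 MB at a time, which lines up with the ~4 MB-scale readahead values
being tried in the tests above.]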
>> >>> >
>> >>> > My very first thought about what might be causing slow read
>> >>> > performance is the read-ahead settings we use versus Hadoop. Hadoop
>> >>> > should be performing big, efficient, block-size reads and caching
>> >>> > these in each map task. However, I think we are probably doing lots
>> >>> > of small reads on demand. That would certainly hurt performance.
>> >>> >
>> >>> > In fact, in CephInputStream.java I see we are doing buffer-sized
>> >>> > reads, which, at least in my tree, turn out to be 4096 bytes :)
>> >>> >
>> >>> > So, there are two issues now. First, the C-Java barrier is being
>> >>> > crossed a lot (16K times for a 64 MB block). That's probably not a
>> >>> > huge overhead, but it might be something. The second is read-ahead.
>> >>> > I'm not sure how much read-ahead the libcephfs client is performing,
>> >>> > but the more round trips it's doing, the more overhead we would
>> >>> > incur.
>> >>> >
>> >>> >> thanks !
>> >>> >>
>> >>> >> On Tue, Jul 9, 2013 at 3:01 PM, ker can <kercan74@xxxxxxxxx> wrote:
>> >>> >>>
>> >>> >>> For this particular test I turned off replication for both hdfs
>> >>> >>> and ceph, so there is just one copy of the data lying around.
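[As a sanity check on the "16K crossings" figure quoted above, the arithmetic
is simply the block size divided by the buffer size (illustrative only):]

```shell
# Number of buffer-sized reads needed to cover one 64 MB block at the
# 4096-byte buffer size observed in CephInputStream.java.
block_size=$((64 * 1024 * 1024))
buffer_size=4096
crossings=$((block_size / buffer_size))
echo "$crossings"
# prints: 16384
```

[Reading the same block with one 64 MB buffer would cross the JNI boundary
once instead of 16384 times, which is the motivation for larger reads or
read-ahead on the Ceph side.]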
>> >>> >>>
>> >>> >>> hadoop@vega7250:~$ ceph osd dump | grep rep
>> >>> >>> pool 0 'data' rep size 1 min_size 1 crush_ruleset 0 object_hash
>> >>> >>> rjenkins pg_num 960 pgp_num 960 last_change 26 owner 0
>> >>> >>> crash_replay_interval 45
>> >>> >>> pool 1 'metadata' rep size 2 min_size 1 crush_ruleset 1 object_hash
>> >>> >>> rjenkins pg_num 960 pgp_num 960 last_change 1 owner 0
>> >>> >>> pool 2 'rbd' rep size 2 min_size 1 crush_ruleset 2 object_hash
>> >>> >>> rjenkins pg_num 960 pgp_num 960 last_change 1 owner 0
>> >>> >>>
>> >>> >>> From hdfs-site.xml:
>> >>> >>>
>> >>> >>> <property>
>> >>> >>>   <name>dfs.replication</name>
>> >>> >>>   <value>1</value>
>> >>> >>> </property>
>> >>> >>>
>> >>> >>> On Tue, Jul 9, 2013 at 2:44 PM, Noah Watkins
>> >>> >>> <noah.watkins@xxxxxxxxxxx> wrote:
>> >>> >>>>
>> >>> >>>> On Tue, Jul 9, 2013 at 12:35 PM, ker can <kercan74@xxxxxxxxx>
>> >>> >>>> wrote:
>> >>> >>>> > hi Noah,
>> >>> >>>> >
>> >>> >>>> > while we're still on the hadoop topic ... I was also trying out
>> >>> >>>> > the TestDFSIO tests, ceph vs hadoop. The read tests on ceph
>> >>> >>>> > take about 1.5x the hdfs time. The write tests are worse ...
>> >>> >>>> > about 2.5x the time on hdfs, but I guess we have additional
>> >>> >>>> > journaling overheads for the writes on ceph. But there should
>> >>> >>>> > be no such overhead for the reads ?
>> >>> >>>>
>> >>> >>>> Out of the box Hadoop will keep 3 copies and Ceph 2, so it could
>> >>> >>>> be that reads are slower because there is less opportunity for
>> >>> >>>> scheduling local reads. You can create a new pool with
>> >>> >>>> replication=3 and test this out (documentation on how to do this
>> >>> >>>> is at http://ceph.com/docs/wip-hadoop-doc/cephfs/hadoop/).
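[For reference, creating a size-3 pool for the test suggested above might
look something like the following sketch. The pool name 'hadoop3' and the
PG count of 960 (chosen to match the pools in the dump above) are
assumptions on my part; see the linked docs for the authoritative steps.]

```shell
# Hypothetical sketch -- pool name and pg count are assumptions.
ceph osd pool create hadoop3 960
ceph osd pool set hadoop3 size 3
ceph osd dump | grep hadoop3   # verify the pool now shows 'rep size 3'
```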
>> >>> >>>>
>> >>> >>>> As for writes, Hadoop will write 2 remote blocks and 1 local
>> >>> >>>> block, whereas Ceph will write all copies remotely, so there is
>> >>> >>>> some overhead for the extra remote object write (compared to
>> >>> >>>> Hadoop), but I wouldn't have expected 2.5x. It might be useful
>> >>> >>>> to run dd or something like that on Ceph to see if the numbers
>> >>> >>>> make sense, to rule out Hadoop as the bottleneck.
>> >>> >>>>
>> >>> >>>> -Noah

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com