On Tue, Jan 17, 2012 at 1:19 PM, Andrey Stepachev <octo47@xxxxxxxxx> wrote:
> 2012/1/17 Gregory Farnum <gregory.farnum@xxxxxxxxxxxxx>:
>> On Tue, Jan 17, 2012 at 11:37 AM, Andrey Stepachev <octo47@xxxxxxxxx> wrote:
>>> 2012/1/17 Sage Weil <sage@xxxxxxxxxxxx>:
>>>> On Mon, 16 Jan 2012, Andrey Stepachev wrote:
>>>>> Oops. It really is a buffer problem.
>>>>> It can be checked easily with ceph osd tell 4 bench:
>>>>>
>>>>> bench: wrote 1024 MB in blocks of 15625 KB in 17.115538 sec at 61264 KB/sec
>>>>> bench: wrote 1024 MB in blocks of 122 MB in 12.281531 sec at 85378 KB/sec
>>>>> bench: wrote 1024 MB in blocks of 244 MB in 13.529501 sec at 77502 KB/sec
>>>>> bench: wrote 3814 MB in blocks of 488 MB in 30.909198 sec at 123 MB/sec
>>>>>
>>>>> and in the last case dstat shows 'iozone-like iops':
>>>>>
>>>>> 100: 100: 100: 100| 0 0 94 6 0 0| 0 538M| 0 238
>>>>> 100: 100: 100: 100| 0 0 96 3 0 0| 0 525M| 0 133
>>>>> 100: 100: 100: 100| 0 2 95 3 0 0| 0 497M| 0 128
>>>>> 18.0:39.0:30.0:27.0| 0 3 96 2 0 0| 0 144M| 0 40.0
>>>>> 75.0:74.0:72.0:67.0| 0 3 95 2 0 0| 0 103M| 0 2698
>>>>> 100: 100: 100: 100| 0 13 83 4 0 0| 0 484M| 0 125
>>>>> 100: 100: 100: 100| 0 0 100 0 0 0| 0 486M| 0 124
>>>>> 100: 100: 100: 100| 0 3 88 9 0 0| 0 476M| 0 123
>>>>>
>>>>> Now the question arises: how can ceph be tuned to get such
>>>>> performance in normal operations, not just in bench?
>>>>
>>>> This may be related to how your OSD journaling is configured. I'm
>>>> guessing it's set to a file inside the btrfs volume holding the data?
>>>
>>> Yes. After some investigation I found that it is more related to the
>>> fact that FileStore processes incoming data in transactions.
>>> So we have many relatively small transactions (~1-4MB).
>>> By contrast, in hadoop we have no transactions; we have a
>>> simple stream of packets which are written to a file.
>> Ceph has a lot of design goals which mean it's not aimed quite as
>> squarely at MapReduce as HDFS is, thus these sorts of differences.
>
> Maybe in the future it will be possible to add a separate operation
> like STREAM especially for hadoop-like workloads.
> But I can't propose an idea right now for how to implement
> such a solution (given the existing transactional system).
>
>>
>> Can you make sure that your RADOS objects are 64MB in size? They
>> should be set to that, but it's possible something in the path got
>> broken. Otherwise we may need to set up configuration options for the
>> size of things to send over the wire... :/
>
> show_layout shows the expected layout.
>
> If I disable buffering in java, and rely only on ceph (and send
> by 10MB), I get logs like this:
>
> 000~10000000 = 10000000
> 2012-01-17 14:35:41.990987 7f4a6306c700 filestore(/data/osd.1) write 0.17b_head/10000003c90.00000000/head/2c91397b 20000000~10000000
> 2012-01-17 14:35:41.997599 7f4a6306c700 filestore(/data/osd.1) queue_flusher ep 5 fd 40 20000000~10000000 qlen 1
> 2012-01-17 14:35:41.997642 7f4a6306c700 filestore(/data/osd.1) write 0.17b_head/10000003c90.00000000/head/2c91397b 20000000~10000000 = 10000000
> 2012-01-17 14:35:42.192107 7f4a6286b700 filestore(/data/osd.1) write 0.17b_head/10000003c90.00000000/head/2c91397b 30000000~10000000
> 2012-01-17 14:35:42.197386 7f4a6286b700 filestore(/data/osd.1) queue_flusher ep 5 fd 39 30000000~10000000 qlen 1
> 2012-01-17 14:35:42.197415 7f4a6286b700 filestore(/data/osd.1) write 0.17b_head/10000003c90.00000000/head/2c91397b 30000000~10000000 = 10000000
> 2012-01-17 14:35:42.603251 7f4a6306c700 filestore(/data/osd.1) write 0.17b_head/10000003c90.00000000/head/2c91397b 40000000~10000000
>
> Also, it is easy to see that the buffers arrive with very close timestamps
> but are queued as separate ops, and FileStore issues one write for each such op
> (in some cases one do_transactions call processes 2-3 ops for one object).
> It looks like this could be optimized by collocating ops by object,
> but my lack of knowledge of ceph internals doesn't allow me
> to implement such a prototype right now.
>
> I could not find out how I can influence ceph to
> merge incoming packets into one op.

Merging ops like that in Ceph will be difficult (given its consistency
semantics), though probably not impossible. But I think it's a big hammer for
this particular problem, at least before we've tried some other things. :)
For instance, there are a number of config options you can try tuning.

journal_max_write_bytes: controls how much data will get sent into the OSD
journal in a single op. It defaults to 10<<20 (10MiB); you might try turning
it up.

client_oc_target_dirty: if the amount of dirty data on the client is larger
than this, it will try and send some pages out pretty much right away. It
defaults to 1024*1024*8 (ie, 8MB). This is one that I'd really recommend you
try turning up, along with the associated client_oc_max_dirty, which defaults
to 1024*1024*100.

You can pass these in either as command-line arguments
(--journal_max_write_bytes=n) or put them in the config files
("journal max write bytes = n"). Try changing them to a few different values
and see if that makes things better? :)
-Greg
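
As a concrete illustration of the config-file syntax described in the message
above, a ceph.conf fragment could look roughly like the sketch below. The
section placement ([osd] for the journal option, [client] for the object
cacher options) and the specific byte values are only assumptions for the sake
of the example, not tested recommendations:

    [osd]
        ; journal max write bytes defaults to 10<<20 (10MiB);
        ; raising it allows larger writes into the OSD journal
        journal max write bytes = 104857600    ; 100MiB, illustrative value

    [client]
        ; client oc target dirty defaults to 8MB; raising it lets the client
        ; accumulate more dirty data before it starts flushing
        client oc target dirty = 67108864      ; 64MiB, illustrative value
        ; client oc max dirty defaults to 100MB and caps the dirty client cache
        client oc max dirty = 536870912        ; 512MiB, illustrative value

The same options can also be passed on the command line with underscores,
e.g. --journal_max_write_bytes=104857600, as described in the message above.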