2012/1/17 Sage Weil <sage@xxxxxxxxxxxx>:
> On Mon, 16 Jan 2012, Andrey Stepachev wrote:
>> Oops. It really is a buffer problem.
>> It can easily be checked with ceph osd tell 4 bench:
>>
>> bench: wrote 1024 MB in blocks of 15625 KB in 17.115538 sec at 61264 KB/sec
>> bench: wrote 1024 MB in blocks of 122 MB in 12.281531 sec at 85378 KB/sec
>> bench: wrote 1024 MB in blocks of 244 MB in 13.529501 sec at 77502 KB/sec
>> bench: wrote 3814 MB in blocks of 488 MB in 30.909198 sec at 123 MB/sec
>>
>> and in the last case dstat shows 'iozone-like iops':
>>
>> 100: 100: 100: 100|  0   0  94   6   0   0|   0   538M|   0   238
>> 100: 100: 100: 100|  0   0  96   3   0   0|   0   525M|   0   133
>> 100: 100: 100: 100|  0   2  95   3   0   0|   0   497M|   0   128
>> 18.0:39.0:30.0:27.0|  0   3  96   2   0   0|   0   144M|   0  40.0
>> 75.0:74.0:72.0:67.0|  0   3  95   2   0   0|   0   103M|   0  2698
>> 100: 100: 100: 100|  0  13  83   4   0   0|   0   484M|   0   125
>> 100: 100: 100: 100|  0   0 100   0   0   0|   0   486M|   0   124
>> 100: 100: 100: 100|  0   3  88   9   0   0|   0   476M|   0   123
>>
>> Now the question arises: how can ceph be tuned to get such
>> performance in normal operation, not just in bench?
>
> This may be related to how your OSD journaling is configured.  I'm
> guessing it's set to a file inside the btrfs volume holding the data?

Yes. After some investigation I found that it is more related to the
fact that FileStore processes incoming data in transactions. So we have
many relatively small transactions (~1-4 MB). By contrast, in Hadoop we
have no transactions; we have a simple stream of packets which are
written to a file.

To achieve good performance we can:
a) of course, move the journal away (but I can't test such a
   configuration; my hardware is limited to one big RAID) -- see the
   ceph.conf sketch below;
b) think about the possibility of coalescing transactions for the same
   object into one bigger transaction -- a rough illustration of the
   idea follows the config sketch.

Even if we move the journal away we'll still write data in small pieces
(as the test shows, instead of tens or hundreds of megabytes we write
1-4 MB per request). I may be completely wrong, but what I see in the
code and in the logs suggests to me that I'm right.
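
For (a), a minimal ceph.conf sketch of what I mean, if anyone with
suitable hardware wants to try it (the device /dev/sde1 and host node4
are placeholders; I can't actually test this on my RAID):

[osd]
        # journal size in MB; larger absorbs bursts of small transactions
        osd journal size = 1024

[osd.4]
        host = node4
        # journal on a raw partition outside the btrfs data volume
        # (placeholder device name)
        osd journal = /dev/sde1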
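
For (b), a rough illustration of the idea, sketched on the client side
for simplicity. This is NOT FileStore code; the class name and the
buffer size are invented for the example. It just batches many small
writes into one large request before handing it to the underlying
stream:

import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Sketch only: batch many small writes into one large write, so the
// backend sees a few big requests instead of many 1-4 MB transactions.
public class CoalescingOutputStream extends FilterOutputStream {
    private final byte[] buf;
    private int pos = 0;

    public CoalescingOutputStream(OutputStream out, int bufSize) {
        super(out);
        this.buf = new byte[bufSize];   // e.g. 64 MB instead of 1-4 MB
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        if (len >= buf.length) {        // oversized write: bypass the buffer
            flushBuffer();
            out.write(b, off, len);
            return;
        }
        if (pos + len > buf.length)     // not enough room: emit one big write
            flushBuffer();
        System.arraycopy(b, off, buf, pos, len);
        pos += len;
    }

    @Override
    public void write(int b) throws IOException {
        if (pos == buf.length)
            flushBuffer();
        buf[pos++] = (byte) b;
    }

    private void flushBuffer() throws IOException {
        if (pos > 0) {
            out.write(buf, 0, pos);     // single large request to the backend
            pos = 0;
        }
    }

    @Override
    public void flush() throws IOException {
        flushBuffer();
        out.flush();
    }
}

The analogous change inside FileStore would be merging queued
transactions for the same object before they are submitted; the sketch
above only shows the principle.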

(BTW, did you miss my Hadoop-related patches, or is something wrong
with them?)

> sage
>
>>
>> 2012/1/16 Andrey Stepachev <octo47@xxxxxxxxx>:
>> > Hi all.
>> >
>> > Last week I investigated the status of Hadoop on Ceph.
>> > I created some patches to fix some bugs and crashes.
>> > It looks like it works. Even HBase works on top.
>> >
>> > For reference, all sources and patches are here:
>> >
>> > https://github.com/octo47/hadoop-common/tree/branch-1.0-ceph
>> > https://github.com/octo47/ceph/tree/v0.40-hadoop
>> >
>> > After YCSB and TestDFSIO ran without crashes I started investigating
>> > performance.
>> >
>> > I have a 5-node cluster with 4 SATA disks each. btrfs. 24 cores on
>> > each node. RAID. iozone shows up to 520 MB/s.
>> >
>> > Performance differs by 2-3x. After some tests I see a strange thing:
>> > hadoop uses the disk very much like iozone: a small number of iops
>> > and high throughput (same as iozone).
>> > ceph uses it very inefficiently: a huge number of iops and up to 3
>> > times less throughput (I think because of the high number of iops).
>> >
>> > hadoop dstat output:
>> > sda--sdb--sdc--sdd- ----total-cpu-usage---- -dsk/total- --io/total-
>> > util:util:util:util|usr sys idl wai hiq siq| read  writ| read  writ
>> > 100: 100: 100: 100|  1   5  83  11   0   0|   0   529M|   0   247
>> > 100: 100: 100: 100|  1   0  83  16   0   0|   0   542M|   0   168
>> > 100: 100: 100: 100|  1   0  81  18   0   0|  28k  518M|6.00   149
>> > 100: 100: 100: 100|  1   4  77  17   0   0|   0   533M|   0   243
>> > 100: 100: 100: 100|  1   3  83  13   0   0|   0   523M|   0   264
>> >
>> > ceph dstat output:
>> > ===================================================
>> > sda--sdb--sdc--sdd- ----total-cpu-usage---- -dsk/total- --io/total-
>> > util:util:util:util|usr sys idl wai hiq siq| read  writ| read  writ
>> > 68.0:70.0:79.0:76.0|  1   2  93   4   0   0|   0   195M|   0  1723
>> > 86.0:85.0:93.0:91.0|  1   2  91   5   0   0|   0   226M|   0  1816
>> > 85.0:85.0:85.0:84.0|  1   3  92   4   0   0|   0   235M|   0  2316
>> >
>> >
>> > So, my question is, can someone point me:
>> > a) can it be because of an inefficient buffer size on the OSD side?
>> >    (I tried increasing the CephOutputStream buffer to 256 KB; it
>> >    doesn't help.)
>> > b) what other problems could there be, and what options can I tune
>> >    to find out what is going on?
>> >
>> > PS: I can't use iozone on a kernel-mounted fs; something hangs in
>> > the kernel and only a reboot helps. In /var/log/messages I see the
>> > attached kern.log.
>> >
>> >
>> >
>> > --
>> > Andrey.
>>
>>
>>
>> --
>> Andrey.

--
Andrey.